Thread: DISTINCT and ORDER BY in aggregate functions on expressions(!)

DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

mirabilos

Date:

27 February 2023, 23:11:49

Hi,

I’ve got a… rather large query (see below), in which I join a complex
data structure (whose exact contents do not, at this point, matter)
together to get some auxiliary data to expose as JSON field.

In this query I can use, for example…

jsonb_build_object('user_permissions',
    jsonb_agg(DISTINCT ap.name ORDER BY ap.name))

… to get a distinct, sorted, list of “user permissions” from a table
ap which I joined to the user table which is the main subject of the
query. (For some reason I need to add DISTINCT because else duplica‐
tes are shown.)

Wrapping this as…

jsonb_build_object('user_permissions',
    COALESCE(
    jsonb_agg(DISTINCT ap.name ORDER BY ap.name)
    FILTER (WHERE ap.id IS NOT NULL)))

… gets me the JSON object’s value for the user_permissions set to
null if there’s nothing in the m:n intermediate table for the user
in question.

This works well. However, what I seem to be not allowed to do is
(without the extra COALESCE, to simplify):

jsonb_build_object('opening_times',
    jsonb_agg(DISTINCT jsonb_build_object(
    'weekday', cot.weekday,
    'from_hour', cot.from_hour,
    'to_hour', cot.to_hour)
    ORDER BY cot.weekday, cot.from_hour, cot.to_hour)
)

This is because, when I use DISTINCT (but only then‽), the ORDER BY
arguments must be… arguments to the function, or something. In the
above case, the sort key is ap.name which is also the argument to
the jsonb_agg function, so no problem there, but here, the jsonb_agg
argument is the return value of a function so… it has no name.

What I’m looking for is something like…

jsonb_build_object('opening_times',
    jsonb_agg(DISTINCT jsonb_build_object(
    'weekday', cot.weekday,
    'from_hour', cot.from_hour,
    'to_hour', cot.to_hour) AS jbo
    ORDER BY jbo->>'weekday', jbo->>'from_hour', jbo->>'to_hour')
)

… except I cannot define aliases in that place. Any other syntax
would also work.

The suggested solution for this is apparently to do…

CREATE OR REPLACE FUNCTION core_openingtime_jsonb_sort(JSONB)
    RETURNS JSONB AS $$
SELECT jsonb_agg(e ORDER BY e->>'weekday', e->>'from_hour', e->>'to_hour')
    FROM jsonb_array_elements($1) AS e
$$ LANGUAGE SQL IMMUTABLE;

… and then query as…

jsonb_build_object('opening_times',
    core_openingtime_jsonb_sort(jsonb_agg(DISTINCT jsonb_build_object(
    'weekday', cot.weekday,
    'from_hour', cot.from_hour,
    'to_hour', cot.to_hour)))
)

… which involves internally subquerying for each output row (i.e.
row of the user table) times amount of sub-JSONArrays that need
to be sorted like this, which is currently 3.

All other solutions I can find involve subqueries in the first
place; I am somewhat proud I even managed to write this with
JOINs and without any subqueries in the first place so I’m
hesitant to go that route.

Any advice here?

And, somewhat related: when outputting JSONB, the order of
JSONObject elements is indeterminate, which is… okay, but
forcing sorted (ASCIIbetically, i.e. by codepoint order)
keys would be very very welcome, for reproducibility of
the output. (I’m sure the reproducible-builds project would
also love if this could be changed, or at least added, in a
way it can be enabled for queries, as session parameter
perhaps?)

The query in its current incarnanation is as follows:

-- -----BEGIN SQL-----
CREATE OR REPLACE FUNCTION core_openingtime_jsonb_sort(JSONB)
    RETURNS JSONB AS $$
SELECT jsonb_agg(e ORDER BY e->>'weekday', e->>'from_hour', e->>'to_hour')
    FROM jsonb_array_elements($1) AS e
$$ LANGUAGE SQL IMMUTABLE;

CREATE OR REPLACE FUNCTION core_generalworkavailability_jsonb_sort(JSONB)
    RETURNS JSONB AS $$
SELECT jsonb_agg(e ORDER BY e->>'weekday', e->>'forenoon', e->>'afternoon', e->>'evening')
    FROM jsonb_array_elements($1) AS e
$$ LANGUAGE SQL IMMUTABLE;

COPY (SELECT cp.email, cp.first_name, cp.last_name, cp.street, cp.number,
    jsonb_build_object('groups', COALESCE(jsonb_agg(DISTINCT ag.name ORDER BY ag.name)
        FILTER (WHERE ag.id IS NOT NULL))) ||
    jsonb_build_object('help_operations', COALESCE(jsonb_agg(DISTINCT cho.name ORDER BY cho.name)
        FILTER (WHERE cho.id IS NOT NULL))) ||
    jsonb_build_object('emergency_opening_times', COALESCE(
        core_openingtime_jsonb_sort(jsonb_agg(DISTINCT jsonb_build_object(
        'weekday', ceot.weekday,
        'from_hour', ceot.from_hour,
        'to_hour', ceot.to_hour))
        FILTER (WHERE ceot.id IS NOT NULL)))) ||
    jsonb_build_object('opening_times', COALESCE(
        core_openingtime_jsonb_sort(jsonb_agg(DISTINCT jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour))
        FILTER (WHERE cot.id IS NOT NULL)))) ||
    jsonb_build_object('possible_work_times', COALESCE(
        core_generalworkavailability_jsonb_sort(jsonb_agg(DISTINCT jsonb_build_object(
        'weekday', cgwa.weekday,
        'forenoon', cgwa.forenoon,
        'afternoon', cgwa.afternoon,
        'evening', cgwa.evening))
        FILTER (WHERE cgwa.id IS NOT NULL)))) ||
    jsonb_build_object('qualifications',
     jsonb_build_object('administrative', COALESCE(jsonb_agg(DISTINCT cqa.name ORDER BY cqa.name)
        FILTER (WHERE cqa.id IS NOT NULL))) ||
     jsonb_build_object('health', COALESCE(jsonb_agg(DISTINCT cqh.name ORDER BY cqh.name)
        FILTER (WHERE cqh.id IS NOT NULL))) ||
     jsonb_build_object('language', COALESCE(jsonb_agg(DISTINCT cqlang.name ORDER BY cqlang.name)
        FILTER (WHERE cqlang.id IS NOT NULL))) ||
     jsonb_build_object('license', COALESCE(jsonb_agg(DISTINCT cqlic.name ORDER BY cqlic.name)
        FILTER (WHERE cqlic.id IS NOT NULL))) ||
     jsonb_build_object('technical', COALESCE(jsonb_agg(DISTINCT cqt.name ORDER BY cqt.name)
        FILTER (WHERE cqt.id IS NOT NULL)))) ||
    jsonb_build_object('restrictions', COALESCE(jsonb_agg(DISTINCT cr.name ORDER BY cr.name)
        FILTER (WHERE cr.id IS NOT NULL))) ||
    jsonb_build_object('user_permissions', COALESCE(jsonb_agg(DISTINCT ap.name ORDER BY ap.name)
        FILTER (WHERE ap.id IS NOT NULL))) AS "other_data"
    FROM core_person cp
    LEFT JOIN core_person_emergency_opening_times cpeot ON cpeot.person_id=cp.id
    LEFT JOIN core_openingtime ceot ON ceot.id=cpeot.openingtime_id
    LEFT JOIN core_person_groups cpg ON cpg.person_id=cp.id
    LEFT JOIN auth_group ag ON ag.id=cpg.group_id
    LEFT JOIN core_person_help_operations cpho ON cpho.person_id=cp.id
    LEFT JOIN core_helpoperation cho ON cho.id=cpho.helpoperation_id
    LEFT JOIN core_person_opening_times cpot ON cpot.person_id=cp.id
    LEFT JOIN core_openingtime cot ON cot.id=cpot.openingtime_id
    LEFT JOIN core_person_possible_work_times cppwt ON cppwt.person_id=cp.id
    LEFT JOIN core_generalworkavailability cgwa ON cgwa.id=cppwt.generalworkavailability_id
    LEFT JOIN core_person_qualifications_administrative cpqa ON cpqa.person_id=cp.id
    LEFT JOIN core_qualificationadministrative cqa ON cqa.id=cpqa.qualificationadministrative_id
    LEFT JOIN core_person_qualifications_health cpqh ON cpqh.person_id=cp.id
    LEFT JOIN core_qualificationhealth cqh ON cqh.id=cpqh.qualificationhealth_id
    LEFT JOIN core_person_qualifications_language cpqlang ON cpqlang.person_id=cp.id
    LEFT JOIN core_qualificationlanguage cqlang ON cqlang.id=cpqlang.qualificationlanguage_id
    LEFT JOIN core_person_qualifications_license cpqlic ON cpqlic.person_id=cp.id
    LEFT JOIN core_qualificationlicense cqlic ON cqlic.id=cpqlic.qualificationlicense_id
    LEFT JOIN core_person_qualifications_technical cpqt ON cpqt.person_id=cp.id
    LEFT JOIN core_qualificationtechnical cqt ON cqt.id=cpqt.qualificationtechnical_id
    LEFT JOIN core_person_restrictions cpr ON cpr.person_id=cp.id
    LEFT JOIN core_restriction cr ON cr.id=cpr.restriction_id
    LEFT JOIN core_person_user_permissions cpup ON cpup.person_id=cp.id
    LEFT JOIN auth_permission ap ON ap.id=cpup.permission_id
    GROUP BY cp.email, cp.first_name, cp.last_name, cp.street, cp.number
    ORDER BY cp.email
) TO STDOUT WITH (FORMAT csv, HEADER, FORCE_QUOTE *, ENCODING 'UTF-8');

DROP FUNCTION core_openingtime_jsonb_sort(JSONB);
DROP FUNCTION core_generalworkavailability_jsonb_sort(JSONB);
-- -----END SQL-----

Thanks in advance,
//mirabilos
--
„Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund,
mksh auf jedem System zu installieren.“
    -- XTaran auf der OpenRheinRuhr, ganz begeistert
(EN: “[…]uhr.gz is a reason to install mksh on every system.”)

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

Tom Lane

Date:

28 February 2023, 00:11:05

mirabilos <tg@evolvis.org> writes:
> This works well. However, what I seem to be not allowed to do is
> (without the extra COALESCE, to simplify):
> ...
> This is because, when I use DISTINCT (but only then‽), the ORDER BY
> arguments must be… arguments to the function, or something.

Well, yeah.  Simplify it to

  SELECT array_agg(DISTINCT x ORDER BY y) FROM mytable;

If there are several rows containing the same value of x and different
values of y, which y value are we supposed to sort the unique-ified x
value by?  It's an ill-defined query.

For the specific example you give, it's true that any specific
possible output of jsonb_build_object() would correspond to
a unique set of cot.weekday, cot.from_hour, cot.to_hour values.
But the parser can't be expected to know that.  Many functions
can produce the same output for different sets of inputs.

I'd suggest moving the distinct-ification into an earlier
processing step (i.e. a sub-select), or maybe thinking harder
about why you're getting duplicates in the first place.

            regards, tom lane

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

"David G. Johnston"

Date:

28 February 2023, 00:22:46

On Mon, Feb 27, 2023 at 4:11 PM mirabilos <tg@evolvis.org> wrote:

jsonb_build_object('opening_times',
jsonb_agg(DISTINCT jsonb_build_object(
'weekday', cot.weekday,
'from_hour', cot.from_hour,
'to_hour', cot.to_hour)
ORDER BY cot.weekday, cot.from_hour, cot.to_hour)
)

This is because, when I use DISTINCT (but only then‽), the ORDER BY
arguments must be… arguments to the function, or something.

So long as the function call itself is at least stable...:

DISTINCT func_call(...) ORDER BY func_call(...)

Order By is evaluated AFTER and over the contents of the distinct-ified expression

All other solutions I can find involve subqueries in the first
place; I am somewhat proud I even managed to write this with
JOINs and without any subqueries in the first place so I’m
hesitant to go that route.

That pride seems misplaced. Related to Tom's comment, the presence of the DISTINCTs is telling you that what you did is not good. DISTINCT is almost always a code smell, and given the prevalence of direct table joins in your query, it is indeed a valid signal.

Lastly, if you do need to care about normalizing the output of JSON you should consider writing a function that takes arbitrary json input and reformats it, rather than trying to build up json from scratch where every individual component needs to be aware and take action. i.e., get rid of the ORDER BY also. Maybe this belongs in an application layer with tooling that already provides this capability.

David J.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

"David G. Johnston"

Date:

28 February 2023, 00:44:47

On Mon, Feb 27, 2023 at 5:22 PM David G. Johnston <david.g.johnston@gmail.com> wrote:

Lastly, if you do need to care about normalizing the output of JSON you should consider writing a function that takes arbitrary json input and reformats it, rather than trying to build up json from scratch where every individual component needs to be aware and take action. i.e., get rid of the ORDER BY also. Maybe this belongs in an application layer with tooling that already provides this capability.

Just to clarify/correct myself - if building up json arrays you'll just remove the distinct and then do a normal aggregate order by based upon whatever ordering the base data presents. I was thinking you were doing order by to get keys in order (i.e., jsonb_object_agg) but that isn't the case here.

David J.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

Thorsten Glaser

Date:

28 February 2023, 01:22:47

On Mon, 27 Feb 2023, Tom Lane wrote:

>Well, yeah.  Simplify it to
>
>  SELECT array_agg(DISTINCT x ORDER BY y) FROM mytable;

That’s… a bit too simple for this case.

>If there are several rows containing the same value of x and different
>values of y, which y value are we supposed to sort the unique-ified x
>value by?  It's an ill-defined query.

The problem here is that I do not have an ‘x’.

>For the specific example you give, it's true that any specific
>possible output of jsonb_build_object() would correspond to
>a unique set of cot.weekday, cot.from_hour, cot.to_hour values.

Not necessarily (see directly below), but why would that matter?
It should sort the generated JSON objects within the array.

>, or maybe thinking harder
>about why you're getting duplicates in the first place.

The application developer informed me that it’s entirely possible
that some user entered matching information twice. I don’t have
the exact query that produced duplicates easily in the history
(the downside of working with \i) and tests on other users didn’t
produce duplicates.

So, yes, filtering them out is indeed part of the task here.

On Mon, 27 Feb 2023, David G. Johnston wrote:

>So long as the function call itself is at least stable...:
>
>DISTINCT func_call(...) ORDER BY func_call(...)

aieee really?

(I’d hope jsonb_build_object to be.)

Is that better or worse than using the extra functions to sort…?

>Order By is evaluated AFTER and over the contents of the distinct-ified
>expression

That’s right and good, but the problem is that I do not seem to
have a syntax with which to refer to the distinct-ified expression
to use in the ORDER BY clause.

>> place; I am somewhat proud I even managed to write this with
>> JOINs and without any subqueries in the first place so I’m
>> hesitant to go that route.
>
>That pride seems misplaced.  Related to Tom's comment, the presence of the
>DISTINCTs is telling you that what you did is not good.  DISTINCT is almost
>always a code smell

Not about the DISTINCTs. I haven’t used JOIN much (usually WHERE
as inner join) nor on tables this massive, and this is my second
foray into aggregate functions only.

>Lastly, if you do need to care about normalizing the output of JSON you
>should consider writing a function that takes arbitrary json input and
>reformats it, rather than trying to build up json from scratch where every
>individual component needs to be aware and take action.  i.e., get rid of
>the ORDER BY also.  Maybe this belongs in an application layer with tooling
>that already provides this capability.

Funnily enough, both here and in the other place where I tried to
use JSON output, PostgreSQL (with COPY TO STDOUT) *is* the application
layer. Here I’m generating a CSV file; in the other situation I was
going to put the generated JSON directly into an HTTP result filehandle.

In the latter, I dropped that approach, output CSV and converted that
(by replacing newlines with “],[” and prepending “[[” and appending
“]]”) to JSON myself, which worked there as it was all-numeric. But
the frustration there was about unnecessary whitespace instead.

Both cases have in common that a, possibly huge, result set can be
directly streamed from PostgreSQL to the consumer, but the former
lacks just that tiny bit of functionality that would make it really
rock :/

I was asking here because perhaps either that missing functionality
can be considered, or to find out if there’s better ways to produce
that output, due to my inexperience with SQL. The ways I’m using do
work, and I’m relatively happy, but…

bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

Tom Lane

Date:

28 February 2023, 01:32:58

Thorsten Glaser <tg@evolvis.org> writes:
> On Mon, 27 Feb 2023, Tom Lane wrote:
>> Well, yeah.  Simplify it to
>> SELECT array_agg(DISTINCT x ORDER BY y) FROM mytable;

> That’s… a bit too simple for this case.

Sure, I was just trying to explain the rule.

>> For the specific example you give, it's true that any specific
>> possible output of jsonb_build_object() would correspond to
>> a unique set of cot.weekday, cot.from_hour, cot.to_hour values.

> Not necessarily (see directly below), but why would that matter?
> It should sort the generated JSON objects within the array.

Well, that may be what you want, but it's not what you wrote in
the query.  Follow David's advice and do

jsonb_agg(DISTINCT jsonb_build_object(
    'weekday', cot.weekday,
    'from_hour', cot.from_hour,
    'to_hour', cot.to_hour)
          ORDER BY jsonb_build_object(
    'weekday', cot.weekday,
    'from_hour', cot.from_hour,
    'to_hour', cot.to_hour))

I'm pretty sure that this will only incur one evaluation of the
common subexpression, so even though it's tedious to type it's not
inefficient.

            regards, tom lane

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)

From

"David G. Johnston"

Date:

28 February 2023, 01:48:42

On Mon, Feb 27, 2023 at 6:22 PM Thorsten Glaser <tg@evolvis.org> wrote:

Not about the DISTINCTs. I haven’t used JOIN much (usually WHERE
as inner join) nor on tables this massive, and this is my second
foray into aggregate functions only.

Fair.

Consider this then as a jumping point to a more precise query form:

https://dbfiddle.uk/wz2MKtuF

create table base (base_id integer primary key, base_val text not null);

insert into base values (1, 'one');

create table subone (subone_id serial primary key, base_id integer, subone_value text not null);

insert into subone (base_id, subone_value) values (1, 'subone-one'), (1, 'subone-two');

create table subtwo (subtwo_id serial primary key, base_id integer, subtwo_value text not null);

insert into subtwo (base_id, subtwo_value) values (1, 'subtwo-one'), (1, 'subtwo-two');

--BAD cross joining going on with multiple one-to-many relationships

select * from base join subone using (base_id) join subtwo using (base_id); -- not good

--GOOD, only joining one-to-one relationships

select jsonb_build_object('base_id', base_id, 'subone_arr', subone_arr, 'subtwo_arr', subtwo_arr)
from base
join lateral (select json_agg(jsonb_build_object('key', subone_value) order by subone_value)
from subone where subone.base_id = base.base_id) as so (subone_arr) on true
join (select base_id, json_agg(subtwo_value order by subtwo_value desc)
from subtwo group by base_id) as st (base_id, subtwo_arr) using (base_id)

I used a mix of forms in the two joins, and there are other variants, but the basic concept holds - produce single rows in subqueries then join those various single rows together to produce your desired json output.

David J.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Thorsten Glaser

Date:

28 February 2023, 02:54:58

On Mon, 27 Feb 2023, David G. Johnston wrote:

>Consider this then as a jumping point to a more precise query form:
[…]
>the basic concept holds - produce single rows in subqueries then join those
>various single rows together to produce your desired json output.

Ouch. I’ll have to read up and experiment with that, I guess.
But wouldn’t I rather then do a CTA for each referenced table
that does the aggregating and GROUP BY person_id, then join that
(which has single rows for each user row) to users?

>--BAD cross joining going on with multiple one-to-many relationships
>--GOOD, only joining one-to-one relationships

Not quite. This is many-to-many created by Django…

On Mon, 27 Feb 2023, Tom Lane wrote:

>Sure, I was just trying to explain the rule.

Aaah, okay. Sorry, I misunderstood that, and the… general direction
of the replies, then.

>Well, that may be what you want, but it's not what you wrote in
>the query.  Follow David's advice and do
[…]
>I'm pretty sure that this will only incur one evaluation of the
>common subexpression, so even though it's tedious to type it's not
>inefficient.

Thanks. But I fear it’s not as simple as you wrote. More like:

jsonb_build_object('opening_times',
    jsonb_agg(DISTINCT jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour)
    ORDER BY
        jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour)->>'weekday',
        jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour)->>'from_hour',
        jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour)->>'to_hour')
)

Isn’t that more like it?

(Though I guess at that point I could just drop at least
the to_hour fallback sort, hoping nobody inserts overlapping
hours with indentical start times…)

Now that I see that, it sort of re-raises a question I had
during developing this but didn’t pursue.

How about, instead of creating a JSONB object here, I create
a (not-JSON) ARRAY['weekday', cot.weekday, 'from_hour',
cot.from_hour, 'to_hour', cot.to_hour] in the aggregate function.
The array is something I could naturally use to sort (its elements
already are in sort key order), and its elements *also* are in
the order jsonb_build_object expects its arguments (assuming I can
do something like passing the array to it instead of “spreading”
the arguments). Pseudocode:

… hmm, not that easy. The aggregate would return e.g. this…

ARRAY[['weekday',1,…],['weekday',2,…]]

… and there’s no array_map function that could be used to pass
each inner array, one by one, to jsonb_build_object; converting
the outer array to JSON gets us json_array_elements (same problem),
other ways I can think of also don’t seem to have matching functions
(jq has… interesting ones).

As I write this, I fear that won’t fly because nōn-JSON arrays
cannot contain mixed types (text and numbers) in PostgreSQL… at
least I ran into that at some point in the past vaguely remembering…

Oh well,
//mirabilos
--
Solange man keine schmutzigen Tricks macht, und ich meine *wirklich*
schmutzige Tricks, wie bei einer doppelt verketteten Liste beide
Pointer XORen und in nur einem Word speichern, funktioniert Boehm ganz
hervorragend.        -- Andreas Bogk über boehm-gc in d.a.s.r

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Alban Hertroys

Date:

28 February 2023, 08:10:54

> On 28 Feb 2023, at 3:54, Thorsten Glaser <tg@evolvis.org> wrote:

(…)

>> Well, that may be what you want, but it's not what you wrote in
>> the query.  Follow David's advice and do
> […]
>> I'm pretty sure that this will only incur one evaluation of the
>> common subexpression, so even though it's tedious to type it's not
>> inefficient.
>
> Thanks. But I fear it’s not as simple as you wrote. More like:
>
> jsonb_build_object('opening_times',
>     jsonb_agg(DISTINCT jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)
>     ORDER BY
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'weekday',
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'from_hour',
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'to_hour')
> )
>
> Isn’t that more like it?


Perhaps you can use a lateral cross join to get the result of jsonb_build_object as a jsonb value to pass around?


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Thorsten Glaser

Date:

02 March 2023, 23:02:06

On Tue, 28 Feb 2023, Alban Hertroys wrote:

>Perhaps you can use a lateral cross join to get the result of
>jsonb_build_object as a jsonb value to pass around?

I don’t see how. (But then I’ve not yet worked with lateral JOINs.)

But I c̲a̲n̲ just generate the objects first, I t̲h̲i̲n̲k̲, given one
of them corresponds to exactly one of the rows of an m:n-linked
table and nothing else. Something like…

WITH
    cgwaj AS (
    SELECT cgwa.id AS id, jsonb_build_object(
        'weekday', cgwa.weekday,
        'forenoon', cgwa.forenoon,
        'afternoon', cgwa.afternoon,
        'evening', cgwa.evening) AS obj
    FROM core_generalworkavailability cgwa
    ),
    -- … same for opening times
SELECT cp.email, …,
    -- …
    jsonb_build_object('possible_work_times', COALESCE(
        jsonb_agg(DISTINCT cgwaj.obj ORDER BY cgwaj.obj->>'weekday',
        cgwaj.obj->>'forenoon', cgwaj.obj->>'afternoon',
        cgwaj.obj->>'evening')
        FILTER (WHERE cgwaj.id IS NOT NULL))) ||
    -- …
    FROM core_person cp
    -- …
    LEFT JOIN core_person_possible_work_times cppwt ON cppwt.person_id=cp.id
    LEFT JOIN cgwaj ON cgwaj.id=cppwt.generalworkavailability_id
    -- …

That is, add a CTE for each m:n-attached table whose “value” is
an object, not a single field, keep the id field; LEFT JOIN that
(instead of the original table), then we have a field to use in
ORDER BY.

I think. I’ve not yet tried it (I don’t have access to that DB
normally, I was just helping out).

This avoids sub-SELECTs in the sense of needing to run one for
each user row, because the innermost JSON object building needs
to be done for each (connected (if the query is not filtering on
specific users)) row of the “property table”, anyway. (And even
if filtered, that can be passed down.)

bye,
//mirabilos
--
Solange man keine schmutzigen Tricks macht, und ich meine *wirklich*
schmutzige Tricks, wie bei einer doppelt verketteten Liste beide
Pointer XORen und in nur einem Word speichern, funktioniert Boehm ganz
hervorragend.        -- Andreas Bogk über boehm-gc in d.a.s.r

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Alban Hertroys

Date:

03 March 2023, 08:31:03

> On 3 Mar 2023, at 0:02, Thorsten Glaser <tg@evolvis.org> wrote:
>
> On Tue, 28 Feb 2023, Alban Hertroys wrote:
>
>> Perhaps you can use a lateral cross join to get the result of
>> jsonb_build_object as a jsonb value to pass around?
>
> I don’t see how. (But then I’ve not yet worked with lateral JOINs.)

You posted this bit:

> jsonb_build_object('opening_times',
>     jsonb_agg(DISTINCT jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)
>     ORDER BY
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'weekday',
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'from_hour',
>         jsonb_build_object(
>         'weekday', cot.weekday,
>         'from_hour', cot.from_hour,
>         'to_hour', cot.to_hour)->>'to_hour')
> )


You can rewrite that into something like this:

select jsonb_build_object('opening_times’,
    obj
    ORDER BY
        obj->>'weekday’,
        obj->>'from_hour’,
        obj->>'to_hour')
)
from cot
cross join lateral jsonb_agg(jsonb_build_object(
        'weekday', cot.weekday,
        'from_hour', cot.from_hour,
        'to_hour', cot.to_hour) obj

That’s off the top of my head and I did leave out the DISTINCT. Just to show the concept here. A bit of experimenting
andreading should get you there, I’m keeping $work waiting :P 

(…)

> WITH
>    cgwaj AS (
>     SELECT cgwa.id AS id, jsonb_build_object(
>         'weekday', cgwa.weekday,
>         'forenoon', cgwa.forenoon,
>         'afternoon', cgwa.afternoon,
>         'evening', cgwa.evening) AS obj
>     FROM core_generalworkavailability cgwa
>    ),
>    -- … same for opening times
> SELECT cp.email, …,
>     -- …
>     jsonb_build_object('possible_work_times', COALESCE(
>         jsonb_agg(DISTINCT cgwaj.obj ORDER BY cgwaj.obj->>'weekday',
>         cgwaj.obj->>'forenoon', cgwaj.obj->>'afternoon',
>         cgwaj.obj->>'evening')
>         FILTER (WHERE cgwaj.id IS NOT NULL))) ||
>     -- …
>    FROM core_person cp
>     -- …
>     LEFT JOIN core_person_possible_work_times cppwt ON cppwt.person_id=cp.id
>     LEFT JOIN cgwaj ON cgwaj.id=cppwt.generalworkavailability_id
>     -- …
>
> That is, add a CTE for each m:n-attached table whose “value” is
> an object, not a single field, keep the id field; LEFT JOIN that
> (instead of the original table), then we have a field to use in
> ORDER BY.
>
> I think. I’ve not yet tried it (I don’t have access to that DB
> normally, I was just helping out).
>
> This avoids sub-SELECTs in the sense of needing to run one for
> each user row, because the innermost JSON object building needs
> to be done for each (connected (if the query is not filtering on
> specific users)) row of the “property table”, anyway. (And even
> if filtered, that can be passed down.)
>
> bye,
> //mirabilos
> --
> Solange man keine schmutzigen Tricks macht, und ich meine *wirklich*
> schmutzige Tricks, wie bei einer doppelt verketteten Liste beide
> Pointer XORen und in nur einem Word speichern, funktioniert Boehm ganz
> hervorragend.        -- Andreas Bogk über boehm-gc in d.a.s.r
>

Alban Hertroys
--
There is always an exception to always.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Thorsten Glaser

Date:

03 March 2023, 19:32:14

On Fri, 3 Mar 2023, Alban Hertroys wrote:

>You can rewrite that into something like this:
>
>select jsonb_build_object('opening_times’,
>    obj
>    ORDER BY
>        obj->>'weekday’,
>        obj->>'from_hour’,
>        obj->>'to_hour')
>)
>from cot
>cross join lateral jsonb_agg(jsonb_build_object(
>        'weekday', cot.weekday,
>        'from_hour', cot.from_hour,
>        'to_hour', cot.to_hour) obj

But isn’t that the same as with a regular LEFT JOIN?

>>    cgwaj AS (
>>     SELECT cgwa.id AS id, jsonb_build_object(
>>         'weekday', cgwa.weekday,
>>         'forenoon', cgwa.forenoon,
>>         'afternoon', cgwa.afternoon,
>>         'evening', cgwa.evening) AS obj
>>     FROM core_generalworkavailability cgwa

plus

>>     LEFT JOIN cgwaj ON cgwaj.id=cppwt.generalworkavailability_id

With the addition that I can aggregate…

bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Alban Hertroys

Date:

04 March 2023, 10:48:24

> On 3 Mar 2023, at 20:32, Thorsten Glaser <tg@evolvis.org> wrote:
>
> On Fri, 3 Mar 2023, Alban Hertroys wrote:
>
>> You can rewrite that into something like this:
>>
>> select jsonb_build_object('opening_times’,
>>     obj
>>     ORDER BY
>>         obj->>'weekday’,
>>         obj->>'from_hour’,
>>         obj->>'to_hour')
>> )
>> from cot
>> cross join lateral jsonb_agg(jsonb_build_object(
>>         'weekday', cot.weekday,
>>         'from_hour', cot.from_hour,
>>         'to_hour', cot.to_hour) obj
>
> But isn’t that the same as with a regular LEFT JOIN?

Similar, but not the same, I’d say.

I do now notice that I made some copying errors there, I was a bit nauseous at that time.
That should have read:

>> select jsonb_build_object('opening_times’,
>>     jsonb_agg(obj
>>     ORDER BY
>>         obj->>'weekday’,
>>         obj->>'from_hour’,
>>         obj->>'to_hour')
>> )
>> from cot
>> cross join lateral jsonb_build_object(
>>         'weekday', cot.weekday,
>>         'from_hour', cot.from_hour,
>>         'to_hour', cot.to_hour) obj

The lateral join applies the function to each row returned from the left side of the join and enriches that row with
thefunction result. 
I used a cross join because there is no join condition to apply to the lateral, otherwise you could also use an inner
joinon true. 

I think you could also have used an implicit Cartesian product (using ‘,’ for the join), and that in that case the
lateralwould be implied. I prefer explicit notation though. 

A left join wouldn’t make much sense here, unless the function could return NULL - for example if it were a function
markedas STRICT and some of the input parameter values (from the table) could be NULL. 

>>>   cgwaj AS (
>>>     SELECT cgwa.id AS id, jsonb_build_object(
>>>         'weekday', cgwa.weekday,
>>>         'forenoon', cgwa.forenoon,
>>>         'afternoon', cgwa.afternoon,
>>>         'evening', cgwa.evening) AS obj
>>>     FROM core_generalworkavailability cgwa
>
> plus

There are some differences.

You need a sub-select, which in turn creates its own result set. It’s up to the planner whether the left or the right
sidegets executed first, after which the results of the other side of the join get merged to this, or whether this can
allbe collected in one go. That’s up to the query planner to decide though, and it could be right. 

>>>     LEFT JOIN cgwaj ON cgwaj.id=cppwt.generalworkavailability_id
>
> With the addition that I can aggregate…

You can do so in both situations, but I guess that confusion stems from my copy/paste mistake.

In my experience, lateral joins go well with the jsonb functions. They tend to reduce code repetition when referencing
objectmembers, such as in your case. 

Regards,

Alban Hertroys
--
There is always an exception to always.

Re: DISTINCT and ORDER BY in aggregate functions on expressions(!)y

From

Thorsten Glaser

Date:

04 March 2023, 20:00:33

On Sat, 4 Mar 2023, Alban Hertroys wrote:

>> But isn’t that the same as with a regular LEFT JOIN?
>
>Similar, but not the same, I’d say.
>
>I do now notice that I made some copying errors there, I was a bit
>nauseous at that time.

np, I’m under pollen attack currently so also not at my best.

>That should have read:
>
>>> select jsonb_build_object('opening_times’,
>>>     jsonb_agg(obj
>>>     ORDER BY
>>>         obj->>'weekday’,
>>>         obj->>'from_hour’,
>>>         obj->>'to_hour')
>>> )
>>> from cot
>>> cross join lateral jsonb_build_object(
>>>         'weekday', cot.weekday,
>>>         'from_hour', cot.from_hour,
>>>         'to_hour', cot.to_hour) obj
>
>The lateral join applies the function to each row returned from the
>left side of the join and enriches that row with the function result. I
>used a cross join because there is no join condition to apply to the
>lateral, otherwise you could also use an inner join on true.

Okay, so I got it now. But that’s still identical to the LEFT JOIN
which I’m using in that example, because…

>A left join wouldn’t make much sense here, unless the function could

… I have a LEFT JOIN already and can just use the CTE there,
so I don’t have to add an extra lateral join.

But good to know for the future/when I don’t have that.

>return NULL - for example if it were a function marked as STRICT and
>some of the input parameter values (from the table) could be NULL.

OK.

>You need a sub-select, which in turn creates its own result set. It’s
>up to the planner whether the left or the right side gets executed
>first, after which the results of the other side of the join get merged
>to this, or whether this can all be collected in one go. That’s up to
>the query planner to decide though, and it could be right.

OK, but that shouldn’t make a difference here as it needs to run over
all rows of the cgwa table anyway (possibly reduced by filtering on
users).

While not the case here, I see that for other entries the lateral
join would cause more work: for the “qualification” kinds of tables,
for example, the individual qualification table has very few rows
(these are entered by the site admin), but the m:n connection table
(whatever the correct name for these is) has a lot because many of
the users have many of these qualifications. If I use a CTE to add
a JSON object to the individual qualification table first, it doesn’t
run on each qualification multiple times; if I use a lateral join,
it possibly, modulo planner optimisations, runs the jsonb_build_object
function many times per qualification despite them all giving the same
result. And, even if the optimisations catch that, it’s mentally not
the same.

>In my experience, lateral joins go well with the jsonb functions. They
>tend to reduce code repetition when referencing object members, such as
>in your case.

Right. In my case I can get the same by adding a CTE instead though,
and it’s hard to see which is better, performance-wise.

This is a lot to take in, and I r̲e̲a̲l̲l̲y̲ appreciate the detailed
explanations given alongside ☻

bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)

Thread: DISTINCT *and* ORDER BY in aggregate functions on expressions(!)

Thread: DISTINCT and ORDER BY in aggregate functions on expressions(!)