Re: [GENERAL] The best way to deal with hierarchical data (was:Postgresql query HAVING do not work) - Mailing list pgsql-general

From Vitaly Burovoy
Subject Re: [GENERAL] The best way to deal with hierarchical data (was:Postgresql query HAVING do not work)
Date
Msg-id CAKOSWNmFLnxdct=v5A=f=GzWMpT7opwtVY0qZzgpeeetb5Z47A@mail.gmail.com
Whole thread Raw
List pgsql-general
On 1/4/17, Gwork <nnj@riseup.net> wrote:
> On 1/5/17 2:51 AM, Vitaly Burovoy wrote:
>> On 1/4/17, Gwork <nnj@riseup.net> wrote:
>>> On 1/5/17 2:22 AM, Vitaly Burovoy wrote:
>>>> On 1/4/17, Vitaly Burovoy <vitaly.burovoy@gmail.com> wrote:
>>>>> On 1/4/17, Gwork <nnj@riseup.net> wrote:
>>>>>> Version: Postgresql 9.5
>>>>>> OS: Debian 8 jessie run on docker
>>>>>>
>>>>>> Following this tutorial The Nested Set Model on
>>>>>> http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
>>>>>>
>>>>>>
>>>>>> Section: Depth of a Sub-Tree.
>>>>>> SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS
>>>>>> depth
>>>>>> FROM nested_category AS node,
>>>>>>         nested_category AS parent,
>>>>>>         nested_category AS sub_parent,
>>>>>>         (
>>>>>>                 SELECT node.name, (COUNT(parent.name) - 1) AS depth
>>>>>>                 FROM nested_category AS node,
>>>>>>                 nested_category AS parent
>>>>>>                 WHERE node.lft BETWEEN parent.lft AND parent.rgt
>>>>>>                 AND node.name = 'PORTABLE ELECTRONICS'
>>>>>>                 GROUP BY node.name, node.lft
>>>>>>                 ORDER BY node.lft
>>>>>>         )AS sub_tree
>>>>>> WHERE node.lft BETWEEN parent.lft AND parent.rgt
>>>>>>         AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
>>>>>>         AND sub_parent.name = sub_tree.name
>>>>>> GROUP BY node.name, node.lft, sub_tree.depth
>>>>>> ORDER BY node.lft;
>>>>>> +----------------------+---------+
>>>>>> | name                 |   depth |
>>>>>> |----------------------+---------|
>>>>>> | PORTABLE ELECTRONICS |       0 |
>>>>>> | MP3 PLAYERS          |       1 |
>>>>>> | FLASH                |       2 |
>>>>>> | CD PLAYERS           |       1 |
>>>>>> | 2 WAY RADIOS         |       1 |
>>>>>> +----------------------+---------+
>>>>>>
>>>>>>
>>>>>> Section: Find the Immediate Subordinates of a Node.
>>>>>> SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS
>>>>>> depth
>>>>>> FROM nested_category AS node,
>>>>>>         nested_category AS parent,
>>>>>>         nested_category AS sub_parent,
>>>>>>         (
>>>>>>                 SELECT node.name, (COUNT(parent.name) - 1) AS depth
>>>>>>                 FROM nested_category AS node,
>>>>>>                 nested_category AS parent
>>>>>>                 WHERE node.lft BETWEEN parent.lft AND parent.rgt
>>>>>>                 AND node.name = 'PORTABLE ELECTRONICS'
>>>>>>                 GROUP BY node.name, node.lft
>>>>>>                 ORDER BY node.lft
>>>>>>         )AS sub_tree
>>>>>> WHERE node.lft BETWEEN parent.lft AND parent.rgt
>>>>>>         AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
>>>>>>         AND sub_parent.name = sub_tree.name
>>>>>> GROUP BY node.name, node.lft, sub_tree.depth
>>>>>> HAVING depth <= 1
>>>>>> ORDER BY node.lft;
>>>>>> Adding 'HAVING depth <= 1' to the query still return the same results
>>>>>> as
>>>>>> above instead of this:
>>>>>> +----------------------+---------+
>>>>>> | name                 |   depth |
>>>>>> |----------------------+---------|
>>>>>> | PORTABLE ELECTRONICS |       0 |
>>>>>> | MP3 PLAYERS          |       1 |
>>>>>> | FLASH                |       1 |
>>>>>> | CD PLAYERS           |       1 |
>>>>>> | 2 WAY RADIOS         |       1 |
>>>>>> +----------------------+---------+
>>>>>>
>>>>>> I don't know if I'm doing anything wrong?
>>>>>>
>>>>>> Note: Edit the post query by adding node.lft, sub_tree.depth to the
>>>>>> GROUP BY.
>>>>> Hello, Gwork,
>>>>>
>>>>> HAVING works fine, it is just confusing because of naming. HAVING
>>>>> works with column names from sources (which is "sub_tree.depth" in
>>>>> your example), not with names of final columns (because they get
>>>>> aliases later).
>>>>>
>>>>> You can check it adding depth to your SELECT part:
>>>>> SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
>>>>>     ,array_agg(depth)
>>>>> FROM nested_category AS node,
>>>>> ...
>>>>>
>>>>> and you can see that values there are not bigger than 1.
>>>>>
>>>>> You must use the same expression in HAVING clause as in SELECT one to
>>>>> get what you want:
>>>>> HAVING (COUNT(parent.name) - (sub_tree.depth + 1)) <= 1
>>>>>
>>>>> but the result will not have "FLASH" because it has "2" even in your
>>>>> example.
>>>>> +----------------------+-------+
>>>>> |         name         | depth |
>>>>> +----------------------+-------+
>>>>> | PORTABLE ELECTRONICS |     0 |
>>>>> | MP3 PLAYERS          |     1 |
>>>>> | CD PLAYERS           |     1 |
>>>>> | 2 WAY RADIOS         |     1 |
>>>>> +----------------------+-------+
>>>>> (4 rows)
>>>> I'm sorry, forgot to mention: If you want to deal with hierarchical
>>>> data, Postgres has better solution - recursive query[1]. When you
>>>> understand principles, it will be much easier for you to write queries
>>>> instead of mentioned in the article.
>>>>
>>>> For example, "Retrieving a Single Path" from "Adjacency model" can be
>>>> written as:
>>>> WITH RECURSIVE
>>>> sel(name, parent, depth) AS (
>>>>     SELECT name, parent, 0 FROM category WHERE name='FLASH'
>>>>     UNION ALL
>>>>     SELECT c.name, c.parent, depth + 1 FROM category c, sel WHERE
>>>> c.category_id=sel.parent
>>>> )
>>>> SELECT name FROM sel
>>>> ORDER BY depth DESC;
>>>>
>>>> which gives the same result and not depends on "parent.lft" which
>>>> don't have to increase.
>>>>
>>>> Moreover, you don't need to lock a table when you change data and you
>>>> can even add a constraint to keep consistency:
>>>> ALTER TABLE category ADD FOREIGN KEY (parent) REFERENCES
>>>> category(category_id) ON UPDATE CASCADE ON DELETE RESTRICT;
>>>>
>>>> [1]https://www.postgresql.org/docs/current/static/queries-with.html
>>>>
>>> Hi Vitaly,
>>>
>>> Your first solution worked great!
>>>
>>> I'll like try your second suggestion, I feel is gonna be a better
>>> solution
>>> very important to eliminate lock while updating table.
>>>
>>> I'll keep you posted if I have any further issue relating to the query.
>>>
>>> Thank you for helping out.
>> Feel free to ask, but do not forget to add the mailing list in CC (via
>> "Reply to all").
>> Other people (new users) also can be interested in ways to solve issues.
>>
>> P.S. Moved from -bugs[2] to -general.
>>
>> [2]https://www.postgresql.org/message-id/flat/7582ea1e-6146-fd8d-b564-c2fe251210b2%40riseup.net
>
> Looking at tutorial I can not replicate those queries to Postgresql
> without serious editing. But, I simply want to create a hierarchical
> model tree that look like Amazon.
>
> What's your general solution on that can work better and easy to
> maintain than Nested Set Model with update lock?

Of course I'd use adjacency list model because of simplicity and
therefore harder to make a mistake.
You can't adapt Nested Set Model queries to Adjacency List Model with
recursive queries by small changes because their bases are very
different.

The article hides several troubles you'll impact with later:
1. What if you have to rename a category but want to get results sorted?
2. What if you have to sort depending on user's locale (internationalization)?
3. What if you have to change a "parent" of one of a category?
4. I made a mistake: "parent.lft" is always increasing, because in DML
they use _two_ UPDATEs of all the rightmost rows. If a user has to
insert a new 2nd level category "Books", he has to update twice _all_
the table (except the first row), then insert a new row.
5. There is "GROUP BY name" clause used which is longer than comparing
integers (IDs); see [3], i.e. it is wise not to get examples "as is"
from everywhere.

If you make a mistake in SQL-select in adjacency list model, you just
replace it but it does not touch your data.
But if you make a mistake in DML in nested set model, it is hard to
restore the original linkage.

Recursive queries sometimes are more complicated (I've just wrote
analog for "Aggregate Functions in a Nested Set"), but in such cases
they can be cached (for example in materialized views, which can have
indexes and can be refreshed concurrently[4] without locking).

Some queries are much simple. "Find the Immediate Subordinates of a
Node" looks that way:
WITH RECURSIVE
sel(name, category_id, parent, depth, order_) AS (
    SELECT name::text, category_id, parent, 0
        , array[row_number()OVER(ORDER BY name)]
    from category where name='PORTABLE ELECTRONICS'

    UNION ALL

    SELECT c.name::text, c.category_id, c.parent, depth + 1
        , order_ || row_number()OVER(ORDER BY order_, c.name)
    from category c, sel where sel.category_id=c.parent
        AND depth < 1  -- max_depth is changed here
)
SELECT name, depth
FROM sel
-- WHERE depth <> 0  -- without root node
ORDER BY order_;

Note there is no GROUP BY (the original article has two of them and
four walks through the "nested_category" table).


P.S.: please, don't top-post, it is against the local etiquette[5].

[3]http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/#comment-4489
[4]https://www.postgresql.org/docs/current/static/sql-refreshmaterializedview.html
[5]https://wiki.postgresql.org/wiki/Mailing_Lists#Email_etiquette_mechanics
--
Best regards,
Vitaly Burovoy


pgsql-general by date:

Previous
From: Gwork
Date:
Subject: Re: [GENERAL] [BUGS] Postgresql query HAVING do not work
Next
From: marcin kowalski
Date:
Subject: Re: [GENERAL] vacuum of empty table slows down as database tablecount grows