Re: Puzzline CROSS JOIN when doing RECURSIVE CTE - Mailing list pgsql-general

From Alban Hertroys
Subject Re: Puzzline CROSS JOIN when doing RECURSIVE CTE
Date
Msg-id 0E6648DF-32A4-426C-BCE2-C02392CADDD2@gmail.com
Whole thread Raw
In response to Re: Puzzline CROSS JOIN when doing RECURSIVE CTE  (Pól Ua Laoínecháin <linehanp@tcd.ie>)
List pgsql-general
> On 18 Apr 2022, at 14:51, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote:
>
> Hi Alban, and many thanks for your input.
>
>> My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a
tree),unless perhaps you intended to actually traverse the HTML document hierarchy? 
>
> This is basically an exercise on my part.
>
> The question that I'm trying to answer  is here:
>
> https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql
>
> I've already answered it in 3 different ways - but I was trying to do
> it with RCTEs in order to improve my comprehension of them.
>
> So, basically, I want to pick out a subsection of text from a "passage".
>
> So then, I wanted to establish a true/false state for the lines that I
> want and don't want, going through line by line. I know that the RCTE
> is  a very contrived way of doing this, but it's for learning really.

Considering that you’re already looking at the elements of a parsed DOM tree, the exercise boils down to traversing
thattree. Due to how xmlparse() is implemented, you probably already get them in the right order even when not using an
explicitorder by. That is, if you’re looking for a DFT (depth first traversal) as opposed to a BFT (breadth first). 

One of the difficulties here is that there are some CDATA sections involved with more XML in them. My guess is that
that’sthe data that you’re actually after, but that’s just a matter of entering the document with the correct path I
suppose?


> I wonder if you could be so kind as to give me  a "skeleton" RCTE for
> this - I've been staring at this for hours - and it's not that I'm
> lazy or haven't studied RCTEs - I wrote this RCTE
>
> https://stackoverflow.com/a/71674990/470530
>
> recently, so it's not as if I'm completely ignorant of RCTEs - I'm
> just stuck in a rut. Any help would be appreciated.

You would first need to determine the root node(s). Those are the ones w/o parents, or you may have some other way of
determiningthose. 

Next is finding all nodes that have an earlier node as their parent.
You could go an extra step here with preserving the order of the siblings in the document, by numbering nodes
(directly)under the same parent. 
I usually build an ltree structure with that information, while traversing the tree - that gets you an ltree with
entries(1, 1.1, 1.1.1, 1.1.2, 1.2.1, etc) that you then can use for the final order by, for example. 

In case you didn’t know, ltree is a module you can install. I find it still very useful in tree traversals. The one
drawbackI see is that for these scenario’s you’d ideally want an ltree based on integers, such that 10 sorts after 9
insteadof between 1 and 2. Padding enough zeroes before the ltree text items is a bit of an extra hassle that I’d
preferto do without. 

I haven’t actually looked at what DOM navigation functions exist for PG, so this is more or less pseudo code. Worse, my
localcopy of PG was compiled w/o XML support, so I don’t know what kind of result the query from that SO article
produces.But then again, I don’t really know what you’re after anyway, so... 

This is basically how I would go about it.

with recursive
-- First we need to get the DOM-tree parsed (this is not actually recursive)
domtree as (
    select node
      from xmlparse(document(‘<root>...</root>'))
),
-- Next we can traverse it
cte (node, hierarchy, n) as (
    select node, 1::text::ltree, 1
      from domtree
     where parent(node) is null

    union all

    select node, cte.hierarchy || (cte.n+1)::text::ltree, n+1
      from domtree t
      join cte on parent(t.node) = cte.node
)
select *
  from cte
 order by hierarchy;

Function parent() is made-up. It would return the parent node of a node, so that there is some way to connect the
differentparts in the hierarchy. I guess xpath() could fulfil that purpose, but I have no way of testing that
hypothesis.

I hope that’s a good enough starting point for you?


Alban Hertroys
--
There is always an exception to always.







pgsql-general by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Feedback about hybrid SAN snap and rsync'd approach for large systemcloning
Next
From: "Pete O'Such"
Date:
Subject: No psql md5 auth, psql 14.1 to PG 11