Thread: Is EXISTS the most efficient approach for PostgreSql to check for existence of nodes in a tree?

I have a key value table in my Postgresql db, which represents hierarchical data through parent_feature_mapping column that points to id of feature_mapping_id column of the same table.

I need to select root nodes that has children which satisfy various conditions. The conditions may extend to children of children, so I'm trying to find roots of trees that contain paths that satisfy  the given constraints.

An example is finding the trees where the root node has type 'COMPOSITION' and root node's archetypeNodeId attribute has value 'openEHR-EHR-COMPOSITION.discharge.v1' another constraint is root node having a child of type 'CONTENTITEM' that in turn has a child of type 'ITEMSTRUCTURE'
All nodes in a tree have the same payload Id. The fastest query that I could write so far is given below.


SELECT root.id from path_value as root
    WHERE
        root.rm_type_name = 'COMPOSITION'
    AND
        root.feature_name = 'composition'
    AND
    EXISTS (SELECT 1 from path_value as anodeid
            WHERE
                    anodeId.parent_feature_mapping_id = root.feature_mapping_id
            AND       
                    anodeId.payload_id = root.payload_id
            AND
                    anodeId.feature_name = 'archetypeNodeId'
            AND
                    anodeId.val_string = 'openEHR-EHR-COMPOSITION.discharge.v1'
            LIMIT 1
            )
           
    AND
    EXISTS (SELECT 1 from path_value as node1
             WHERE
                node1.payload_id = root.payload_id             
            AND
                node1.parent_feature_mapping_id = root.feature_mapping_id
            AND
                node1.feature_name = 'content'
            AND
                node1.rm_type_name = 'CONTENTITEM'
            AND
            EXISTS (SELECT 1 from path_value as node2
                    WHERE
                    node2.payload_id = node1.payload_id
                    AND
                    node2.parent_feature_mapping_id = node1.feature_mapping_id
                    AND
                    node2.rm_type_name = 'ITEMSTRUCTURE'
                    LIMIT 1)
            LIMIT 1)

My question is: is this the best approach in terms of performance? This is an attempt to identify XML payloads that fit certain criteria. I have also considered using an ltree column that will contain the tree in a from that I can query as an alternative to sql based method, or I can use xpath queries on XML payload.

The create statement for my table is as follows:

CREATE TABLE public.path_value (
  val_string TEXT,
  feature_mapping_id INTEGER NOT NULL,
  parent_feature_mapping_id INTEGER,
  feature_name TEXT,
  rm_type_name TEXT,
  path INTEGER NOT NULL,
  payload_id INTEGER NOT NULL,
  id INTEGER NOT NULL,
  ehr_id INTEGER,
  CONSTRAINT path_value_pkey PRIMARY KEY(id)
) WITHOUT OIDS;


Best regards
Seref

On 05/17/2012 03:06 AM, Seref Arikan wrote:
I have a key value table in my Postgresql db, which represents hierarchical data through parent_feature_mapping column that points to id of feature_mapping_id column of the same table.

I need to select root nodes that has children which satisfy various conditions. The conditions may extend to children of children, so I'm trying to find roots of trees that contain paths that satisfy  the given constraints.

An example is finding the trees where the root node has type 'COMPOSITION' and root node's archetypeNodeId attribute has value 'openEHR-EHR-COMPOSITION.discharge.v1' another constraint is root node having a child of type 'CONTENTITEM' that in turn has a child of type 'ITEMSTRUCTURE'
All nodes in a tree have the same payload Id. The fastest query that I could write so far is given below.


SELECT root.id from path_value as root
    WHERE
        root.rm_type_name = 'COMPOSITION'
    AND
        root.feature_name = 'composition'
    AND
    EXISTS (SELECT 1 from path_value as anodeid
            WHERE
                    anodeId.parent_feature_mapping_id = root.feature_mapping_id
            AND       
                    anodeId.payload_id = root.payload_id
            AND
                    anodeId.feature_name = 'archetypeNodeId'
            AND
                    anodeId.val_string = 'openEHR-EHR-COMPOSITION.discharge.v1'
            LIMIT 1
            )
           
    AND
    EXISTS (SELECT 1 from path_value as node1
             WHERE
                node1.payload_id = root.payload_id             
            AND
                node1.parent_feature_mapping_id = root.feature_mapping_id
            AND
                node1.feature_name = 'content'
            AND
                node1.rm_type_name = 'CONTENTITEM'
            AND
            EXISTS (SELECT 1 from path_value as node2
                    WHERE
                    node2.payload_id = node1.payload_id
                    AND
                    node2.parent_feature_mapping_id = node1.feature_mapping_id
                    AND
                    node2.rm_type_name = 'ITEMSTRUCTURE'
                    LIMIT 1)
            LIMIT 1)

My question is: is this the best approach in terms of performance? This is an attempt to identify XML payloads that fit certain criteria. I have also considered using an ltree column that will contain the tree in a from that I can query as an alternative to sql based method, or I can use xpath queries on XML payload.

The create statement for my table is as follows:

CREATE TABLE public.path_value (
  val_string TEXT,
  feature_mapping_id INTEGER NOT NULL,
  parent_feature_mapping_id INTEGER,
  feature_name TEXT,
  rm_type_name TEXT,
  path INTEGER NOT NULL,
  payload_id INTEGER NOT NULL,
  id INTEGER NOT NULL,
  ehr_id INTEGER,
  CONSTRAINT path_value_pkey PRIMARY KEY(id)
) WITHOUT OIDS;


Best regards
Seref

Any other constraints or indexes on that table?
Trying to reply to Rob:

Apologies if this does not end up in the thread (gmail is just driving me mad, I can't seem to receive messages, so I've subscribed again)

For some reason Limit 1 cause my query to go on for minutes without a response, which was not the case.

The following query takes about 10.6 seconds, with a table that has about 20 million rows. First, the indices of the table with create statement:

CREATE TABLE public.path_value (
  val_string TEXT,
  feature_mapping_id INTEGER NOT NULL,
  parent_feature_mapping_id INTEGER,
  feature_name TEXT,
  rm_type_name TEXT,
  path INTEGER NOT NULL,
  payload_id INTEGER NOT NULL,
  id INTEGER NOT NULL,
  ehr_id INTEGER,
  CONSTRAINT path_value_pkey PRIMARY KEY(id)
) WITHOUT OIDS;

CREATE INDEX path_value_idx ON public.path_value
  USING btree (rm_type_name COLLATE pg_catalog."default", feature_name COLLATE pg_catalog."default");

CREATE INDEX path_value_idx1 ON public.path_value
  USING btree (feature_mapping_id, parent_feature_mapping_id);

CREATE INDEX path_value_idx2 ON public.path_value
  USING btree (payload_id);


Now results of EXPLAIN ANALYZE:

"QUERY PLAN"
"Nested Loop Semi Join  (cost=543422.34..571013.45 rows=5 width=4) (actual time=5854.267..10684.798 rows=82003 loops=1)"
"  Join Filter: (root.feature_mapping_id = anodeid.parent_feature_mapping_id)"
"  ->  Merge Join  (cost=543422.34..543426.29 rows=12 width=24) (actual time=5854.182..5953.531 rows=82003 loops=1)"
"        Merge Cond: ((root.feature_mapping_id = node1.parent_feature_mapping_id) AND (root.payload_id = node2.payload_id))"
"        ->  Sort  (cost=1150.65..1151.35 rows=283 width=12) (actual time=359.940..380.707 rows=82003 loops=1)"
"              Sort Key: root.feature_mapping_id, root.payload_id"
"              Sort Method: external merge  Disk: 1768kB"
"              ->  Index Scan using path_value_idx on path_value root  (cost=0.00..1139.12 rows=283 width=12) (actual time=0.073..261.240 rows=82003 loops=1)"
"                    Index Cond: ((rm_type_name = 'COMPOSITION'::text) AND (feature_name = 'composition'::text))"
"        ->  Sort  (cost=542271.69..542272.31 rows=247 width=12) (actual time=5494.236..5516.788 rows=82003 loops=1)"
"              Sort Key: node1.parent_feature_mapping_id, node2.payload_id"
"              Sort Method: external sort  Disk: 2088kB"
"              ->  HashAggregate  (cost=542259.40..542261.87 rows=247 width=12) (actual time=5377.761..5394.453 rows=82003 loops=1)"
"                    ->  Hash Semi Join  (cost=506297.37..542253.38 rows=1205 width=12) (actual time=2788.758..5325.838 rows=164004 loops=1)"
"                          Hash Cond: ((node1.payload_id = node2.payload_id) AND (node1.feature_mapping_id = node2.parent_feature_mapping_id))"
"                          ->  Bitmap Heap Scan on path_value node1  (cost=77.61..9421.62 rows=2464 width=12) (actual time=76.628..2430.473 rows=246005 loops=1)"
"                                Recheck Cond: ((rm_type_name = 'CONTENTITEM'::text) AND (feature_name = 'content'::text))"
"                                ->  Bitmap Index Scan on path_value_idx  (cost=0.00..77.00 rows=2464 width=0) (actual time=75.640..75.640 rows=246005 loops=1)"
"                                      Index Cond: ((rm_type_name = 'CONTENTITEM'::text) AND (feature_name = 'content'::text))"
"                          ->  Hash  (cost=498674.38..498674.38 rows=399092 width=8) (actual time=2711.653..2711.653 rows=410015 loops=1)"
"                                Buckets: 4096  Batches: 16  Memory Usage: 1013kB"
"                                ->  Bitmap Heap Scan on path_value node2  (cost=10345.32..498674.38 rows=399092 width=8) (actual time=71.718..2607.467 rows=410015 loops=1)"
"                                      Recheck Cond: (rm_type_name = 'ITEMSTRUCTURE'::text)"
"                                      ->  Bitmap Index Scan on path_value_idx  (cost=0.00..10245.55 rows=399092 width=0) (actual time=69.634..69.634 rows=410015 loops=1)"
"                                            Index Cond: (rm_type_name = 'ITEMSTRUCTURE'::text)"
"  ->  Index Scan using path_value_idx2 on path_value anodeid  (cost=0.00..2298.91 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=82003)"
"        Index Cond: (payload_id = node2.payload_id)"
"        Filter: ((feature_name = 'archetypeNodeId'::text) AND (val_string = 'openEHR-EHR-COMPOSITION.discharge.v1'::text))"
"Total runtime: 10692.277 ms"

and finaly, the query that gave this result:

EXPLAIN ANALYZE SELECT root.id from path_value as root
    WHERE
        root.rm_type_name = 'COMPOSITION'
    AND
        root.feature_name = 'composition'
    AND
    EXISTS (SELECT anodeid.id from path_value as anodeid
            WHERE
                    anodeId.parent_feature_mapping_id = root.feature_mapping_id
            AND       
                    anodeId.payload_id = root.payload_id
            AND
                    anodeId.feature_name = 'archetypeNodeId'
            AND
                    anodeId.val_string = 'openEHR-EHR-COMPOSITION.discharge.v1'           
            )
           
    AND
    EXISTS (SELECT 1 from path_value as node1
             WHERE
                node1.payload_id = root.payload_id             
            AND
                node1.parent_feature_mapping_id = root.feature_mapping_id
            AND
                node1.feature_name = 'content'
            AND
                node1.rm_type_name = 'CONTENTITEM'
            AND
            EXISTS (SELECT 1 from path_value as node2
                    WHERE
                    node2.payload_id = node1.payload_id
                    AND
                    node2.parent_feature_mapping_id = node1.feature_mapping_id
                    AND
                    node2.rm_type_name = 'ITEMSTRUCTURE'
                    )
            )

Is there a glaring error in my approach? Should I be better off with another SQL query, or Ltree/XPATH queries?

Best regards
Seref







On Thu, May 17, 2012 at 8:40 PM, Seref Arikan <serefarikan@gmail.com> wrote:
> Is there a glaring error in my approach? Should I be better off with another
> SQL query, or Ltree/XPATH queries?

For the particular query you posted, I would suggest the following indexes:

(rm_type_name, payload_id, parent_feature_mapping_id)
And maybe:
(rm_type_name, feature_name, payload_id, parent_feature_mapping_id)

But overall, storing a hierarchical XML structure as rows in a table
might not be the best approach. If performance is problematic, you
might consider storing whole XML documents -- or fragments -- in an
xml field and create expression indexes for the queries that you need,
possibly with GIN/GiST.

Now I haven't needed to do this myself, so what follows is just me
trying out stuff to give you some ideas and certainly not "best
practice" -- there are lots of different indexing strategies and
different ways to do this.

For example:

CREATE TABLE foo (doc_id serial primary key, doc xml not null);
CREATE INDEX foo_doc_id_exists_root_element_test ON foo (doc_id) WHERE
xpath_exists('/root/element[text()="test"]', doc);
CREATE INDEX foo_root_element_text_gin ON foo USING
gin((xpath('/root/element/text()', doc)::text[]));

To find documents which have <element>test</element>, using the above indexes:

# explain analyze select * from foo where
xpath_exists('/root/element[text()="test"]', doc);
 Bitmap Heap Scan on foo  (cost=3.33..450.22 rows=4311 width=36)
(actual time=0.025..0.026 rows=1 loops=1)
   Recheck Cond: xpath_exists('/root/element[text()="test"]'::text,
doc, '{}'::text[])
   ->  Bitmap Index Scan on foo_doc_id_exists_root_element_test
(cost=0.00..2.26 rows=4311 width=0) (actual time=0.014..0.014 rows=1
loops=1)
 Total runtime: 0.067 ms

# explain analyze select * from foo where
(xpath('/root/element/text()', doc)::text[]) @> array['test'];
 Bitmap Heap Scan on foo  (cost=8.50..105.51 rows=65 width=32) (actual
time=0.025..0.025 rows=1 loops=1)
   Recheck Cond: ((xpath('/root/element/text()'::text, doc,
'{}'::text[]))::text[] @> '{test}'::text[])
   ->  Bitmap Index Scan on foo_root_element_text_gin
(cost=0.00..8.49 rows=65 width=0) (actual time=0.020..0.020 rows=1
loops=1)
         Index Cond: ((xpath('/root/element/text()'::text, doc,
'{}'::text[]))::text[] @> '{test}'::text[])
 Total runtime: 0.046 ms
(5 rows)

The GIN index lets you search for documents that have both "test" and "testing":
(xpath('/root/element/text()', doc)::text[]) @> array['test','testing'];

Regards,
Marti

Hi Marti,
Thanks, this is exactly the kind of feedback I was looking for.
I am already storing the whole XML in a payload table actually. My problem is, the queries are actually created in a domain specific langauge, and then they are transformed to SQL.
There is a no way of knowing what kind of queries would be run over the XML docs, so I'd be creating indices over and over with each incoming new query. Still, I may be able to find a path in the middle, maybe using a combination of the XML path based index and row based representation. Thaks for the pointers to relevant index types.

Finally, ltree may be an alternative to xpath based indices, but I don't know if that would be faster. The database may need to go beyond 100 milion rows, and I'm not sure what would happen then with a row based representation. I'll probably generate dummy data and compare performance of the options.

Kind regards
Seref


On Fri, May 18, 2012 at 10:55 AM, Marti Raudsepp <marti@juffo.org> wrote:
On Thu, May 17, 2012 at 8:40 PM, Seref Arikan <serefarikan@gmail.com> wrote:
> Is there a glaring error in my approach? Should I be better off with another
> SQL query, or Ltree/XPATH queries?

For the particular query you posted, I would suggest the following indexes:

(rm_type_name, payload_id, parent_feature_mapping_id)
And maybe:
(rm_type_name, feature_name, payload_id, parent_feature_mapping_id)

But overall, storing a hierarchical XML structure as rows in a table
might not be the best approach. If performance is problematic, you
might consider storing whole XML documents -- or fragments -- in an
xml field and create expression indexes for the queries that you need,
possibly with GIN/GiST.

Now I haven't needed to do this myself, so what follows is just me
trying out stuff to give you some ideas and certainly not "best
practice" -- there are lots of different indexing strategies and
different ways to do this.

For example:

CREATE TABLE foo (doc_id serial primary key, doc xml not null);
CREATE INDEX foo_doc_id_exists_root_element_test ON foo (doc_id) WHERE
xpath_exists('/root/element[text()="test"]', doc);
CREATE INDEX foo_root_element_text_gin ON foo USING
gin((xpath('/root/element/text()', doc)::text[]));

To find documents which have <element>test</element>, using the above indexes:

# explain analyze select * from foo where
xpath_exists('/root/element[text()="test"]', doc);
 Bitmap Heap Scan on foo  (cost=3.33..450.22 rows=4311 width=36)
(actual time=0.025..0.026 rows=1 loops=1)
  Recheck Cond: xpath_exists('/root/element[text()="test"]'::text,
doc, '{}'::text[])
  ->  Bitmap Index Scan on foo_doc_id_exists_root_element_test
(cost=0.00..2.26 rows=4311 width=0) (actual time=0.014..0.014 rows=1
loops=1)
 Total runtime: 0.067 ms

# explain analyze select * from foo where
(xpath('/root/element/text()', doc)::text[]) @> array['test'];
 Bitmap Heap Scan on foo  (cost=8.50..105.51 rows=65 width=32) (actual
time=0.025..0.025 rows=1 loops=1)
  Recheck Cond: ((xpath('/root/element/text()'::text, doc,
'{}'::text[]))::text[] @> '{test}'::text[])
  ->  Bitmap Index Scan on foo_root_element_text_gin
(cost=0.00..8.49 rows=65 width=0) (actual time=0.020..0.020 rows=1
loops=1)
        Index Cond: ((xpath('/root/element/text()'::text, doc,
'{}'::text[]))::text[] @> '{test}'::text[])
 Total runtime: 0.046 ms
(5 rows)

The GIN index lets you search for documents that have both "test" and "testing":
(xpath('/root/element/text()', doc)::text[]) @> array['test','testing'];

Regards,
Marti