Re: SQL Property Graph Queries (SQL/PGQ) - Mailing list pgsql-hackers
| From | Henson Choi |
|---|---|
| Subject | Re: SQL Property Graph Queries (SQL/PGQ) |
| Date | |
| Msg-id | CAAAe_zBAR+X=c_QWDhGm=anDN184PGGHs6fKYwCYBo0Wf0Ju9A@mail.gmail.com Whole thread Raw |
| In response to | Re: SQL Property Graph Queries (SQL/PGQ) (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
| List | pgsql-hackers |
Subject: Re: SQL Property Graph Queries (SQL/PGQ)
Hi hackers,
Apologies for jumping into the SQL/PGQ discussion mid-stream. I've been
working on implementing the LABELS() graph element function and ran into
an issue I'd like to get some guidance on.
== What LABELS() does ==
LABELS(element_variable) returns a text[] array containing all labels
associated with a graph element. For example:
SELECT * FROM GRAPH_TABLE (myshop
MATCH (n IS customers)
COLUMNS (n.name, LABELS(n) AS lbls)
);
name | lbls
------------+-------------
customer1 | {customers}
customer2 | {customers}
== Implementation approach ==
A naive approach would return LABELS() as a constant (e.g., ARRAY['customers']
directly). However, this prevents the optimizer from pushing down predicates
involving LABELS() through UNION ALL branches.
To enable optimizer pushdown, I wrap each element table in a subquery that
adds a virtual __labels__ column containing a constant array. This way,
LABELS(n) returns a Var referencing that column, allowing the optimizer to
evaluate it per-branch and prune accordingly:
-- Conceptually, the rewrite produces:
SELECT name, __labels__ AS lbls FROM (
SELECT *, ARRAY['customers']::text[] AS __labels__ FROM customers
) ...
This allows the optimizer to perform constant folding and prune UNION ALL
branches when filtering by label:
-- Filter in outer WHERE clause
EXPLAIN SELECT * FROM GRAPH_TABLE (myshop
MATCH (n IS lists) -- lists label shared by orders and wishlists
COLUMNS (LABELS(n) AS lbls, n.node_id)
) WHERE 'orders' = ANY(lbls);
QUERY PLAN
------------------------------------------------------
Seq Scan on orders
Output: '{orders,lists}'::text[], orders.order_id
-- Filter in MATCH WHERE clause without IS (both tables scanned)
EXPLAIN SELECT * FROM GRAPH_TABLE (myshop
MATCH (n) WHERE 'lists' = ANY(LABELS(n))
COLUMNS (LABELS(n) AS lbls, n.node_id)
);
QUERY PLAN
------------------------------------------------------
Append
-> Seq Scan on orders
Output: '{orders,lists}'::text[], orders.order_id
-> Seq Scan on wishlists
Output: '{lists,wishlists}'::text[], wishlists.wishlist_id
The first example prunes the wishlists branch since 'orders' is not in its
label set. The second example uses LABELS() without an IS clause to filter
across all element tables that have the 'lists' label.
== The problem ==
The subquery wrapping breaks column-level privilege checking.
Test case from privileges.sql:
-- regress_priv_user4 has SELECT on atest5(one, four) only
-- lttc label maps atest5.three -> lttck (no privilege)
SET ROLE regress_priv_user4;
SELECT * FROM graph_table (ptg1 MATCH (v IS lttc) COLUMNS (v.lttck));
-- Expected: ERROR: permission denied for table atest5
-- Actual: query succeeds (returns rows without error)
I haven't yet identified the exact root cause of this issue. Has anyone
encountered a similar issue? Any pointers would be appreciated.
Thanks,
Henson
Hi hackers,
Apologies for jumping into the SQL/PGQ discussion mid-stream. I've been
working on implementing the LABELS() graph element function and ran into
an issue I'd like to get some guidance on.
== What LABELS() does ==
LABELS(element_variable) returns a text[] array containing all labels
associated with a graph element. For example:
SELECT * FROM GRAPH_TABLE (myshop
MATCH (n IS customers)
COLUMNS (n.name, LABELS(n) AS lbls)
);
name | lbls
------------+-------------
customer1 | {customers}
customer2 | {customers}
== Implementation approach ==
A naive approach would return LABELS() as a constant (e.g., ARRAY['customers']
directly). However, this prevents the optimizer from pushing down predicates
involving LABELS() through UNION ALL branches.
To enable optimizer pushdown, I wrap each element table in a subquery that
adds a virtual __labels__ column containing a constant array. This way,
LABELS(n) returns a Var referencing that column, allowing the optimizer to
evaluate it per-branch and prune accordingly:
-- Conceptually, the rewrite produces:
SELECT name, __labels__ AS lbls FROM (
SELECT *, ARRAY['customers']::text[] AS __labels__ FROM customers
) ...
This allows the optimizer to perform constant folding and prune UNION ALL
branches when filtering by label:
-- Filter in outer WHERE clause
EXPLAIN SELECT * FROM GRAPH_TABLE (myshop
MATCH (n IS lists) -- lists label shared by orders and wishlists
COLUMNS (LABELS(n) AS lbls, n.node_id)
) WHERE 'orders' = ANY(lbls);
QUERY PLAN
------------------------------------------------------
Seq Scan on orders
Output: '{orders,lists}'::text[], orders.order_id
-- Filter in MATCH WHERE clause without IS (both tables scanned)
EXPLAIN SELECT * FROM GRAPH_TABLE (myshop
MATCH (n) WHERE 'lists' = ANY(LABELS(n))
COLUMNS (LABELS(n) AS lbls, n.node_id)
);
QUERY PLAN
------------------------------------------------------
Append
-> Seq Scan on orders
Output: '{orders,lists}'::text[], orders.order_id
-> Seq Scan on wishlists
Output: '{lists,wishlists}'::text[], wishlists.wishlist_id
The first example prunes the wishlists branch since 'orders' is not in its
label set. The second example uses LABELS() without an IS clause to filter
across all element tables that have the 'lists' label.
== The problem ==
The subquery wrapping breaks column-level privilege checking.
Test case from privileges.sql:
-- regress_priv_user4 has SELECT on atest5(one, four) only
-- lttc label maps atest5.three -> lttck (no privilege)
SET ROLE regress_priv_user4;
SELECT * FROM graph_table (ptg1 MATCH (v IS lttc) COLUMNS (v.lttck));
-- Expected: ERROR: permission denied for table atest5
-- Actual: query succeeds (returns rows without error)
I haven't yet identified the exact root cause of this issue. Has anyone
encountered a similar issue? Any pointers would be appreciated.
Thanks,
Henson
2025년 12월 18일 (목) PM 6:16, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>님이 작성:
On Wed, Dec 17, 2025 at 2:28 PM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 17.12.25 06:32, Ashutosh Bapat wrote:
> > On Mon, Dec 15, 2025 at 6:43 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >>
> >> Rebased patches on the latest HEAD which required me to move
> >> graph_table.sql to another parallel group.
> >
> > Huh, the movement resulted in losing that test from parallel_schedule.
> > Fixed in the attached patches.
>
> A couple of items:
>
> 1) I was running some tests that involved properties with mismatching
> typmods, and I got an error message like
>
> ERROR: property "p1" data type modifier mismatch: 14 vs. 19
>
> but the actual types were varchar(10) and varchar(15). So to improve
> that, we need to run these through the typmod formatting routines, not
> just print the raw typmod numbers. I actually just combined that with
> the check for the type itself. Also, there was no test covering this,
> so I added one. See attached patches.
+1. The error message is better.
>
> I did another investigation about whether this level of checking is
> necessary. I think according to the letter of the SQL standard, the
> typmods must indeed match. It seems Oracle does not check (the example
> mentioned above came from an Oracle source). But I think it's okay to
> keep the check. In PostgreSQL, it is much less common to write like
> varchar(1000). And we can always consider relaxing it later.
+1.
Attached patch adds a couple more test statements.
>
> 2) I had it in my notes to consider whether we should support the colon
> syntax for label expressions. I think we might have talked about that
> before.
>
> I'm now leaning toward not supporting it in the first iteration. I
> don't know that we have fully explored possible conflicts with host
> variable syntax in ecpg and psql and the like. Maybe avoid that for now.
>
I was aware of ecpg and I vaguely remember we fixed something in ECPG
to allow : in MATCH statement. Probably following changes in
psqlscan.l and pgc.l
-self [,()\[\].;\:\+\-\*\/\%\^\<\>\=]
+self [,()\[\].;\:\|\+\-\*\/\%\^\<\>\=]
Those changes add | after : (and not the : itself) so maybe they are
not about supporting : . Do you remember what those are?
However, I see that : in psql can be a problem
#create table t1 (a int, b int);
#create property graph g1 vertex tables (t1 key (a));
#select * from GRAPH_TABLE (g1 MATCH (a :t1) COLUMNS (a.a));
a
---
(0 rows)
#\set t1 blank
#select * from GRAPH_TABLE (g1 MATCH (a :t1) COLUMNS (a.a));
ERROR: syntax error at or near "blank"
LINE 1: select * from GRAPH_TABLE (g1 MATCH (a blank) COLUMNS (a.a))...
> There was also a bit of an inconsistency in the presentation: The
> documentation introduced the colon as seemingly the preferred syntax,
> but ruleutils.c dumped out the IS syntax.
>
> (It was also a bit curious that some test code put spaces around the
> colon, which is not idiomatic.)
>
That might have been just me trying to type one letter less and yet
keeping the query readable. A column between variable name and the
label is not always noticeable.
> Attached is a patch that shows how to revert the colon support. It's
> pretty simple, and it would be easy to add it back in later, I think.
I agree that it's better not to support it now. It requires more work
and it's optional. When we come to support it, we will need thorough
testing.
I spotted some examples that use : in ddl.sgml.
<programlisting>
SELECT customer_name FROM GRAPH_TABLE (myshop MATCH
(c:customer)-[:has]->(o:"order" WHERE o.ordered_when = current_date)
COLUMNS (c.name AS customer_name));
</programlisting>
The query demonstrates that one can use label names in a way that will
make the pattern look like an English sentence. Replacing : with IS
defeats that purpose.
As written in that paragraph, the labels serve the purpose of exposing
the table with a different logical view (using different label and
property names). So we need that paragraph, but I think we should
change the example to use IS instead of :. Attached is suggested
minimal change, but I am not happy with it. Another possibility is we
completely remove that paragraph; I don't think we need to discuss all
possible usages the users will come up with.
The patch changes one more instance of : by IS. But that's straight forward.
In ddl.sgml I noticed a seemingly incomplete sentence
A property graph is a way to represent database contents, instead of using
relational structures such as tables.
Represent the contents as what? I feel the complete sentence should be
one of the following
property graph is a way to represent database contents as a graph,
instead of representing those as relational structures OR
property graph is another way to represent database contents instead
of using relational structures such as tables
But I can't figure out what was originally intended.
I have squashed 0002 from my earlier patchset and your 3 patches into 0001.
0002 has extra tests mentioned above. It also removes "TODO: dubious
error message" from a comment. I don't see anything dubious in the
error message. I think this patch is safe to be merged into 0001.
0003 is changed to ddl.sgml. Those need a bit more work as described above.
--
Best Wishes,
Ashutosh Bapat
Attachment
pgsql-hackers by date: