Thread: How hard would a "path" operator be to implement in PostgreSQL

How hard would a "path" operator be to implement in PostgreSQL

From
Chris Travers
Date:
Hi;

I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have that we don't is a path operator.  The idea is that you can use the path operator to follow some subset of foreign keys called refs.  

Suppose we have a table (Oracle/DB2 styles here):

CREATE TABLE country (
    REF IS OID USER GENERATED,
    id int serial not null unique,
    name text primary key,
    short_name text not null unique
);

CREATE TABLE address (
    street_text text not null,
    city text not null,
    state_province text.
    country REF(country)
);

In oracle this might allow you to do something like:

SELECT * from address where address.country.short_name = 'US'; 

In DB2 this might be done like:

SELECT * FROM address WHERE address->country->short_name = 'US';

I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the way the refs work as separate from standard foreign keys.

What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and retrieves the row from the other table, returning it, if the table name exists.   This could then be mapped to an operator which would avoid some problems.  This could then be:

SELECT * FROM address where (address->'country').short_name = 'US'

Or if we also have a continents table:

SELECT * FROM address where (address->'country'->'continent').name = 'North America';

Obviously these examples assume a very small number of address records and are largely contrived from the IBM examples.  However you could also do this on a small return set:

select (a->'country').short_name from address where ....;

Or even:

select (a->'country').* from address where......

The next question is whether there is a way to pass country in as an identifier so there is no need to use single quotes.  This would make things a little more transparent if possible but I would be happy without this.  Eventually it might be kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept of a default, implicit join.

Any thoughts?  If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that isn't so important and I expect that it is not.

Best Wishes,
Chris Traves

Re: How hard would a "path" operator be to implement in PostgreSQL

From
Ondrej Ivanič
Date:
Hi,

On 20 August 2012 11:28, Chris Travers <chris.travers@gmail.com> wrote:
> I have been reading up on object-relational features of Oracle and DB2 and
> found that one of the big things they have that we don't is a path operator.
> The idea is that you can use the path operator to follow some subset of
> foreign keys called refs.

<snip>

> Any thoughts?  If it can be done in plain SQL and inlined that would be
> ideal but in the prototyping state, that isn't so important and I expect
> that it is not.


You should look at ltree:
This module implements a data type ltree for representing labels of
data stored in a hierarchical tree-like structure. Extensive
facilities for searching through label trees are provided.

http://www.postgresql.org/docs/9.1/static/ltree.html


--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)


Re: How hard would a "path" operator be to implement in PostgreSQL

From
David Johnston
Date:
On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote:

> Hi;
>
> I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have
thatwe don't is a path operator.  The idea is that you can use the path operator to follow some subset of foreign keys
calledrefs.   

Why do you feel this is a "big thing".  Sure, you can save a few keystrokes when writing multi-table queries but that
doesn'tseem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these
implicitjoins.  I concur that introducing an explicit REF is a poor choice taken at face value since now you have to
rememberwhat references are present.  With FOREIGN KEYS you introduce a logical constraint but you are able to perform
anjoin between two tables independent of the presence of an FK. 

>
> Suppose we have a table (Oracle/DB2 styles here):
>
> CREATE TABLE country (
>     REF IS OID USER GENERATED,
>     id int serial not null unique,
>     name text primary key,
>     short_name text not null unique
> );
>
> CREATE TABLE address (
>     street_text text not null,
>     city text not null,
>     state_province text.
>     country REF(country)
> );
>
> In oracle this might allow you to do something like:
>
> SELECT * from address where address.country.short_name = 'US';
>
> In DB2 this might be done like:
>
> SELECT * FROM address WHERE address->country->short_name = 'US';
>
> I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the
waythe refs work as separate from standard foreign keys. 
>
> What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and
retrievesthe row from the other table, returning it, if the table name exists.   This could then be mapped to an
operatorwhich would avoid some problems.  This could then be: 
>
> SELECT * FROM address where (address->'country').short_name = 'US'

short for:

SELECT address.*
FROM address
JOIN country ON (country_id = country.id)
WHERE short_name = 'US';

depending on naming scheme a NATURAL JOIN or USING can shorten this a bit.

I guess there is some value in not making the columns from country available to the SELECT LIST...

Another form equates to...

SELECT * FROM address WHERE EXISTS (SELECT 1 FROM country WHERE id = country_id AND short_name = 'US';

This second form seems a more direct translation of the query.  At first glance this would seem to be a bad query to
writefor the given goal but maybe the planner does the right thing and so the two pure forms perform equivalently? 

>
> Or if we also have a continents table:
>
> SELECT * FROM address where (address->'country'->'continent').name = 'North America';
>
> Obviously these examples assume a very small number of address records and are largely contrived from the IBM
examples. However you could also do this on a small return set: 

As implied above if this only is going to generate meaningful plans on small datasets its value seems even more
limited...

>
> select (a->'country').short_name from address where ....;
>
> Or even:
>
> select (a->'country').* from address where......
>
> The next question is whether there is a way to pass country in as an identifier so there is no need to use single
quotes. This would make things a little more transparent if possible but I would be happy without this.  Eventually it
mightbe kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept
ofa default, implicit join. 
>
> Any thoughts?  If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that
isn'tso important and I expect that it is not. 

I am presuming this syntax and capability is not specified in the SQL standard anywhere?

I don't have a dog in the "make things easier to port" fight but this seems like something that would be fairly easy to
identifyin Oracle/DB2 SQL code and to rewrite using explicit joins without losing any actual functionality. 

Given that you come across this via "reading" it doesn't sound like you have actually been bitten by its lack nor are
havingpeople complain about its lack so I suppose you are arguing for its inclusion based upon its own merits in making
codingeasier. 

>
> Best Wishes,
> Chris Traves


Re: How hard would a "path" operator be to implement in PostgreSQL

From
Craig Ringer
Date:
On 08/20/2012 11:13 AM, David Johnston wrote:
> On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote:
>
>> Hi;
>>
>> I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they
havethat we don't is a path operator.  The idea is that you can use the path operator to follow some subset of foreign
keyscalled refs. 
> Why do you feel this is a "big thing".  Sure, you can save a few keystrokes when writing multi-table queries but that
doesn'tseem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these
implicitjoins.  I concur that introducing an explicit REF is a poor choice taken at face value since now you have to
rememberwhat references are present.  With FOREIGN KEYS you introduce a logical constraint but you are able to perform
anjoin between two tables independent of the presence of an FK. 

These things mostly become valuable when used with more complex
expressions. They can also hide a *lot* of expensive work being done
behind the scenes, though, making seemingly simple queries extremely
expensive.

Think:

SELECT
      parent->child11->child12 AS a,
      parent->child21->child22 AS b,
FROM parent;

instead of:

SELECT
     child12.a AS a,
     child22.b AS b
FROM parent
     LEFT OUTER JOIN child11 ON (child11.parent_id = parent.id)
     LEFT OUTER JOIN child12 ON (child12.child11_id = child11.id)
     LEFT OUTER JOIN child21 ON (child21.parent_id = parent.id)
     LEFT OUTER JOIN child22 ON (child22.child21_id = child21.id);


It's basically the same inversion model used by the HQL query language
of Hibernate, or the similar JPQL of JPA. They're quick and convenient,
but can get insanely expensive to execute when implemented as join sets.

To do this well, the DB really needs a query optimiser that's smart
enough to know when it should be JOINing vs when it should be issuing
subqueries or even using path-operator-specific plans.

The main area *I* find path operators appealing is when combined with
features like `json`, so whole graphs can be fetched and returned in
single queries. For example something like:

SELECT row_to_json(ROW(
     customer.*,
     array_agg(customer->address) AS addresses,
     array_agg(customer->contacts) AS contacts
     )) AS json_result
FROM customer;

would potentially help get rid of a huge amount of the ugliness ORMs
currently do with de-duplicating results after doing huge left outer
join chains.

--
Craig Ringer


Re: How hard would a "path" operator be to implement in PostgreSQL

From
Chris Travers
Date:
Hi David;

On Sun, Aug 19, 2012 at 8:13 PM, David Johnston <polobo@yahoo.com> wrote:
On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote:

> Hi;
>
> I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have that we don't is a path operator.  The idea is that you can use the path operator to follow some subset of foreign keys called refs.

Why do you feel this is a "big thing".  Sure, you can save a few keystrokes when writing multi-table queries but that doesn't seem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these implicit joins.  I concur that introducing an explicit REF is a poor choice taken at face value since now you have to remember what references are present.  With FOREIGN KEYS you introduce a logical constraint but you are able to perform an join between two tables independent of the presence of an FK.

One of the major areas I am pushing right now in my own work is the idea of data encapsulation in the db and the idea that database interfaces might be contained in the db sufficient to be discoverable at run-time or through code-generating tools.  This allows easy interop with software written in all sorts of languages.  You can just port the code generation tools to a new language and suddenly you get your database access in that language.

So yes, I do think it's a significant step on one aspect of what can be done with O-R functionality in the database (what you might think of as query-wrapping O-R, as opposed to intra-query O-R which is more developed on Pg).  There are some ways to get around it but they require a fair bit of extra work.  Being able to centralize this would be nice.

Imagine being able to have the application specify address info to the db, and get related records automatically.

I could flesh out the use cases I am thinking about in a bit more detail, but suppose we have a reverse path operator (this is  *not* what's in DB2 or Oracle but woudl be similar to implement):

select eca.(*) from (select company(?, ?, ?, ?, ?)<-credit_accounts) eca.

This could be done entirely by the code generator without any knowledge of underlying data structures.


>
> Suppose we have a table (Oracle/DB2 styles here):
>
> CREATE TABLE country (
>     REF IS OID USER GENERATED,
>     id int serial not null unique,
>     name text primary key,
>     short_name text not null unique
> );
>
> CREATE TABLE address (
>     street_text text not null,
>     city text not null,
>     state_province text.
>     country REF(country)
> );
>
> In oracle this might allow you to do something like:
>
> SELECT * from address where address.country.short_name = 'US';
>
> In DB2 this might be done like:
>
> SELECT * FROM address WHERE address->country->short_name = 'US';
>
> I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the way the refs work as separate from standard foreign keys.
>
> What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and retrieves the row from the other table, returning it, if the table name exists.   This could then be mapped to an operator which would avoid some problems.  This could then be:
>
> SELECT * FROM address where (address->'country').short_name = 'US'

short for:

SELECT address.*
FROM address
JOIN country ON (country_id = country.id)
WHERE short_name = 'US';

depending on naming scheme a NATURAL JOIN or USING can shorten this a bit.

I guess there is some value in not making the columns from country available to the SELECT LIST...

Another form equates to...

SELECT * FROM address WHERE EXISTS (SELECT 1 FROM country WHERE id = country_id AND short_name = 'US';

This second form seems a more direct translation of the query.  At first glance this would seem to be a bad query to write for the given goal but maybe the planner does the right thing and so the two pure forms perform equivalently?

>
> Or if we also have a continents table:
>
> SELECT * FROM address where (address->'country'->'continent').name = 'North America';
>
> Obviously these examples assume a very small number of address records and are largely contrived from the IBM examples.  However you could also do this on a small return set:

As implied above if this only is going to generate meaningful plans on small datasets its value seems even more limited...

At least for prototyping it might be useful to look at design patterns, and then see what can be done to make it more useful.  There isn't much sense in trying to figure everything out at once if the effort turns out to be largely wasted. 

>
> select (a->'country').short_name from address where ....;
>
> Or even:
>
> select (a->'country').* from address where......
>
> The next question is whether there is a way to pass country in as an identifier so there is no need to use single quotes.  This would make things a little more transparent if possible but I would be happy without this.  Eventually it might be kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept of a default, implicit join.
>
> Any thoughts?  If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that isn't so important and I expect that it is not.

I am presuming this syntax and capability is not specified in the SQL standard anywhere?

I don't have a dog in the "make things easier to port" fight but this seems like something that would be fairly easy to identify in Oracle/DB2 SQL code and to rewrite using explicit joins without losing any actual functionality.

Given that you come across this via "reading" it doesn't sound like you have actually been bitten by its lack nor are having people complain about its lack so I suppose you are arguing for its inclusion based upon its own merits in making coding easier.

 I suspect not a lot is being done with PostgreSQL in terms of the sorts of O-R capabilities it has, let alone those that other db's have added.....

Also note that the current workarounds are both maintenance-heavy and limited.  One could currently build stored procedures such that you could:

SELECT * FROM address WHERE ((address.country).continent).name = 'North America';

But then that's two nested function calls, and those functions need to be separately maintained.  The functions may be able to be inlined or not.  I don't know.  However if there was a way to do something like this through an operation, it might actually be easier.  This would make it more useful for intra-query O-R.

Best Wishes,
Chris Travers

Re: How hard would a "path" operator be to implement in PostgreSQL

From
Martijn van Oosterhout
Date:
On Sun, Aug 19, 2012 at 06:28:57PM -0700, Chris Travers wrote:
> In DB2 this might be done like:
>
> SELECT * FROM address WHERE address->country->short_name = 'US';
>
> I like DB2's approach better because there is no ambiguity between
> namespace resolution but I don't entirely like the way the refs work as
> separate from standard foreign keys.

I'm not sure I have an opinion on pushing ORM features to the database
layer, SQLAlchemy is doing a pretty good job for me already.

I just wanted to say that in the beginning I didn't understand
SQLAchemy's separation between foreign keys and relationships, but I
have since come to understand that a "relationship" is a far more
general concept than a foreign key.  There are many kinds of
relationships between entities in a database that can't be represented
by a foreign key, but can be by a general join condition.

A simple example might be "items in an order" which you could derive
from a foreign key, compared to "items in an order which have sales
tax" which is something more general.

So whatever the result of this discussion, don't just consider foreign
keys, think bigger.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

Attachment

Re: How hard would a "path" operator be to implement in PostgreSQL

From
Craig Ringer
Date:
On 08/21/2012 03:06 AM, Martijn van Oosterhout wrote:
> I'm not sure I have an opinion on pushing ORM features to the database
> layer, SQLAlchemy is doing a pretty good job for me already.

There are some things ORMs could really use help from the database with,
though. Particularly when fetching larger object graphs with deep
relationships. The classic ORM chained-left-join pattern leads to
*awful* explosions of join sizes, and huge amounts of duplicate data
being sent. The n+1 selects alternative is even worse.

> I just wanted to say that in the beginning I didn't understand
> SQLAchemy's separation between foreign keys and relationships, but I
> have since come to understand that a "relationship" is a far more
> general concept than a foreign key.  There are many kinds of
> relationships between entities in a database that can't be represented
> by a foreign key, but can be by a general join condition.

+1 on that, and the same is true with other object/relational work. A
relationship is broader than a foreign key.

--
Craig Ringer


Re: How hard would a "path" operator be to implement in PostgreSQL

From
Martijn van Oosterhout
Date:
On Tue, Aug 21, 2012 at 09:39:20AM +0800, Craig Ringer wrote:
> On 08/21/2012 03:06 AM, Martijn van Oosterhout wrote:
> >I'm not sure I have an opinion on pushing ORM features to the database
> >layer, SQLAlchemy is doing a pretty good job for me already.
>
> There are some things ORMs could really use help from the database
> with, though. Particularly when fetching larger object graphs with
> deep relationships. The classic ORM chained-left-join pattern leads
> to *awful* explosions of join sizes, and huge amounts of duplicate
> data being sent. The n+1 selects alternative is even worse.

Well, Postgres in principle supports arrays of records, so I've
wondered if a relationship join could stuff all the objects in a single
field of the response using an aggregate.  I think what's always
prevented this from working is that client would have to parse the
resulting output text output, which is practically impossible in the
face of custom types.

What seems more useful to me is working on returning multiple
resultsets, which could be interleaved by the server, so you could do
things like (syntax invented on the spot, essentially WITHs without an
actual query):

WITH order_result AS
   (SELECT * FROM orders WHERE interesting)
WITH widget_result AS
   (SELECT * FROM widgets WHERE widgets.order=order_result.id);

Here the server could perform the join and return both sides of the
join in seperate result sets.  But named, so they can refer to
eachother.  I suppose for backward compatability you'd have a master
result set with named children, otherwise the memory management gets
hairy.  And I have no idea if the BE/FE protocol can handle it, but it
would be useful, and I think easy for ORMs to use, since they can stuff
the user query in the first bit, and tack their relationship joins on
the end.

If the bare WITHs look like it might be ambiguous, you could make the
actual query something like:

...
RETURNS order_result, widget_result;

which might be better since it allows the original query to use WITHs
without interfering with the result.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

Attachment

Re: How hard would a "path" operator be to implement in PostgreSQL

From
Craig Ringer
Date:
On 08/21/2012 03:01 PM, Martijn van Oosterhout wrote:
> Well, Postgres in principle supports arrays of records, so I've
> wondered if a relationship join could stuff all the objects in a single
> field of the response using an aggregate.  I think what's always
> prevented this from working is that client would have to parse the
> resulting output text output, which is practically impossible in the
> face of custom types.

That's where the new JSON support is interesting; it provides a much
more commonly understood and easier to parse structured form for
results, so trees (but not more general graphs) can be returned.

> What seems more useful to me is working on returning multiple
> resultsets, which could be interleaved by the server, so you could do
> things like

That'd certainly be a nice option, but there's a big difference between
it and the other form: With multiple result sets, the client still has
to effectively join everything client side to work out the relationships
and build a graph or tree (usually an object graph).

On the upside, multiple result sets can be transformed into graphs,
where JSON can only represent simple trees without introducing the need
for cross reference resolution.

I like your notion of chaining common table expressions so you can
return intermediate CTs as result sets. That feels clean.

Currently many ORM systems (those that don't do horrible giant chained
left joins or n+1 selects) do follow-up queries that repeat much of the
work the 1st query did, eg:

SELECT a.*
FROM a INNER JOIN b ON (...) INNER JOIN c ON (...)
WHERE expensive_clause;

SELECT b.* FROM b WHERE b IN (
   SELECT a.b_id
   FROM a INNER JOIN b ON (...) INNER JOIN c ON (...)
   WHERE expensive_clause;
);

... and that's one of the *nicer* ways they execute queries.

Multiple result set support would be pretty handy for stored procs, too;
it's something people grumble about occasionally, though I've never
needed it and would just use refcursors if I did.

How do other DBs handle multiple result sets? Do they only support them
from stored procs?

> And I have no idea if the BE/FE protocol can handle it, but it
> would be useful, and I think easy for ORMs to use, since they can stuff
> the user query in the first bit, and tack their relationship joins on
> the end.

I suspect the BE/FE protocol would be a bit of an issue. That's part of
the reason I was thinking about the utility of the JSON support for
this, because with a few aggregate operators etc it'd be a fairly low
impact solution.

--
Craig Ringer



Re: How hard would a "path" operator be to implement in PostgreSQL

From
Chris Travers
Date:
Hi all;

So I found an interesting and relatively manageable way of doing this.

Suppose we have an inventory table:

CREATE TABLE inventory_item (
    id serial primary key,
    cogs_account_id int references account(id),
    inv_account_id int references account(id),
    income_account_id int references account(id),
    sku text not null,
    description text,
    last_cost numeric, -- null if never purchased
    sell_price numeric not null,
    active bool not null default true
);

Now we want to be able to add pointers to this table in other tables
without adding a lot of decentralized code.  So what we do is:


CREATE TABLE joins_inventory_item (
    inventory_item_id int
);

Then we create a table method function like:


CREATE FUNCTION inventory_item(joins_inventory_item) RETURNS inventory_item
LANGUAGE SQL AS $$
    SELECT * FROM inventory_item where id = $1.inventory_item_id;
$$;

Then any table which inherits joins_inventory_item gets a path back.
So for example:

CREATE TABLE inventory_barcode (
    barcode text primary key;
    FOREIGN KEY inventory_item_id REFERENCES inventory_item(id)
);

Then we can:

select (bc.inventory_item).sku FROM inventory_barcode bc WHERE barcode
= '12345';

Best Wishes,
Chris Travers


Re: How hard would a "path" operator be to implement in PostgreSQL

From
Merlin Moncure
Date:
On Tue, Aug 21, 2012 at 2:56 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 08/21/2012 03:01 PM, Martijn van Oosterhout wrote:
>>
>> Well, Postgres in principle supports arrays of records, so I've
>> wondered if a relationship join could stuff all the objects in a single
>> field of the response using an aggregate.  I think what's always
>> prevented this from working is that client would have to parse the
>> resulting output text output, which is practically impossible in the
>> face of custom types.
>
>
> That's where the new JSON support is interesting; it provides a much more
> commonly understood and easier to parse structured form for results, so
> trees (but not more general graphs) can be returned.

I'd go beyond 'interesting' and call it a complete game changer if you
are consuming data in a language that has good support for JSON
(especially javascript naturally).  Another good option if you're
consuming structured data in C is libpqtypes.

merlin