Thread: How hard would a "path" operator be to implement in PostgreSQL
Hi;
I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have that we don't is a path operator. The idea is that you can use the path operator to follow some subset of foreign keys called refs.
Suppose we have a table (Oracle/DB2 styles here):
CREATE TABLE country (
REF IS OID USER GENERATED,
id int serial not null unique,
name text primary key,
short_name text not null unique
);
CREATE TABLE address (
street_text text not null,
city text not null,
state_province text.
country REF(country)
);
In oracle this might allow you to do something like:
SELECT * from address where address.country.short_name = 'US';
In DB2 this might be done like:
SELECT * FROM address WHERE address->country->short_name = 'US';
I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the way the refs work as separate from standard foreign keys.
What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and retrieves the row from the other table, returning it, if the table name exists. This could then be mapped to an operator which would avoid some problems. This could then be:
SELECT * FROM address where (address->'country').short_name = 'US'
Or if we also have a continents table:
SELECT * FROM address where (address->'country'->'continent').name = 'North America';
Obviously these examples assume a very small number of address records and are largely contrived from the IBM examples. However you could also do this on a small return set:
select (a->'country').short_name from address where ....;
Or even:
select (a->'country').* from address where......
The next question is whether there is a way to pass country in as an identifier so there is no need to use single quotes. This would make things a little more transparent if possible but I would be happy without this. Eventually it might be kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept of a default, implicit join.
Any thoughts? If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that isn't so important and I expect that it is not.
Best Wishes,
Chris Traves
Hi, On 20 August 2012 11:28, Chris Travers <chris.travers@gmail.com> wrote: > I have been reading up on object-relational features of Oracle and DB2 and > found that one of the big things they have that we don't is a path operator. > The idea is that you can use the path operator to follow some subset of > foreign keys called refs. <snip> > Any thoughts? If it can be done in plain SQL and inlined that would be > ideal but in the prototyping state, that isn't so important and I expect > that it is not. You should look at ltree: This module implements a data type ltree for representing labels of data stored in a hierarchical tree-like structure. Extensive facilities for searching through label trees are provided. http://www.postgresql.org/docs/9.1/static/ltree.html -- Ondrej Ivanic (ondrej.ivanic@gmail.com)
On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote: > Hi; > > I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have thatwe don't is a path operator. The idea is that you can use the path operator to follow some subset of foreign keys calledrefs. Why do you feel this is a "big thing". Sure, you can save a few keystrokes when writing multi-table queries but that doesn'tseem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these implicitjoins. I concur that introducing an explicit REF is a poor choice taken at face value since now you have to rememberwhat references are present. With FOREIGN KEYS you introduce a logical constraint but you are able to perform anjoin between two tables independent of the presence of an FK. > > Suppose we have a table (Oracle/DB2 styles here): > > CREATE TABLE country ( > REF IS OID USER GENERATED, > id int serial not null unique, > name text primary key, > short_name text not null unique > ); > > CREATE TABLE address ( > street_text text not null, > city text not null, > state_province text. > country REF(country) > ); > > In oracle this might allow you to do something like: > > SELECT * from address where address.country.short_name = 'US'; > > In DB2 this might be done like: > > SELECT * FROM address WHERE address->country->short_name = 'US'; > > I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the waythe refs work as separate from standard foreign keys. > > What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and retrievesthe row from the other table, returning it, if the table name exists. This could then be mapped to an operatorwhich would avoid some problems. This could then be: > > SELECT * FROM address where (address->'country').short_name = 'US' short for: SELECT address.* FROM address JOIN country ON (country_id = country.id) WHERE short_name = 'US'; depending on naming scheme a NATURAL JOIN or USING can shorten this a bit. I guess there is some value in not making the columns from country available to the SELECT LIST... Another form equates to... SELECT * FROM address WHERE EXISTS (SELECT 1 FROM country WHERE id = country_id AND short_name = 'US'; This second form seems a more direct translation of the query. At first glance this would seem to be a bad query to writefor the given goal but maybe the planner does the right thing and so the two pure forms perform equivalently? > > Or if we also have a continents table: > > SELECT * FROM address where (address->'country'->'continent').name = 'North America'; > > Obviously these examples assume a very small number of address records and are largely contrived from the IBM examples. However you could also do this on a small return set: As implied above if this only is going to generate meaningful plans on small datasets its value seems even more limited... > > select (a->'country').short_name from address where ....; > > Or even: > > select (a->'country').* from address where...... > > The next question is whether there is a way to pass country in as an identifier so there is no need to use single quotes. This would make things a little more transparent if possible but I would be happy without this. Eventually it mightbe kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept ofa default, implicit join. > > Any thoughts? If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that isn'tso important and I expect that it is not. I am presuming this syntax and capability is not specified in the SQL standard anywhere? I don't have a dog in the "make things easier to port" fight but this seems like something that would be fairly easy to identifyin Oracle/DB2 SQL code and to rewrite using explicit joins without losing any actual functionality. Given that you come across this via "reading" it doesn't sound like you have actually been bitten by its lack nor are havingpeople complain about its lack so I suppose you are arguing for its inclusion based upon its own merits in making codingeasier. > > Best Wishes, > Chris Traves
On 08/20/2012 11:13 AM, David Johnston wrote: > On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote: > >> Hi; >> >> I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they havethat we don't is a path operator. The idea is that you can use the path operator to follow some subset of foreign keyscalled refs. > Why do you feel this is a "big thing". Sure, you can save a few keystrokes when writing multi-table queries but that doesn'tseem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these implicitjoins. I concur that introducing an explicit REF is a poor choice taken at face value since now you have to rememberwhat references are present. With FOREIGN KEYS you introduce a logical constraint but you are able to perform anjoin between two tables independent of the presence of an FK. These things mostly become valuable when used with more complex expressions. They can also hide a *lot* of expensive work being done behind the scenes, though, making seemingly simple queries extremely expensive. Think: SELECT parent->child11->child12 AS a, parent->child21->child22 AS b, FROM parent; instead of: SELECT child12.a AS a, child22.b AS b FROM parent LEFT OUTER JOIN child11 ON (child11.parent_id = parent.id) LEFT OUTER JOIN child12 ON (child12.child11_id = child11.id) LEFT OUTER JOIN child21 ON (child21.parent_id = parent.id) LEFT OUTER JOIN child22 ON (child22.child21_id = child21.id); It's basically the same inversion model used by the HQL query language of Hibernate, or the similar JPQL of JPA. They're quick and convenient, but can get insanely expensive to execute when implemented as join sets. To do this well, the DB really needs a query optimiser that's smart enough to know when it should be JOINing vs when it should be issuing subqueries or even using path-operator-specific plans. The main area *I* find path operators appealing is when combined with features like `json`, so whole graphs can be fetched and returned in single queries. For example something like: SELECT row_to_json(ROW( customer.*, array_agg(customer->address) AS addresses, array_agg(customer->contacts) AS contacts )) AS json_result FROM customer; would potentially help get rid of a huge amount of the ugliness ORMs currently do with de-duplicating results after doing huge left outer join chains. -- Craig Ringer
Hi David;
On Sun, Aug 19, 2012 at 8:13 PM, David Johnston <polobo@yahoo.com> wrote:
On Aug 19, 2012, at 21:28, Chris Travers <chris.travers@gmail.com> wrote:Why do you feel this is a "big thing". Sure, you can save a few keystrokes when writing multi-table queries but that doesn't seem all that great and now you are introducing ambiguity into the planner/query when trying to resolve these implicit joins. I concur that introducing an explicit REF is a poor choice taken at face value since now you have to remember what references are present. With FOREIGN KEYS you introduce a logical constraint but you are able to perform an join between two tables independent of the presence of an FK.
> Hi;
>
> I have been reading up on object-relational features of Oracle and DB2 and found that one of the big things they have that we don't is a path operator. The idea is that you can use the path operator to follow some subset of foreign keys called refs.
One of the major areas I am pushing right now in my own work is the idea of data encapsulation in the db and the idea that database interfaces might be contained in the db sufficient to be discoverable at run-time or through code-generating tools. This allows easy interop with software written in all sorts of languages. You can just port the code generation tools to a new language and suddenly you get your database access in that language.
So yes, I do think it's a significant step on one aspect of what can be done with O-R functionality in the database (what you might think of as query-wrapping O-R, as opposed to intra-query O-R which is more developed on Pg). There are some ways to get around it but they require a fair bit of extra work. Being able to centralize this would be nice.
Imagine being able to have the application specify address info to the db, and get related records automatically.
I could flesh out the use cases I am thinking about in a bit more detail, but suppose we have a reverse path operator (this is *not* what's in DB2 or Oracle but woudl be similar to implement):
select eca.(*) from (select company(?, ?, ?, ?, ?)<-credit_accounts) eca.
This could be done entirely by the code generator without any knowledge of underlying data structures.
short for:
>
> Suppose we have a table (Oracle/DB2 styles here):
>
> CREATE TABLE country (
> REF IS OID USER GENERATED,
> id int serial not null unique,
> name text primary key,
> short_name text not null unique
> );
>
> CREATE TABLE address (
> street_text text not null,
> city text not null,
> state_province text.
> country REF(country)
> );
>
> In oracle this might allow you to do something like:
>
> SELECT * from address where address.country.short_name = 'US';
>
> In DB2 this might be done like:
>
> SELECT * FROM address WHERE address->country->short_name = 'US';
>
> I like DB2's approach better because there is no ambiguity between namespace resolution but I don't entirely like the way the refs work as separate from standard foreign keys.
>
> What I am thinking about is a function which accepts a row and a destination table name, looks up foreign keys in and retrieves the row from the other table, returning it, if the table name exists. This could then be mapped to an operator which would avoid some problems. This could then be:
>
> SELECT * FROM address where (address->'country').short_name = 'US'
SELECT address.*
FROM address
JOIN country ON (country_id = country.id)
WHERE short_name = 'US';
depending on naming scheme a NATURAL JOIN or USING can shorten this a bit.
I guess there is some value in not making the columns from country available to the SELECT LIST...
Another form equates to...
SELECT * FROM address WHERE EXISTS (SELECT 1 FROM country WHERE id = country_id AND short_name = 'US';
This second form seems a more direct translation of the query. At first glance this would seem to be a bad query to write for the given goal but maybe the planner does the right thing and so the two pure forms perform equivalently?As implied above if this only is going to generate meaningful plans on small datasets its value seems even more limited...
>
> Or if we also have a continents table:
>
> SELECT * FROM address where (address->'country'->'continent').name = 'North America';
>
> Obviously these examples assume a very small number of address records and are largely contrived from the IBM examples. However you could also do this on a small return set:
At least for prototyping it might be useful to look at design patterns, and then see what can be done to make it more useful. There isn't much sense in trying to figure everything out at once if the effort turns out to be largely wasted.
I am presuming this syntax and capability is not specified in the SQL standard anywhere?
>
> select (a->'country').short_name from address where ....;
>
> Or even:
>
> select (a->'country').* from address where......
>
> The next question is whether there is a way to pass country in as an identifier so there is no need to use single quotes. This would make things a little more transparent if possible but I would be happy without this. Eventually it might be kinda useful (for those porting O-R stuff from Oracle or DB2) to have a path operator built in with a concept of a default, implicit join.
>
> Any thoughts? If it can be done in plain SQL and inlined that would be ideal but in the prototyping state, that isn't so important and I expect that it is not.
I don't have a dog in the "make things easier to port" fight but this seems like something that would be fairly easy to identify in Oracle/DB2 SQL code and to rewrite using explicit joins without losing any actual functionality.
Given that you come across this via "reading" it doesn't sound like you have actually been bitten by its lack nor are having people complain about its lack so I suppose you are arguing for its inclusion based upon its own merits in making coding easier.
I suspect not a lot is being done with PostgreSQL in terms of the sorts of O-R capabilities it has, let alone those that other db's have added.....
Also note that the current workarounds are both maintenance-heavy and limited. One could currently build stored procedures such that you could:
SELECT * FROM address WHERE ((address.country).continent).name = 'North America';
But then that's two nested function calls, and those functions need to be separately maintained. The functions may be able to be inlined or not. I don't know. However if there was a way to do something like this through an operation, it might actually be easier. This would make it more useful for intra-query O-R.
Best Wishes,
Chris Travers
Re: How hard would a "path" operator be to implement in PostgreSQL
From
Martijn van Oosterhout
Date:
On Sun, Aug 19, 2012 at 06:28:57PM -0700, Chris Travers wrote: > In DB2 this might be done like: > > SELECT * FROM address WHERE address->country->short_name = 'US'; > > I like DB2's approach better because there is no ambiguity between > namespace resolution but I don't entirely like the way the refs work as > separate from standard foreign keys. I'm not sure I have an opinion on pushing ORM features to the database layer, SQLAlchemy is doing a pretty good job for me already. I just wanted to say that in the beginning I didn't understand SQLAchemy's separation between foreign keys and relationships, but I have since come to understand that a "relationship" is a far more general concept than a foreign key. There are many kinds of relationships between entities in a database that can't be represented by a foreign key, but can be by a general join condition. A simple example might be "items in an order" which you could derive from a foreign key, compared to "items in an order which have sales tax" which is something more general. So whatever the result of this discussion, don't just consider foreign keys, think bigger. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Attachment
On 08/21/2012 03:06 AM, Martijn van Oosterhout wrote: > I'm not sure I have an opinion on pushing ORM features to the database > layer, SQLAlchemy is doing a pretty good job for me already. There are some things ORMs could really use help from the database with, though. Particularly when fetching larger object graphs with deep relationships. The classic ORM chained-left-join pattern leads to *awful* explosions of join sizes, and huge amounts of duplicate data being sent. The n+1 selects alternative is even worse. > I just wanted to say that in the beginning I didn't understand > SQLAchemy's separation between foreign keys and relationships, but I > have since come to understand that a "relationship" is a far more > general concept than a foreign key. There are many kinds of > relationships between entities in a database that can't be represented > by a foreign key, but can be by a general join condition. +1 on that, and the same is true with other object/relational work. A relationship is broader than a foreign key. -- Craig Ringer
Re: How hard would a "path" operator be to implement in PostgreSQL
From
Martijn van Oosterhout
Date:
On Tue, Aug 21, 2012 at 09:39:20AM +0800, Craig Ringer wrote: > On 08/21/2012 03:06 AM, Martijn van Oosterhout wrote: > >I'm not sure I have an opinion on pushing ORM features to the database > >layer, SQLAlchemy is doing a pretty good job for me already. > > There are some things ORMs could really use help from the database > with, though. Particularly when fetching larger object graphs with > deep relationships. The classic ORM chained-left-join pattern leads > to *awful* explosions of join sizes, and huge amounts of duplicate > data being sent. The n+1 selects alternative is even worse. Well, Postgres in principle supports arrays of records, so I've wondered if a relationship join could stuff all the objects in a single field of the response using an aggregate. I think what's always prevented this from working is that client would have to parse the resulting output text output, which is practically impossible in the face of custom types. What seems more useful to me is working on returning multiple resultsets, which could be interleaved by the server, so you could do things like (syntax invented on the spot, essentially WITHs without an actual query): WITH order_result AS (SELECT * FROM orders WHERE interesting) WITH widget_result AS (SELECT * FROM widgets WHERE widgets.order=order_result.id); Here the server could perform the join and return both sides of the join in seperate result sets. But named, so they can refer to eachother. I suppose for backward compatability you'd have a master result set with named children, otherwise the memory management gets hairy. And I have no idea if the BE/FE protocol can handle it, but it would be useful, and I think easy for ORMs to use, since they can stuff the user query in the first bit, and tack their relationship joins on the end. If the bare WITHs look like it might be ambiguous, you could make the actual query something like: ... RETURNS order_result, widget_result; which might be better since it allows the original query to use WITHs without interfering with the result. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Attachment
On 08/21/2012 03:01 PM, Martijn van Oosterhout wrote: > Well, Postgres in principle supports arrays of records, so I've > wondered if a relationship join could stuff all the objects in a single > field of the response using an aggregate. I think what's always > prevented this from working is that client would have to parse the > resulting output text output, which is practically impossible in the > face of custom types. That's where the new JSON support is interesting; it provides a much more commonly understood and easier to parse structured form for results, so trees (but not more general graphs) can be returned. > What seems more useful to me is working on returning multiple > resultsets, which could be interleaved by the server, so you could do > things like That'd certainly be a nice option, but there's a big difference between it and the other form: With multiple result sets, the client still has to effectively join everything client side to work out the relationships and build a graph or tree (usually an object graph). On the upside, multiple result sets can be transformed into graphs, where JSON can only represent simple trees without introducing the need for cross reference resolution. I like your notion of chaining common table expressions so you can return intermediate CTs as result sets. That feels clean. Currently many ORM systems (those that don't do horrible giant chained left joins or n+1 selects) do follow-up queries that repeat much of the work the 1st query did, eg: SELECT a.* FROM a INNER JOIN b ON (...) INNER JOIN c ON (...) WHERE expensive_clause; SELECT b.* FROM b WHERE b IN ( SELECT a.b_id FROM a INNER JOIN b ON (...) INNER JOIN c ON (...) WHERE expensive_clause; ); ... and that's one of the *nicer* ways they execute queries. Multiple result set support would be pretty handy for stored procs, too; it's something people grumble about occasionally, though I've never needed it and would just use refcursors if I did. How do other DBs handle multiple result sets? Do they only support them from stored procs? > And I have no idea if the BE/FE protocol can handle it, but it > would be useful, and I think easy for ORMs to use, since they can stuff > the user query in the first bit, and tack their relationship joins on > the end. I suspect the BE/FE protocol would be a bit of an issue. That's part of the reason I was thinking about the utility of the JSON support for this, because with a few aggregate operators etc it'd be a fairly low impact solution. -- Craig Ringer
Hi all; So I found an interesting and relatively manageable way of doing this. Suppose we have an inventory table: CREATE TABLE inventory_item ( id serial primary key, cogs_account_id int references account(id), inv_account_id int references account(id), income_account_id int references account(id), sku text not null, description text, last_cost numeric, -- null if never purchased sell_price numeric not null, active bool not null default true ); Now we want to be able to add pointers to this table in other tables without adding a lot of decentralized code. So what we do is: CREATE TABLE joins_inventory_item ( inventory_item_id int ); Then we create a table method function like: CREATE FUNCTION inventory_item(joins_inventory_item) RETURNS inventory_item LANGUAGE SQL AS $$ SELECT * FROM inventory_item where id = $1.inventory_item_id; $$; Then any table which inherits joins_inventory_item gets a path back. So for example: CREATE TABLE inventory_barcode ( barcode text primary key; FOREIGN KEY inventory_item_id REFERENCES inventory_item(id) ); Then we can: select (bc.inventory_item).sku FROM inventory_barcode bc WHERE barcode = '12345'; Best Wishes, Chris Travers
On Tue, Aug 21, 2012 at 2:56 AM, Craig Ringer <ringerc@ringerc.id.au> wrote: > On 08/21/2012 03:01 PM, Martijn van Oosterhout wrote: >> >> Well, Postgres in principle supports arrays of records, so I've >> wondered if a relationship join could stuff all the objects in a single >> field of the response using an aggregate. I think what's always >> prevented this from working is that client would have to parse the >> resulting output text output, which is practically impossible in the >> face of custom types. > > > That's where the new JSON support is interesting; it provides a much more > commonly understood and easier to parse structured form for results, so > trees (but not more general graphs) can be returned. I'd go beyond 'interesting' and call it a complete game changer if you are consuming data in a language that has good support for JSON (especially javascript naturally). Another good option if you're consuming structured data in C is libpqtypes. merlin