Thread: [HACKERS] [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?

Hello,

Please forgive me for asking such a stupid and rough question.

I'm thinking of making PostgreSQL a multi-model database by supporting
data models other than the current relational model.  A data model
consists of a query language (e.g. SQL for relational model, Cypher
for graph model), a parser and analyzer to transform a query into a
query tree, a planner to transform the query tree into an execution
plan, an executor, and a storage engine.

To promote data model development, I want to make data models
pluggable.  The rough sketch is:

1) A data model developer implements the parser, analyzer,
transformer, planner, executor, and storage engine functions in a
shared library.

2) The DBA registers the data model.
  CREATE QUERY LANGUAGE Cypher (
    PARSER = <parser function>
  );

  CREATE DATA MODEL graph (
    QUERY LANGUAGE = Cypher,
    ANALYZER = <analyzer function>,
    TRANSFORMER = <transformer function>,
    PLANNER = <planner function>,
    EXECUTOR = <executor function>,
    STORAGE ENGINE = <storage engine function>
  );

  CREATE REGION cypher_graph (
    QUERY LANGUAGE = Cypher,
    DATA MODEL = graph
  );

The region is just a combination of a query language and a data model,
much like a locale is a combination of a language and a country.  This
is because there may be multiple popular query languages for a data
model.

3) The application connects to the database, specifying a desired
region.  The specified region's query language becomes the default
query language for the session.


The application can use the data of multiple data models in one query
by specifying another region and its query via in_region().  For
example, the following query combines the relational restaurant table
and a social graph to get the five Chinese restaurants in Tokyo that
are most popular among John's friends and their friends.
  SELECT r.name, g.num_likers
  FROM restaurant r,
    cast_region(
      in_region('cypher_graph',
        'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
          (friend)-[:LIKES]->(restaurant:Restaurant)
          RETURN restaurant.name, count(*)'),
    'relational', 'g', '(name text, num_likers int)')
  WHERE r.name = g.name AND
    r.city = 'Tokyo' AND r.cuisine = 'chinese'
  ORDER BY g.num_likers DESC LIMIT 5;


What do you think would be difficult about making data models
pluggable, especially plugging in the parser, planner, executor, etc.?
One possible concern is that various PostgreSQL components might be
too dependent on the data model being relational, and that it would be
difficult to untangle the tight coupling.

Regards
MauMau






On Sat, Aug 19, 2017 at 4:29 PM, MauMau <maumau307@gmail.com> wrote:
Hello,

Please forgive me for asking such a stupid and rough question.

I'm thinking of making PostgreSQL a multi-model database by supporting
data models other than the current relational model.  A data model
consists of a query language (e.g. SQL for relational model, Cypher
for graph model), a parser and analyzer to transform a query into a
query tree, a planner to transform the query tree into an execution
plan, an executor, and a storage engine.

To promote data model development, I want to make data models
pluggable.  The rough sketch is:

1) A data model developer implements the parser, analyzer,
transformer, planner, executor, and storage engine functions in a
shared library.

2) The DBA registers the data model.

  CREATE QUERY LANGUAGE Cypher (
    PARSER = <parser function>
  );

  CREATE DATA MODEL graph (
    QUERY LANGUAGE = Cypher,
    ANALYZER = <analyzer function>,
    TRANSFORMER = <transformer function>,
    PLANNER = <planner function>,
    EXECUTOR = <executor function>,
    STORAGE ENGINE = <storage engine function>
  );

  CREATE REGION cypher_graph (
    QUERY LANGUAGE = Cypher,
    DATA MODEL = graph
  );

Why can't you do all this in a language handler and treat it as a user-defined function?

The region is just a combination of a query language and a data model,
much like a locale is a combination of a language and a country.  This
is because there may be multiple popular query languages for a data
model.

3) The application connects to the database, specifying a desired
region.  The specified region's query language becomes the default
query language for the session.


The application can use the data of multiple data models in one query
by specifying another region and its query via in_region().  For
example, the following query combines the relational restaurant table
and a social graph to get the five Chinese restaurants in Tokyo that
are most popular among John's friends and their friends.

  SELECT r.name, g.num_likers
  FROM restaurant r,
    cast_region(
      in_region('cypher_graph',
        'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
          (friend)-[:LIKES]->(restaurant:Restaurant)
          RETURN restaurant.name, count(*)'),
    'relational', 'g', '(name text, num_likers int)')
  WHERE r.name = g.name AND
    r.city = 'Tokyo' AND r.cuisine = 'chinese'
  ORDER BY g.num_likers DESC LIMIT 5;

If you have a language handler for cypher, why do you need in_region or cast_region?  Why not just have a graph_search() function which takes in a cypher query and returns a set of records?
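
For concreteness, here is a minimal sketch of that approach as a set-returning C function, assuming a hypothetical extension; graph_search and its library are illustrative names, not an existing API:

  CREATE FUNCTION graph_search(cypher_query text)
  RETURNS SETOF record
  AS 'MODULE_PATHNAME', 'graph_search'
  LANGUAGE C STRICT;

  -- The caller supplies the output column list, as with any function
  -- returning untyped records:
  SELECT * FROM graph_search(
    'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(friend),
       (friend)-[:LIKES]->(restaurant:Restaurant)
     RETURN restaurant.name, count(*)'
  ) AS g(name text, num_likers int);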


What do you think would be difficult about making data models
pluggable, especially plugging in the parser, planner, executor, etc.?
One possible concern is that various PostgreSQL components might be
too dependent on the data model being relational, and that it would be
difficult to untangle the tight coupling.

I guess I don't see why the current language handler structure is not enough.  Maybe I am missing something?
 

Regards
MauMau






--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

"MauMau" <maumau307@gmail.com> writes:
> I'm thinking of making PostgreSQL a multi-model database by supporting
> data models other than the current relational model.  A data model
> consists of a query language (e.g. SQL for relational model, Cypher
> for graph model), a parser and analyzer to transform a query into a
> query tree, a planner to transform the query tree into an execution
> plan, an executor, and a storage engine.

It sounds like what you want is to replace all of Postgres except
the name.  I'm not clear on the point.
        regards, tom lane



From: Chris Travers
> Why can't you do all this in a language handler and treat it as a
> user-defined function?
> ...
> If you have a language handler for cypher, why do you need in_region
> or cast_region?  Why not just have a graph_search() function which
> takes in a cypher query and returns a set of records?

The language handler is for *stored* functions.  The user-defined
function (UDF) doesn't participate in the planning of the outer
(top-level) query.  And they both assume that they are executed in SQL
commands.

I want the data models to meet these:

1) The query language can be used as a top-level session language.
For example, if an app specifies "region=cypher_graph" at database
connection, it can use the database as a graph database and submit
Cypher queries without embedding them in SQL.

2) When a query contains multiple query fragments of different data
models, all those fragments are parsed and planned before execution.
The planner comes up with the best plan, crossing the data model
boundary.  Take the query example in my first mail, which joins a
relational table and the result of a graph query.  The relational
planner considers how to scan the table, the graph planner considers
how to search the graph, and the relational planner considers how to
join the two fragments.

So in_region() and cast_region() are not functions to be executed
during the execution phase, but syntax constructs that are converted,
during the analysis phase, into calls to another region's
parser/analyzer and an inter-model cast routine.

1. The relational parser finds in_region('cypher_graph', 'graph
query') and produces a parse node InRegion(region_name, query) in the
parse tree.

2. The relational analyzer looks up the system catalog to check whether
the specified region exists, then calls its parser/analyzer to produce
the query tree for the graph query fragment.  The relational analyzer
attaches the graph query tree to the InRegion node.

3. When the relational planner finds the graph query tree, it passes
the graph query tree to the graph planner to produce the graph
execution plan.

4. The relational planner produces a join plan node, based on the
costs/statistics of the relational table scan and graph query.  The
graph execution plan is attached to the join plan node.

The parse/query/plan nodes carry a label denoting their region, so that
the appropriate region's routines can be called.
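
To visualize where those four steps end up, here is a hypothetical plan shape for the example query; the GraphScan node name and region label are purely illustrative:

  Limit
    -> Sort (key: g.num_likers DESC)
      -> Hash Join (cond: r.name = g.name)
        -> Seq Scan on restaurant r
             (filter: city = 'Tokyo' AND cuisine = 'chinese')
        -> GraphScan [region: cypher_graph]
             (attached graph execution plan from the graph planner)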

Regards
MauMau




On 20 August 2017 at 10:10, MauMau <maumau307@gmail.com> wrote:
From: Chris Travers
> Why can't you do all this in a language handler and treat it as a
> user-defined function?
> ...
> If you have a language handler for cypher, why do you need in_region
> or cast_region?  Why not just have a graph_search() function which
> takes in a cypher query and returns a set of records?

The language handler is for *stored* functions.  The user-defined
function (UDF) doesn't participate in the planning of the outer
(top-level) query.  And they both assume that they are executed in SQL
commands.

While I generally agree with Tom on this, I think there are some useful ideas to examine.

Allow a UDF to emit multiple result sets that can then be incorporated into an outer query. IMO it'd be fine to support this by returning a wide row of REFCURSORs and then allowing FETCH to be used in a subquery.

The UDF would need to be invoked before the rest of the query was planned, so the planner could learn the structure of the cursor's result sets.
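
A minimal sketch of how far today's refcursor machinery already goes; the function, cursor, and view names are illustrative:

  CREATE FUNCTION open_friend_likes() RETURNS refcursor AS $$
  DECLARE
    c refcursor := 'friend_likes';
  BEGIN
    -- friend_likes_view stands in for whatever produces the result set
    OPEN c FOR SELECT name, num_likers FROM friend_likes_view;
    RETURN c;
  END;
  $$ LANGUAGE plpgsql;

  -- Today the FETCH must be a separate top-level statement:
  BEGIN;
  SELECT open_friend_likes();
  FETCH ALL FROM friend_likes;
  COMMIT;

The missing piece is letting that FETCH appear in a subquery with an explicit column list, e.g. ... FROM (FETCH ALL FROM friend_likes) AS g(name text, num_likers int), so the outer query can still be planned.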

Or some higher level concept could be introduced, like it was for aggregates and window functions, where one call can be made to get the output structure and some stats estimates, and another call (or series) to get the rows.

I guess you're going two steps further than that, seeking a more integrated model where the plugin can generate paths and participate more actively in planning, and where you can optionally make it the default so you don't need a SQL function call to access it.

If you want to pursue that, I suggest you start small and go step-by-step. Things like:

* Allow FETCH ... <refcursor> to be used in subqueries with explicitly listed output relation structure, like calling a function that returns record

* Allow pre-execution of parts of a query that produce refcursors used in subqueries, then finish planning the outer query once the cursor output types are known

* A construct that can inject arbitrary virtual relations into the namespace at parse-time, so you don't have to do the dance with refcursors. (Like WITH).

* Construct that can supply stats estimates for the virtual relations

So try to build it in stages.

You could also potentially use the FDW interface.
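
For instance, with a hypothetical cypher_fdw (purely illustrative; no such extension is implied), a graph query could be exposed as a foreign table, and the FDW could feed row estimates to the planner through its GetForeignRelSize callback:

  CREATE EXTENSION cypher_fdw;  -- hypothetical
  CREATE SERVER graph_srv FOREIGN DATA WRAPPER cypher_fdw
    OPTIONS (host 'localhost', port '7687');

  CREATE FOREIGN TABLE friend_likes (name text, num_likers int)
    SERVER graph_srv
    OPTIONS (cypher 'MATCH (:Person {name:"John"})-[:IS_FRIEND_OF*1..2]-(f),
                     (f)-[:LIKES]->(r:Restaurant)
                     RETURN r.name, count(*)');

  SELECT r.name, g.num_likers
  FROM restaurant r JOIN friend_likes g ON r.name = g.name
  WHERE r.city = 'Tokyo' AND r.cuisine = 'chinese'
  ORDER BY g.num_likers DESC LIMIT 5;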
 
I want the data models to meet these:

1) The query language can be used as a top-level session language.
For example, if an app specifies "region=cypher_graph" at database
connection, it can use the database as a graph database and submit
Cypher queries without embedding them in SQL.

Why? What does this offer over the app or client tool wrapping its queries in "SELECT cypher_graph('....')"?
 
2) When a query contains multiple query fragments of different data
models, all those fragments are parsed and planned before execution.
The planner comes up with the best plan, crossing the data model
boundary.  Take the query example in my first mail, which joins a
relational table and the result of a graph query.  The relational
planner considers how to scan the table, the graph planner considers
how to search the graph, and the relational planner considers how to
join the two fragments.

Here, what you need is a way to define a set of virtual relations on a per-query basis, where you can get stats estimates for the relations during planning.

I guess what you're imagining is something more sophisticated where you're generating some kind of sub-plan candidates, like the path model. With some kind of interaction so the sub-planner for the other model could know to generate a different sub-plan based on the context of the outer plan. I have no idea how that could work. But I think you have about zero chance of achieving what you want by going straight there. Focus on small incremental steps, preferably ones you can find other uses for too.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Sun, Aug 20, 2017 at 4:10 AM, MauMau <maumau307@gmail.com> wrote:
From: Chris Travers
> Why can't you do all this in a language handler and treat it as a
> user-defined function?
> ...
> If you have a language handler for cypher, why do you need in_region
> or cast_region?  Why not just have a graph_search() function which
> takes in a cypher query and returns a set of records?

The language handler is for *stored* functions.  The user-defined
function (UDF) doesn't participate in the planning of the outer
(top-level) query.  And they both assume that they are executed in SQL
commands.

Sure, but stored functions can take arguments, such as a query string which gets handled by the language handler.  There's absolutely no reason you cannot declare a function in C that takes in a Cypher query and returns a set of tuples.  And you can do a whole lot with preloaded shared libraries if you need to.

The planning bit is more difficult, but see below as to where I see major limits here.

I want the data models to meet these:

1) The query language can be used as a top-level session language.
For example, if an app specifies "region=cypher_graph" at database
connection, it can use the database as a graph database and submit
Cypher queries without embedding them in SQL.

That sounds like a footgun.  I would probably think of those cases as being ideal for a custom background worker, similar to Mongress.  Expecting to be able to switch query languages on the fly strikes me as adding totally needless complexity everywhere, to be honest.  Having different listeners on different ports simplifies this a lot, and additionally having query languages for ad-hoc mixing via language handlers might get you most of what you want already.

2) When a query contains multiple query fragments of different data
models, all those fragments are parsed and planned before execution.
The planner comes up with the best plan, crossing the data model
boundary.  Take the query example in my first mail, which joins a
relational table and the result of a graph query.  The relational
planner considers how to scan the table, the graph planner considers
how to search the graph, and the relational planner considers how to
join the two fragments.

It seems like all you really need is a planner hook for user-defined languages (i.e. "how many rows does this function return with these parameters", right?).  Right now we allow hints but they are static.  I wonder how hard this would be using preloaded shared libraries.
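
Today the closest thing is the static ROWS and COST clauses of CREATE FUNCTION; a per-call estimate derived from the actual query string is exactly the missing piece.  Reusing the hypothetical graph_search() from earlier in the thread:

  CREATE FUNCTION graph_search(cypher_query text)
  RETURNS SETOF record
  AS 'MODULE_PATHNAME', 'graph_search'
  LANGUAGE C STRICT
  COST 1000   -- static per-call cost estimate
  ROWS 500;   -- static row estimate, regardless of the query string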
 

So in_region() and cast_region() are not functions to be executed
during the execution phase, but syntax constructs that are converted,
during the analysis phase, into calls to another region's
parser/analyzer and an inter-model cast routine.

So basically they work like immutable functions except that you cannot index the output?

1. The relational parser finds in_region('cypher_graph', 'graph
query') and produces a parse node InRegion(region_name, query) in the
parse tree.

2. The relational analyzer looks up the system catalog to check whether
the specified region exists, then calls its parser/analyzer to produce
the query tree for the graph query fragment.  The relational analyzer
attaches the graph query tree to the InRegion node.

3. When the relational planner finds the graph query tree, it passes
the graph query tree to the graph planner to produce the graph
execution plan.

4. The relational planner produces a join plan node, based on the
costs/statistics of the relational table scan and graph query.  The
graph execution plan is attached to the join plan node.

The parse/query/plan nodes carry a label denoting their region, so that
the appropriate region's routines can be called.

It would be interesting to see how much of what you want you can get with what we currently have and what pieces are really missing. 

Am I right that if you wrote a function in C to take a Cypher query plan, and analyse it, and execute it, the only thing really missing would be feedback to the PostgreSQL planner regarding number of rows expected?

Regards
MauMau




--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

This may be interesting... they implement cypher (unfortunately they had to fork in order to have cypher be a first-class query language alongside SQL).

https://github.com/bitnine-oss/agensgraph



On Mon, Aug 21, 2017 at 12:44 AM Chris Travers <chris.travers@adjust.com> wrote:
On Sun, Aug 20, 2017 at 4:10 AM, MauMau <maumau307@gmail.com> wrote:
From: Chris Travers
> Why can't you do all this in a language handler and treat it as a
> user-defined function?
> ...
> If you have a language handler for cypher, why do you need in_region
> or cast_region?  Why not just have a graph_search() function which
> takes in a cypher query and returns a set of records?

The language handler is for *stored* functions.  The user-defined
function (UDF) doesn't participate in the planning of the outer
(top-level) query.  And they both assume that they are executed in SQL
commands.

Sure, but stored functions can take arguments, such as a query string which gets handled by the language handler.  There's absolutely no reason you cannot declare a function in C that takes in a Cypher query and returns a set of tuples.  And you can do a whole lot with preloaded shared libraries if you need to.

The planning bit is more difficult, but see below as to where I see major limits here.

I want the data models to meet these:

1) The query language can be used as a top-level session language.
For example, if an app specifies "region=cypher_graph" at database
connection, it can use the database as a graph database and submit
Cypher queries without embedding them in SQL.

That sounds like a footgun.  I would probably think of those cases as being ideal for a custom background worker, similar to Mongress.  Expecting to be able to switch query languages on the fly strikes me as adding totally needless complexity everywhere, to be honest.  Having different listeners on different ports simplifies this a lot, and additionally having query languages for ad-hoc mixing via language handlers might get you most of what you want already.

2) When a query contains multiple query fragments of different data
models, all those fragments are parsed and planned before execution.
The planner comes up with the best plan, crossing the data model
boundary.  Take the query example in my first mail, which joins a
relational table and the result of a graph query.  The relational
planner considers how to scan the table, the graph planner considers
how to search the graph, and the relational planner considers how to
join the two fragments.

It seems like all you really need is a planner hook for user-defined languages (i.e. "how many rows does this function return with these parameters", right?).  Right now we allow hints but they are static.  I wonder how hard this would be using preloaded shared libraries.
 

So in_region() and cast_region() are not functions to be executed
during the execution phase, but syntax constructs that are converted,
during the analysis phase, into calls to another region's
parser/analyzer and an inter-model cast routine.

So basically they work like immutable functions except that you cannot index the output?

1. The relational parser finds in_region('cypher_graph', 'graph
query') and produces a parse node InRegion(region_name, query) in the
parse tree.

2. The relational analyzer looks up the system catalog to check whether
the specified region exists, then calls its parser/analyzer to produce
the query tree for the graph query fragment.  The relational analyzer
attaches the graph query tree to the InRegion node.

3. When the relational planner finds the graph query tree, it passes
the graph query tree to the graph planner to produce the graph
execution plan.

4. The relational planner produces a join plan node, based on the
costs/statistics of the relational table scan and graph query.  The
graph execution plan is attached to the join plan node.

The parse/query/plan nodes carry a label denoting their region, so that
the appropriate region's routines can be called.

It would be interesting to see how much of what you want you can get with what we currently have and what pieces are really missing. 

Am I right that if you wrote a function in C to take a Cypher query plan, and analyse it, and execute it, the only thing really missing would be feedback to the PostgreSQL planner regarding number of rows expected?

Regards
MauMau




--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

From: Henry M
> This may be interesting... they implement cypher (unfortunately they
> had to fork in order to have cypher be a first-class query language
> alongside SQL).
>
> https://github.com/bitnine-oss/agensgraph

I'm sorry for my very late reply.

Thanks for the information.  AgensGraph is certainly interesting, but
the problem is that it's a fork of PostgreSQL, as you mentioned.  I
want the data models, including query languages, to be pluggable
extensions, so that various people (especially database researchers?)
can develop them flexibly.  Of course, I want various data models to
be incorporated in core as early as possible, but I'm afraid that's
not easy.  If new data models can be added as extensions, they can be
developed outside the PostgreSQL community process, become popular and
mature, and then be embraced in core, as GiST/SP-GiST indexes and
full-text search were.


Regards
MauMau




I was just reading the PostgreSQL 11 roadmap and it mentions native graph support. I would be interested in following the design work for this.

Would this require the new pluggable storage which is currently in development, or would the existing storage engine be sufficient? I am just wondering if there are any rough designs/plans for this...

https://wiki.postgresql.org/wiki/Fujitsu_roadmap#Multi-model_database
  • graph: Natively support graph data model. Implement Cypher and/or Gremlin as the query language through UDFs.

Thank you,
Henry


On Sun, Sep 3, 2017 at 1:14 PM MauMau <maumau307@gmail.com> wrote:
From: Henry M
> This may be interesting... they implement cypher (unfortunately they
> had to fork in order to have cypher be a first-class query language
> alongside SQL).
>
> https://github.com/bitnine-oss/agensgraph

I'm sorry for my very late reply.

Thanks for the information.  AgensGraph is certainly interesting, but
the problem is that it's a fork of PostgreSQL, as you mentioned.  I
want the data models, including query languages, to be pluggable
extensions, so that various people (especially database researchers?)
can develop them flexibly.  Of course, I want various data models to
be incorporated in core as early as possible, but I'm afraid that's
not easy.  If new data models can be added as extensions, they can be
developed outside the PostgreSQL community process, become popular and
mature, and then be embraced in core, as GiST/SP-GiST indexes and
full-text search were.


Regards
MauMau

From: Henry
Would this require the new pluggable storage which is currently in
development, or would the existing storage engine be sufficient? I am
just wondering if there are any rough designs/plans for this...

I'm sorry for the long interval.  The graph model can be implemented
on top of the relational storage engine, the way Oracle and SQL Server
store graph data in relational tables.  But an optimal storage engine
for the graph model would maximize graph traversal performance by,
for example, introducing direct pointers between graph nodes.  So I
think a new level of pluggable storage is necessary.
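
As a sketch of the graph-on-relational approach (table and column names are illustrative), the friend traversal from the earlier example can be expressed over plain tables with a recursive CTE, trading direct pointers for joins:

  CREATE TABLE person (id bigint PRIMARY KEY, name text);
  CREATE TABLE is_friend_of (
    src bigint REFERENCES person(id),
    dst bigint REFERENCES person(id),
    PRIMARY KEY (src, dst)
  );

  -- Friends of John up to two hops: the relational equivalent of
  -- [:IS_FRIEND_OF*1..2] in the earlier Cypher fragment
  WITH RECURSIVE friends(id, depth) AS (
    SELECT f.dst, 1
    FROM person p JOIN is_friend_of f ON f.src = p.id
    WHERE p.name = 'John'
    UNION
    SELECT f.dst, fr.depth + 1
    FROM friends fr JOIN is_friend_of f ON f.src = fr.id
    WHERE fr.depth < 2
  )
  SELECT DISTINCT id FROM friends;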

Regards
MauMau




From: Tom Lane
It sounds like what you want is to replace all of Postgres except
the name.  I'm not clear on the point.


The point is to make PostgreSQL a versatile database suitable even for
niche use cases.  I want more people to love and rely on PostgreSQL.
Ideally, I want to see various data models incorporated in core from
the beginning, but it would be difficult.  So, this pluggable data
model is necessary to foster data model development both inside and
outside the PostgreSQL community.  With that said, I hope PostgreSQL
will someday absorb those data models, just like PL/pgSQL is now in
core.

But for now, I think the relational data model will continue to play a
central role.  I don't think it's worth making the current relational
data model implementation a pluggable module; whether to do that will
depend on the software design and source code cleanliness.

I don't understand yet how painful it would be to support other data
models as first-class citizens, and I may be a reckless man who
doesn't know fear.  So I hope you can help pave this difficult
way together.

Regards
MauMau




I just came across the following paper and project and thought I would share:

It seems some incremental graph query extensions to SQL could add powerful capabilities to PostgreSQL (without having to think about a complete query language replacement). The incremental change could include paths as first-class citizens and a new MATCH keyword.

This is what I was reading:

https://arxiv.org/abs/1712.01550  G-CORE: A Core for Future Graph Query Languages

"Path as first-class citizens. The notion of Path is fundamental for graph databases, because it introduces an intermediate abstraction level that allows to represents how elements in a graph are related. The facilities provided by a graph query language to manipulate paths (i.e. describe, search, filter, count, annotate, return, etc.) increase the expressivity of the language. Particularly, the ability to return paths enables the user to post-process paths within the query language rather that in an ad-hoc manner [15]."

They have an open source parser for G-Core here: https://github.com/ldbc/ldbc_gcore_parser

"This is a G-Core query example which matches persons with the name “John Doe” together with indirect friends that live in the same city and returns a table with the names of these friends."

SELECT m.lastName + ', ' + m.firstName AS friendName
MATCH (n:Person)-/<:knows*>/->(m:Person)
WHERE n.firstName = 'John' AND n.lastName = 'Doe'
  AND (n)-[:isLocatedIn]->()<-[:isLocatedIn]-(m)


Oracle has a similar graph query language as well: http://pgql-lang.org

SELECT p2.name AS friend_of_friend
FROM facebook_graph                               /* In the Facebook graph..   */
MATCH (p1:Person) -/:friend_of{2}/-> (p2:Person)  /* ..match two-hop friends.. */
WHERE p1.name = 'Mark'                            /* ..of Mark.                */

And Microsoft SQL Server added limited MATCH capability:

https://docs.microsoft.com/en-us/sql/t-sql/queries/match-sql-graph
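
For comparison, a sketch of the same friend-of-friend pattern in SQL Server's MATCH syntax over its node and edge tables (written from memory, so details may be off):

  SELECT p2.name AS friend_of_friend
  FROM Person p1, friendOf fo1, Person pm, friendOf fo2, Person p2
  WHERE MATCH(p1-(fo1)->pm-(fo2)->p2)
    AND p1.name = 'Mark';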



On Sun, Dec 3, 2017 at 2:37 PM Deep-Impact <sakasaumau2009@clear.ocn.ne.jp> wrote:
From: Tom Lane
It sounds like what you want is to replace all of Postgres except
the name.  I'm not clear on the point.


The point is to make PostgreSQL a versatile database suitable even for
niche use cases.  I want more people to love and rely on PostgreSQL.
Ideally, I want to see various data models incorporated in core from
the beginning, but it would be difficult.  So, this pluggable data
model is necessary to foster data model development both inside and
outside the PostgreSQL community.  With that said, I hope PostgreSQL
will someday absorb those data models, just like PL/pgSQL is now in
core.

But for now, I think the relational data model will continue to play a
central role.  I don't think it's worth making the current relational
data model implementation a pluggable module; whether to do that will
depend on the software design and source code cleanliness.

I don't understand yet how painful it would be to support other data
models as first-class citizens, and I may be a reckless man who
doesn't know fear.  So I hope you can help pave this difficult
way together.

Regards
MauMau