Thread: Creating a read/write virtual table?

From: Ben Martin

Hi,
  I'm the author of the libferris virtual filesystem. Last year I
created the ability to expose a ferris filesystem as a virtual table in
SQLite [1] and would like to do much the same for PostgreSQL if
possible.

  While digging into how to make the whole libferris-as-vtable thing
happen with Pg:
http://www.postgresql.org/docs/8.4/static/xfunc-c.html
it struck me that, in the part on returning sets:
http://www.postgresql.org/docs/8.4/static/xfunc-c.html#XFUNC-C-RETURN-SET
I might be able to prod the tupledesc generated by
get_call_result_type() to work out what should be returned.
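
  To make that concrete, here is a rough sketch of the sort of
declaration I have in mind (the function name, library name, and
columns are all hypothetical, not working code). With a C function
declared as returning SETOF record, the column definition list the
caller supplies is what get_call_result_type() turns into the
tupledesc:

CREATE FUNCTION ferris_ls(url text)
RETURNS SETOF record
AS 'libferris_pg', 'ferris_ls'  -- hypothetical shared library / symbol
LANGUAGE C STRICT;

-- the AS t(...) column definition list becomes the result tupledesc
SELECT *
FROM ferris_ls('file:///tmp')
  AS t(name text, size bigint, mtime timestamptz);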

There are two major drawbacks to this approach:
1) It runs as the postgres user on the server.
2) It is read only.

  In my SQLite vtable neither of the above restrictions applies. To
mitigate (1) I was hopeful that there was a way to get the connection to
PG on the server to fork/setuid, à la how one can set up Apache to serve
web requests as particular Linux uids. Given the difference between
\copy and COPY (client vs. server filesystem access) I don't know
whether such a restriction is easy to lift.

  I think using a function restricts the results to being read only.
Perhaps a second function could allow updates to be processed
separately.

  Any pointers / suggestions on how to do this sort of thing with
postgresql would be wonderful. Please CC any such replies as I'm not
currently on list.

[1]
http://ldn.linuxfoundation.org/article/libferris-and-sqlite-a-powerful-combination-part-1
https://ldn.linuxfoundation.org/node/26821



Re: Creating a read/write virtual table?

From: Craig Ringer

On 12/04/10 11:34, Ben Martin wrote:

>    In my SQLite vtable neither of the above restrictions applies. To
> mitigate (1) I was hopeful that there was a way to get the connection to
> PG on the server to fork/setuid

The pg backends are designed to run as one user. You might have issues
setting it up to do otherwise.

Consider talking to your storage backend over a unix socket, pipe,
tcp/ip socket, shared memory, or other suitable IPC mechanism from Pg.
This also protects Pg from any issues caused by the storage backend
doing things like spawning unexpected threads, corrupting the
stack/heap, outright crashing, etc.

>    I think using a function restricts the results to being read only.

If you're planning on writing a C-level UDF loaded via 'CREATE
FUNCTION', then yes it'll have to return a materialized data set and
there's no way to UPDATE/INSERT/DELETE against the function's results.

> Perhaps a second function could allow updates to be processed
> separately.

Yep. Though that's clumsy, I expect it'd work, and honestly it's
probably your best bet.
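
Something along these lines, say (a hand-wavy sketch: both function
names are made up, and the writer is just a scalar function invoked for
its side effects):

-- reads: a set-returning function queried like a table
SELECT * FROM ferris_ls('file:///tmp') AS t(name text, size bigint);

-- writes: a second function called for its side effects
SELECT ferris_put('file:///tmp/notes.txt', 'new contents');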

>    Any pointers / suggestions on how to do this sort of thing with
> postgresql would be wonderful. Please CC any such replies as I'm not
> currently on list.

AFAIK Pg's innards aren't currently very friendly toward extension with
random user-provided table/database storage backends. Unlike
(say) MySQL, there's no pluggable storage engine system.

That means Pg's planner/executor can make a lot of decisions based on
things it "knows" about how Pg's data storage works, speeding and
simplifying it and letting it be much smarter. Pg can enforce strict
rules about transactional scope, ACID, error handling, etc based on
knowledge of the data storage's capabilities and expectations. On the
other hand, it means it's hard to plug in your own storage system.

After all, how would your external system handle something like this:

BEGIN;
INSERT INTO my_external_storage(a, b) VALUES (1,2);
SAVEPOINT sp_a;
INSERT INTO my_external_storage(a, b) VALUES (3,4);
ROLLBACK TO SAVEPOINT sp_a;
INSERT INTO my_external_storage(a, b) VALUES (5,6);
COMMIT;

How would it honour SERIALIZABLE isolation if it doesn't itself
maintain MVCC snapshots and have some way of mapping Pg's xids to its
internal versions? Etc.


Pg would have to be able to error out on many commands with "not
supported by underlying storage system" or the like. Yet sometimes it
won't even know that until it's well into executing something.

SQLite has the advantage of simplicity. It doesn't have to worry about
complicated concurrency rules, versioning, user-defined data type
storage, etc etc etc. And MySQL was designed to handle pluggable
backends, so it's prepared for storage backends not to be able to
perform certain operations, and it makes some big sacrifices in terms of
its broader capabilities to permit pluggable backends.

I guess at this point I have to ask "why do you want to do this? What
will you gain?"

--
Craig Ringer

Re: Creating a read/write virtual table?

From: Ben Martin

On Mon, 2010-04-12 at 13:15 +0800, Craig Ringer wrote:
> On 12/04/10 11:34, Ben Martin wrote:
>
> >    In my SQLite vtable neither of the above restrictions applies. To
> > mitigate (1) I was hopeful that there was a way to get the connection to
> > PG on the server to fork/setuid
>
> The pg backends are designed to run as one user. You might have issues
> setting it up to do otherwise.

I suspected this would be the case. With Apache it's somewhat different:
it can set the uid and then access the filesystem, whereas trying to
have multiple users and processes accessing Pg's data files would be a
much messier thing.

>
> Consider talking to your storage backend over a unix socket, pipe,
> tcp/ip socket, shared memory, or other suitable IPC mechanism from Pg.
> This also protects Pg from any issues caused by the storage backend
> doing things like spawning unexpected threads, corrupting the
> stack/heap, outright crashing, etc.

Hmm, this might be a good way to go anyway, just for the isolation. And
actually fork/exec-ing ferrisls --xml and massaging that into the return
value might be a fairly quick and robust initial implementation.

Thinking more broadly, being able to run "a program" and massage
xml/json/yaml back as the return value of a custom function might be a
handy generic thing to have. I'll have a quick search around to make
sure such a thing doesn't already exist...
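
  Something like the following is what I'm imagining (run_program here
is purely hypothetical; as far as I know nothing like it ships with Pg
today):

-- run an external program on the server, hand back its stdout
SELECT *
FROM run_program('ferrisls --xml file:///tmp')
  AS t(output xml);

-- ...which could then be picked apart with Pg's XML functions,
-- e.g. xpath()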

>
> >    I think using a function restricts the results to being read only.
>
> If you're planning on writing a C-level UDF loaded via 'CREATE
> FUNCTION', then yes it'll have to return a materialized data set and
> there's no way to UPDATE/INSERT/DELETE against the function's results.
>
> > Perhaps a second function could allow updates to be processed
> > separately.
>
> Yep. Though that's clumsy, I expect it'd work, and honestly it's
> probably your best bet.

OK

>
> >    Any pointers / suggestions on how to do this sort of thing with
> > postgresql would be wonderful. Please CC any such replies as I'm not
> > currently on list.
>
> AFAIK Pg's innards aren't currently very friendly toward extension with
> random user-provided table/database storage backends. Unlike
> (say) MySQL, there's no pluggable storage engine system.
>
> That means Pg's planner/executor can make a lot of decisions based on
> things it "knows" about how Pg's data storage works, speeding and
> simplifying it and letting it be much smarter. Pg can enforce strict
> rules about transactional scope, ACID, error handling, etc based on
> knowledge of the data storage's capabilities and expectations. On the
> other hand, it means it's hard to plug in your own storage system.

This all sounds very reasonable. Things also become a quagmire when the
database needs to ask the table implementation about virtual indexes
and their selectivity, etc. Once again, that's not too much of a problem
for smallish data sets in SQLite, where a full virtual table scan is
acceptable for performance.
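
  I do see that Pg at least lets a function declaration carry rough
planner estimates, which might be enough for the simple cases
(ferris_ls being my hypothetical function from earlier):

-- tell the planner roughly what a call costs and how many
-- rows it is expected to return
ALTER FUNCTION ferris_ls(text) COST 1000 ROWS 500;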

>
> After all, how would your external system handle something like this:
>
> BEGIN;
> INSERT INTO my_external_storage(a, b) VALUES (1,2);
> SAVEPOINT sp_a;
> INSERT INTO my_external_storage(a, b) VALUES (3,4);
> ROLLBACK TO SAVEPOINT sp_a;
> INSERT INTO my_external_storage(a, b) VALUES (5,6);
> COMMIT;
>
> How would it honour SERIALIZABLE isolation if it doesn't itself
> maintain MVCC snapshots and have some way of mapping Pg's xids to its
> internal versions? Etc.

This indeed gets very hairy. I have thought about trying to implement
transactions in the filesystem, but for some virtual filesystems it is
not really possible, especially when rollback itself could fail.

>
>
> Pg would have to be able to error out on many commands with "not
> supported by underlying storage system" or the like. Yet sometimes it
> won't even know that until it's well into executing something.
>
> SQLite has the advantage of simplicity. It doesn't have to worry about
> complicated concurrency rules, versioning, user-defined data type
> storage, etc etc etc. And MySQL was designed to handle pluggable
> backends, so it's prepared for storage backends not to be able to
> perform certain operations, and it makes some big sacrifices in terms of
> its broader capabilities to permit pluggable backends.
>
> I guess at this point I have to ask "why do you want to do this? What
> will you gain?"
>

A few things I do with the SQLite engine are analysing Apache log files
or other separated files (like running awk with SQL as the query
language), and interacting with web services as a table, e.g. selecting
data from a Google spreadsheet and a local SQLite database in a single
query. As a fair number of my use cases simply introduce and join data
from other sources into queries, I think starting out with just a
record set returning function would be a good thing.

This nicely avoids the issue that the filesystem can't really do MVCC
or transactions properly, but still gives access to some more diverse
data in a query for folks who want to perform such queries.
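
  To sketch the sort of query I mean (ferris_ls is still my
hypothetical record set returning function, the URL scheme is just a
stand-in for whatever libferris mounts, and local_tags is an ordinary
local table):

SELECT s.name, s.size, t.tag
FROM ferris_ls('gspreadsheet://mysheet') AS s(name text, size bigint)
JOIN local_tags t ON t.name = s.name;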



Re: Creating a read/write virtual table?

From: Craig Ringer

On 12/04/10 17:52, Ben Martin wrote:

> A few things I do with the SQLite engine are analysing Apache log files
> or other separated files (like running awk with SQL as the query
> language), and interacting with web services as a table, e.g. selecting
> data from a Google spreadsheet and a local SQLite database in a single
> query. As a fair number of my use cases simply introduce and join data
> from other sources into queries, I think starting out with just a
> record set returning function would be a good thing.
>
> This nicely avoids the issue that the filesystem can't really do MVCC
> or transactions properly, but still gives access to some more diverse
> data in a query for folks who want to perform such queries.
>

If that's all you're after, you may well be able to use the existing
COPY protocol to populate a temp table by outputting suitable CSV from
your data source.
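
For the apache log case, something like this (an untested sketch; the
column layout is just for illustration):

CREATE TEMP TABLE apache_log(ip inet, ts timestamptz, url text, status int);

-- from psql, stream CSV produced by your tool straight in
\copy apache_log FROM '/tmp/access.csv' WITH CSV

-- then query and join it like any other table
SELECT url, count(*) FROM apache_log GROUP BY url ORDER BY 2 DESC;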

--
Craig Ringer