Thread: How to share the result data of separated plan
While designing writeable CTEs, I found it needs a new infrastructure that helps to share the result data among separated PlannedStmt. Up to now, in a portal we only have multiple plans that share nothing but relations, then executor starts, runs and ends separately. Writeable CTEs need the result tuplestore of separated plans that were executed before main query . In a prototype Marko and I made adds a List of DerivedTable to Portal, which holds each tuplestore and tupledesc set and the previously-executed result of the child plans are stored in these structure. Here we have some problems: - the process to make tuplestore receiver, to set it to executor of child plans, to run it and to update snapshot is very common but very local in pquery.c. We need similar copy in explain.c and functions.c - we are not sure where to save this List of DerivedTable. In pquery.c it is Portal, and we can see it in ExplainState in explan.c. But very messy... - Wait, isn't there any problem to share data without relations among separated plans? The most conservative way to solve it is to make it by rewriter not by planner to separated executions with temporary tables instead of tuplestore. In this way we don't do so much in planner and writeable CTEs is only a rewrite process. The first reason why we are hacking the planner not rewriter is that the main purpose of writeable CTEs is the performance gained by using tuplestore instead of temporary tables. In the second place, we will be able to extend it to combine RECURSIVE queries (which may be discarded in the first version for simplicity). Please find attached patch (pquery.c only) as a prototype we are working to see what is done about the problem. PlannedStmt now has childStmt member to store separated child plans(PlannedStmts) and the portal calls FillDerivedTables() to execute them and fill tuplestores. Tuplestores are saved in the portal member ddts. They will be referred by main query so they are passed to CreateQueryDesc() on the main query execution. I'm very sorry the explanation is poor, but if you have some thoughts, please let us know. Regards, -- Hitoshi Harada
Attachment
Hitoshi Harada <umi.tanuki@gmail.com> writes: > Up to now, in a portal we only have multiple plans that share nothing > but relations, then executor starts, runs and ends separately. > Writeable CTEs need the result tuplestore of separated plans that were > executed before main query . Um ... why? I thought the whole point of breaking out ModifyTable as a separate node type was so that a query involving writeable CTEs would still be just one plan tree. regards, tom lane
On 2010-11-07 6:23 PM +0200, Tom Lane wrote: > Hitoshi Harada<umi.tanuki@gmail.com> writes: >> Up to now, in a portal we only have multiple plans that share nothing >> but relations, then executor starts, runs and ends separately. >> Writeable CTEs need the result tuplestore of separated plans that were >> executed before main query . > > Um ... why? I thought the whole point of breaking out ModifyTable > as a separate node type was so that a query involving writeable CTEs > would still be just one plan tree. We tried that for 9.0 and it didn't work. Almost all work for 9.1 has been spent on creating an infrastructure for running the executor separately for every WITH list element when wCTEs are present. Regards, Marko Tiikkaja
Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: > On 2010-11-07 6:23 PM +0200, Tom Lane wrote: >> Um ... why? I thought the whole point of breaking out ModifyTable >> as a separate node type was so that a query involving writeable CTEs >> would still be just one plan tree. > We tried that for 9.0 and it didn't work. Almost all work for 9.1 has > been spent on creating an infrastructure for running the executor > separately for every WITH list element when wCTEs are present. I guess I shoulda been paying closer attention :-(. That really, really seems like fundamentally the wrong direction. What was it that was unfixable about the other way? If it is unfixable, should we revert ModifyTable? regards, tom lane
On 2010-11-07 8:08 PM +0200, Tom Lane wrote: > Marko Tiikkaja<marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-07 6:23 PM +0200, Tom Lane wrote: >>> Um ... why? I thought the whole point of breaking out ModifyTable >>> as a separate node type was so that a query involving writeable CTEs >>> would still be just one plan tree. > >> We tried that for 9.0 and it didn't work. Almost all work for 9.1 has >> been spent on creating an infrastructure for running the executor >> separately for every WITH list element when wCTEs are present. > > I guess I shoulda been paying closer attention :-(. That really, really > seems like fundamentally the wrong direction. What was it that was > unfixable about the other way? If it is unfixable, should we revert > ModifyTable? The immediate problem at hand was that scan nodes cached the number of blocks a relation has and didn't see new tuples added by wCTEs. That's not unfixable, but the consensus seemed to be that running the executor separately is better. ModifyTable nodes still save us from duplicating a lot of code; you can just run an executor on a plan and expect it to work, even it's a DML statement. Reverting it would be a huge mistake in my opinion. Regards, Marko Tiikkaja
On Sun, Nov 7, 2010 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-07 6:23 PM +0200, Tom Lane wrote: >>> Um ... why? I thought the whole point of breaking out ModifyTable >>> as a separate node type was so that a query involving writeable CTEs >>> would still be just one plan tree. > >> We tried that for 9.0 and it didn't work. Almost all work for 9.1 has >> been spent on creating an infrastructure for running the executor >> separately for every WITH list element when wCTEs are present. > > I guess I shoulda been paying closer attention :-(. That really, really > seems like fundamentally the wrong direction. What was it that was > unfixable about the other way? If it is unfixable, should we revert > ModifyTable? The relevant thread is here: http://archives.postgresql.org/pgsql-hackers/2010-02/msg00783.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2010/11/8 Tom Lane <tgl@sss.pgh.pa.us>: > Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-07 6:23 PM +0200, Tom Lane wrote: >>> Um ... why? I thought the whole point of breaking out ModifyTable >>> as a separate node type was so that a query involving writeable CTEs >>> would still be just one plan tree. > >> We tried that for 9.0 and it didn't work. Almost all work for 9.1 has >> been spent on creating an infrastructure for running the executor >> separately for every WITH list element when wCTEs are present. > > I guess I shoulda been paying closer attention :-(. That really, really > seems like fundamentally the wrong direction. What was it that was > unfixable about the other way? If it is unfixable, should we revert > ModifyTable? Especially snapshot should not be updated during execution so we decided to separate plans and restart executor. But reading closer your response, it occurred to me that ModifyTable can be the storage to be shared among separated plans instead of newly invented DtScan. Are you suggesting embed ModifyTable into the main query, that is also on the top of child plans? In this way we can still separate plans and scan the result of child plans without much modification of portal/explain? Regards, -- Hitoshi Harada
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Nov 7, 2010 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I guess I shoulda been paying closer attention :-(. �That really, really >> seems like fundamentally the wrong direction. �What was it that was >> unfixable about the other way? �If it is unfixable, should we revert >> ModifyTable? > The relevant thread is here: > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00783.php My opinion is still the same as here: http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php namely, that all we should be worrying about is a tuplestore full of RETURNING tuples. Any other side-effects of a DML subquery should *not* be visible to the calling query, and therefore all this argument about snapshots and seqscan limits is beside the point. regards, tom lane
Hitoshi Harada <umi.tanuki@gmail.com> writes: > But reading closer your response, it occurred to me that ModifyTable > can be the storage to be shared among separated plans instead of newly > invented DtScan. Are you suggesting embed ModifyTable into the main > query, that is also on the top of child plans? In this way we can > still separate plans and scan the result of child plans without much > modification of portal/explain? There is noplace where we need to "scan the result of child plans". We need to scan the set of RETURNING tuples emitted by a child ModifyTable node. That has nothing whatever to do with what it wrote to disk. We would hardly wish to try to reconstruct those tuples from re-reading what it wrote to disk anyway --- that idea fails completely for a DELETE RETURNING subquery, for example. regards, tom lane
On 2010-11-08 6:38 PM +0200, Tom Lane wrote: > Robert Haas<robertmhaas@gmail.com> writes: >> On Sun, Nov 7, 2010 at 1:08 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> I guess I shoulda been paying closer attention :-(. That really, really >>> seems like fundamentally the wrong direction. What was it that was >>> unfixable about the other way? If it is unfixable, should we revert >>> ModifyTable? > >> The relevant thread is here: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00783.php > > My opinion is still the same as here: > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php > > namely, that all we should be worrying about is a tuplestore full of > RETURNING tuples. Any other side-effects of a DML subquery should > *not* be visible to the calling query, and therefore all this argument > about snapshots and seqscan limits is beside the point. What happened to: http://archives.postgresql.org/pgsql-hackers/2009-10/msg00566.php ? Regards, Marko Tiikkaja
2010/11/9 Tom Lane <tgl@sss.pgh.pa.us>: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Nov 7, 2010 at 1:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I guess I shoulda been paying closer attention :-(. That really, really >>> seems like fundamentally the wrong direction. What was it that was >>> unfixable about the other way? If it is unfixable, should we revert >>> ModifyTable? > >> The relevant thread is here: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00783.php > > My opinion is still the same as here: > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php > > namely, that all we should be worrying about is a tuplestore full of > RETURNING tuples. Any other side-effects of a DML subquery should > *not* be visible to the calling query, and therefore all this argument > about snapshots and seqscan limits is beside the point. Current consensus says: WITH x AS (SELECT count(*) FROM t), y AS (DELETE FROM t), z AS (SELECT count(*) FROM t) SELECT x.count, z.count FROM x, z; should return 0 for z.count but some number of original rows for x.count. So I think you need to read the underlying table itself as well as the emitted data of ModfyTable (or the result of any writeable CTE queries) in *predictable order*. To make it happen, we need CCI in each execution of child plans. Regards, -- Hitoshi Harada
Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: > On 2010-11-08 6:38 PM +0200, Tom Lane wrote: >> My opinion is still the same as here: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php >> >> namely, that all we should be worrying about is a tuplestore full of >> RETURNING tuples. Any other side-effects of a DML subquery should >> *not* be visible to the calling query, and therefore all this argument >> about snapshots and seqscan limits is beside the point. > What happened to: > http://archives.postgresql.org/pgsql-hackers/2009-10/msg00566.php ? On the whole, I think that's overspecifying the behavior. Compare what happens if you do the same thing in one query now; that is, you have say an UPDATE query and within that you invoke some function that looks directly at the target table. Is it going to see a consistent view of the data? No, it's going to see a partially updated table. The argument I'm making at the moment is that a query containing DML CTEs should all execute with the same snapshot, meaning that no part of the query should see the effects of any other part. If, within the query, you execute volatile functions that look at the target tables, they will see an unspecified subset of the updates as having been applied already. This is the same as what happens with a volatile function in an updating query now, and we've not heard many complaints about it. The alternative is to artificially serialize the DML CTEs, which while it does have some advantages doesn't seem like a win overall. In particular, defining that as the behavior will greatly constrain our abilities to optimize anything in future. I think that a typical use-case will be something like WITH u AS (UPDATE data_table SET ... RETURNING ...)INSERT INTO log_table SELECT * FROM u; If that UPDATE touches a lot of rows, users will be unhappy if the RETURNING data builds up in memory rather than getting streamed directly to log_table. (I'm not saying this has to work that way on day zero, but I am thinking we'd better be able to do it eventually.) regards, tom lane
Hitoshi Harada <umi.tanuki@gmail.com> writes: > 2010/11/9 Tom Lane <tgl@sss.pgh.pa.us>: >> My opinion is still the same as here: >> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php > Current consensus says: > WITH x AS (SELECT count(*) FROM t), y AS (DELETE FROM t), z AS (SELECT > count(*) FROM t) SELECT x.count, z.count FROM x, z; > should return 0 for z.count but some number of original rows for > x.count. Consensus according to who? It's at least as consistent for all the queries to start from the same snapshot, meaning that x and z would produce the same results (independent of what y does). It might be worth inspecting the SQL2011 draft to see if they provide any guidance on what ought to happen here. regards, tom lane
2010/11/9 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> 2010/11/9 Tom Lane <tgl@sss.pgh.pa.us>: >>> My opinion is still the same as here: >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php > >> Current consensus says: > >> WITH x AS (SELECT count(*) FROM t), y AS (DELETE FROM t), z AS (SELECT >> count(*) FROM t) SELECT x.count, z.count FROM x, z; > >> should return 0 for z.count but some number of original rows for >> x.count. > > Consensus according to who? It's at least as consistent for all the > queries to start from the same snapshot, meaning that x and z would > produce the same results (independent of what y does). Consensus according to the pgsql-hackers archive (I'll collect some links if you want). But to tell the truth, while trying to implement it I've been feeling it is really bad design to separate plans not queries (as rewriter). Through overall the PostgreSQL as an SQL engine, it is not designed to do so. Especially if you imagine letting the CCI-style writeable CTEs in SQL functions makes disaster; SQL functions prepare the plan on the function init and wCTEs need to execute the child plan since it is necessary before the CreateQueryDesc, then it results in that only child plans are executed but main query are not in case of the function wasn't called. Very inconsistent. > It might be worth inspecting the SQL2011 draft to see if they provide > any guidance on what ought to happen here. I agree we need decision of what to do (or what we want) with this feature. Regards, -- Hitoshi Harada
On 2010-11-08 7:26 PM +0200, Tom Lane wrote: > Marko Tiikkaja<marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-08 6:38 PM +0200, Tom Lane wrote: >>> My opinion is still the same as here: >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php >>> >>> namely, that all we should be worrying about is a tuplestore full of >>> RETURNING tuples. Any other side-effects of a DML subquery should >>> *not* be visible to the calling query, and therefore all this argument >>> about snapshots and seqscan limits is beside the point. > >> What happened to: >> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00566.php ? > > The alternative is to artificially serialize the DML CTEs, which > while it does have some advantages doesn't seem like a win overall. > In particular, defining that as the behavior will greatly constrain > our abilities to optimize anything in future. I think that a typical > use-case will be something like > > WITH u AS (UPDATE data_table SET ... RETURNING ...) > INSERT INTO log_table SELECT * FROM u; > > If that UPDATE touches a lot of rows, users will be unhappy if the > RETURNING data builds up in memory rather than getting streamed directly > to log_table. (I'm not saying this has to work that way on day zero, > but I am thinking we'd better be able to do it eventually.) We've discussed this before and the consensus was that as long as we don't change the results, we can optimize the materialization away. This seems to be a textbook example of such a case. Regards, Marko Tiikkaja
On Mon, Nov 8, 2010 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-08 6:38 PM +0200, Tom Lane wrote: >>> My opinion is still the same as here: >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00688.php >>> >>> namely, that all we should be worrying about is a tuplestore full of >>> RETURNING tuples. Any other side-effects of a DML subquery should >>> *not* be visible to the calling query, and therefore all this argument >>> about snapshots and seqscan limits is beside the point. > >> What happened to: >> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00566.php ? > > On the whole, I think that's overspecifying the behavior. Compare what > happens if you do the same thing in one query now; that is, you have > say an UPDATE query and within that you invoke some function that looks > directly at the target table. Is it going to see a consistent view of > the data? No, it's going to see a partially updated table. > > The argument I'm making at the moment is > [exactly the opposite of what I said before] You've now vetoed both "using the same snapshot" and "using different snapshots" at different points in time. Had you taken your current position at this time a year ago, we might well have this feature in 9.0. I guess that's water under the bridge at this point, but I believe that a pretty substantial amount of work has gone into whacking this patch around based on your previously-expressed views as to how it should be implemented, and they seem to have just shifted again. Frankly, I buy both arguments. I think that if we use the same snapshot, we are going to get people complaining about really screwy behavior if the same table is used more than once in the same query (BEFORE triggers already make my head explode). I think if we use different snapshots, optimization is going to be rather difficult, and people are going to complain about the performance. So I don't know what the right thing to do is, but it is not the case that we are designing this tabula rasa. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Marko Tiikkaja <marko.tiikkaja@cs.helsinki.fi> writes: > On 2010-11-08 7:26 PM +0200, Tom Lane wrote: >> The alternative is to artificially serialize the DML CTEs, which >> while it does have some advantages doesn't seem like a win overall. > We've discussed this before and the consensus was that as long as we > don't change the results, we can optimize the materialization away. No, because the problem is mainly about what might happen if user-defined functions choose to look at the target tables. We can't really tell what triggers are going to do, to take one item that the planner has no access to. I think we have to decide up front which implementation behavior it's going to be, and if we go with serialized queries, we'll be stuck with that. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > You've now vetoed both "using the same snapshot" and "using different > snapshots" at different points in time. Yeah, I did change my mind between 2009-10 and 2010-02. I was not aware that I'm not allowed to change my mind ... regards, tom lane
On Mon, Nov 8, 2010 at 1:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> You've now vetoed both "using the same snapshot" and "using different >> snapshots" at different points in time. > > Yeah, I did change my mind between 2009-10 and 2010-02. I was not aware > that I'm not allowed to change my mind ... I didn't mean to say that you aren't, just that it's hard to hit a moving target. As far as I was aware until yesterday, your 2009-10 opinion was still the most current one. Your email also gave the (perhaps incorrect) impression that you didn't remember the previous conversations on this topic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2010-11-08 8:30 PM +0200, Tom Lane wrote: > Marko Tiikkaja<marko.tiikkaja@cs.helsinki.fi> writes: >> On 2010-11-08 7:26 PM +0200, Tom Lane wrote: >>> The alternative is to artificially serialize the DML CTEs, which >>> while it does have some advantages doesn't seem like a win overall. > >> We've discussed this before and the consensus was that as long as we >> don't change the results, we can optimize the materialization away. > > No, because the problem is mainly about what might happen if > user-defined functions choose to look at the target tables. We can't > really tell what triggers are going to do, to take one item that the > planner has no access to. The relevant thread seems to be this one: http://archives.postgresql.org/pgsql-hackers/2009-11/msg00003.php and I do agree with what you said there. Regards, Marko Tiikkaja