Thread: I'd like to discuss scaleout at PGCon
Hello, I'm going to attend PGCon in Ottawa for the first time, and I'd be happy to meet you. Because I'm visually impaired, I can only sense light. If you see a Japanese man, 171 cm tall, with a white cane, it's probably me. I'd be happy if you talked to me, but as I'm still far from good at listening to and speaking English, I'm sorry if I seem unfriendly or cannot keep a conversation going for long.

I'd like to have a session on scaleout design at the unconference. I've created a wiki page for that (it's still just a memo; I'd like to populate it with you as the discussion in the community progresses). I'd appreciate it if someone could stand with me and facilitate the discussion at the unconference.

https://wiki.postgresql.org/wiki/Scaleout_Design

The background is that our company is faced with an immediate need to develop a read-write scaleout feature on PostgreSQL. We tried Postgres-XL with much hope, but we found it difficult to achieve our performance goal; I will tell you the details at the conference. Personally, Postgres-XL seems to be very nice software, and I feel that the good parts of it should be integrated into core.

I know that many great hackers from 2ndQuadrant, EnterpriseDB, NTT, Postgres Professional, CitusData, and so on are addressing this difficult scaleout feature, and I don't think we are yet competent to lead this development. On the other hand, we have a proprietary RDBMS called Symfoware (I'm sure you don't know it), not based on PostgreSQL, which provides a scaleout feature. Its architecture is a mix of shared nothing and shared everything. It implements deadlock detection and resolution without a central node or periodic monitoring, parallel 2PC across nodes, parallel crash recovery, and client connection routing and failover without any intermediary-middleware overhead during SQL execution. So we may be able to help in some way, and I'd be happy if we could help the community proceed with the development of scaleout.

If you have a session on scaleout outside the unconference, could you call me and let me join it?

By the way, the popularity score of PostgreSQL finally exceeded 400 points in the DB-Engines ranking! The gap with the top products has shrunk greatly. Let's make PostgreSQL more popular.

https://db-engines.com/en/ranking

[as of May 27, 2018]
Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
PostgreSQL=400.90  MongoDB=342.11
(Oracle / PostgreSQL ratio is 3.2)

[as of Feb 2016, according to a memo at hand]
Oracle=1476.14  MySQL=1321.13  SQL Server=??
MongoDB=??  PostgreSQL=288.66
(Oracle / PostgreSQL ratio is 5.1)

Regards
MauMau
On Sun, May 27, 2018 at 1:20 AM, MauMau <maumau307@gmail.com> wrote:
> I'm going to attend PGCon in Ottawa for the first time, and I'd be
> happy to meet you.

It was nice to meet you in person.

> I'd like to have a session on scaleout design at the unconference.
> I've created a wiki page for that (it's still just a memo; I'd like
> to populate it with you as the discussion in the community
> progresses). I'd appreciate it if someone could stand with me and
> facilitate the discussion at the unconference.
>
> https://wiki.postgresql.org/wiki/Scaleout_Design

We didn't have time in the unconference session to discuss these topics in detail, for you have raised many issues here, each of which deserves discussion individually and in detail. I wrote a blog post somewhat related to this topic recently, which you can find at http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html -- it focuses on a somewhat different set of issues than the ones you raise, but that's good, because we need to hear all perspectives.

In terms of high-level architecture, I think you are right to wonder about the possibility of a cloud-native mode based on separating storage and compute. Amazon Aurora offers that feature, but we might want to have it in PostgreSQL.

Another, somewhat different thing that we might want is a fully distributed database, with a distributed buffer cache, distributed lock manager, distributed invalidation queue, etc. That would be like what Oracle RAC does, but it's a tremendous amount of work, and a lot of that work has no value by itself. You don't get the payoff until it's all working. There are a few pieces that might be independently useful, though, like a distributed deadlock detector. The goal of this kind of effort is, I think, to support zillions of connections -- to scale further than you can with just one node. This would be a lot better if combined with the cloud-native storage, but of course that's even more work.

The FDW approach, of which I have been a supporter for some years now, is really aiming at a different target, which is to allow efficient analytics queries across a multi-node cluster. I think we're getting pretty close to being able to do that -- IMHO, the last fundamental building block that we need is asynchronous execution, which Andres is working on. After that, it's a matter of adding other features that people want (like cross-node MVCC) and improving the plans for queries that still don't perform well (like joins that involve redistributing one of the tables involved).

We might not want to confine ourselves strictly to the FDW interface -- for example, I've thought about the idea of introducing a new relkind for a "distributed table". A distributed table may be present on the local node, in which case it can be scanned like a local table, or it may be absent, in which case it can be scanned like a foreign table by connecting to a node on which it is present. The set of nodes on which a table is present is metadata that is shared throughout the cluster. Multi-master logical replication propagates changes between all nodes on which the table is present. With a concept like this, there is a lot of opportunity to optimize queries by, for example, deciding to perform a join on a node on which both input tables are present, to minimize data movement. But even if we create something like this, I see it as fundamentally an extension of the FDW approach that would hopefully manage to reuse a lot of what's already been built there.
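To make the existing building blocks concrete, here is a minimal sketch of the kind of setup the FDW approach already allows today, using declarative partitioning together with postgres_fdw (host and table names are hypothetical, for illustration only, and the remote shard table is assumed to already exist on the other node):

    CREATE EXTENSION postgres_fdw;

    -- One foreign server and user mapping per remote shard node.
    CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node2.example.com', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
        OPTIONS (user 'app');

    -- A hash-partitioned table whose partitions live partly on this
    -- node and partly on the remote node.
    CREATE TABLE accounts (aid bigint, balance numeric)
        PARTITION BY HASH (aid);
    CREATE TABLE accounts_p0 PARTITION OF accounts
        FOR VALUES WITH (MODULUS 2, REMAINDER 0);           -- local shard
    CREATE FOREIGN TABLE accounts_p1 PARTITION OF accounts
        FOR VALUES WITH (MODULUS 2, REMAINDER 1)
        SERVER shard2 OPTIONS (table_name 'accounts_p1');   -- remote shard

A "distributed table" relkind would essentially move this metadata into shared catalogs, so that every node knows where each table is present, instead of each node being configured by hand.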
I don't think we need to (or should) throw away the work that's been done on FDW pushdown and start over -- we should find a way to build on top of it and add ease-of-use and management features. In fact, even if we said that we want a fully distributed database, we'd probably still need some kind of distributed table concept. Unless every node has a full copy of everything in the database, you still need to reason about which data is present on which nodes and optimize queries accordingly.

> By the way, the popularity score of PostgreSQL finally exceeded 400
> points in the DB-Engines ranking! The gap with the top products has
> shrunk greatly. Let's make PostgreSQL more popular.
>
> https://db-engines.com/en/ranking
>
> [as of May 27, 2018]
> Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
> PostgreSQL=400.90  MongoDB=342.11
> (Oracle / PostgreSQL ratio is 3.2)
>
> [as of Feb 2016, according to a memo at hand]
> Oracle=1476.14  MySQL=1321.13  SQL Server=??
> MongoDB=??  PostgreSQL=288.66
> (Oracle / PostgreSQL ratio is 5.1)

From the chart view, in February 2016, SQL Server was at 1150.227 and MongoDB was at 305.599. Generally it looks like the "big three" -- Oracle, MySQL, and SQL Server -- are all slowly declining in the rankings, while PostgreSQL and MongoDB are slowly rising. It's good to see PostgreSQL going up, but we have a long way to go to really be in the big leagues.

I think in the last couple of years we have done a much better job of adding interesting features than we did for a few years before that. Between PostgreSQL 9.6 and 11, we will have added (and in many cases significantly improved) partitioning, logical replication, parallel query, and stored procedures. Along the way we've improved a lot of other things, not least executor performance. I don't know how directly those things affect the DB-Engines ranking, but they are such major things that I have to believe they make PostgreSQL a real possibility for many users who wouldn't previously have found it a viable option, and I assume that's at least part of what is driving the numbers, though there are probably other factors too, such as Oracle's acquisition of MySQL.

I think it's pretty clear that we need to both continue to improve some of these major new features we've added and at the same time keep introducing even more new things if we want to continue to gain market share and mind share. I hope that features like scale-out and also zheap are going to help us continue to whittle away at the gap, and I look forward to seeing what else anyone may have in mind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2018-05-31 11:26 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
> It was nice to meet you in person.

Me too. And it was very kind of you to help me display the wiki page and guide the session. When I first heard your voice at the Developer Meeting, I thought Bruce Momjian was speaking, because your voice sounded similar to his...

> We didn't have time in the unconference session to discuss these
> topics in detail, for you have raised many issues here, each of which
> deserves discussion individually and in detail. I wrote a blog post
> somewhat related to this topic recently, which you can find at
> http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html

Yes, I read that article before PGCon. Your articles are always helpful for grasping the current overall situation of the community.

> In terms of high-level architecture, I think you are right to wonder
> about the possibility of a cloud-native mode based on separating
> storage and compute. Amazon Aurora offers that feature, but we might
> want to have it in PostgreSQL.
>
> Another, somewhat different thing that we might want is a fully
> distributed database, with a distributed buffer cache, distributed
> lock manager, distributed invalidation queue, etc. That would be like
> what Oracle RAC does, but it's a tremendous amount of work, and a lot
> of that work has no value by itself. You don't get the payoff until
> it's all working. There are a few pieces that might be independently
> useful, though, like a distributed deadlock detector. The goal of
> this kind of effort is, I think, to support zillions of connections --
> to scale further than you can with just one node. This would be a lot
> better if combined with the cloud-native storage, but of course that's
> even more work.

Yes, I can understand the difficulty. So I simply wanted to ask for opinions at the unconference on which (hard) direction the community wants to go and what kind of database we ultimately want PostgreSQL to be. Without that fundamental consensus, development work might be wasted, facing objections after patches are submitted. As you mentioned in your blog post and in a past email, I don't think anyone yet has a clear image of what the scaleout of PostgreSQL should be. How should we proceed? Which approach should we take to minimize rework?

a) Define the functional specification together with the basic overall architecture first. (This doesn't mean writing a heavy, detailed design document or manual; I think a general README or wiki would be sufficient.) At that point we can evaluate how the scaleout feature would be used and whether it is reasonably easy to use. Then we can design and code each part with peace of mind -- 2PC, global consistency, failure detection and failover, distributed lock management and deadlock handling, etc.

b) Various developers design and code each part, bring those patches together, and then all try to figure out how to combine them.

I'm in favor of a), at least at the basic architecture level. Otherwise, unhappiness like this could result: "Hey, I made a patch to implement distributed cache management like Oracle Cache Fusion." "No, we don't want features based on a shared-everything architecture."

I anticipated a decision process at the unconference like this:

"Do we want to build on a shared-everything architecture?"
"No, because it limits scalability, requires expensive shared storage, and it won't run on many clouds."
"Then do we want to employ a new architecture like AWS Aurora?"
"That may be interesting.
But AWS could do it because they have an infinitely scalable storage layer that has been built and used for a long time. This architecture may not be an option for us. But let's keep our eye on leveraging the services of the major cloud vendors, just as Vertica has done recently. Cloud services are now like traditional vendor-specific hardware. Maybe PostgreSQL should utilize them just as we use CPU-specific instructions now, and GPUs and persistent memory in the near future."
"Then it seems that we should go with the shared-nothing architecture. Is that OK?"
"Yes."

> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster.

Oh, I didn't know you support the FDW approach mainly for analytics. I guessed the first target was OLTP read-write scalability.

> We might not want to confine ourselves strictly to the FDW interface
> -- for example, I've thought about the idea of introducing a new
> relkind for a "distributed table". A distributed table may be present
> on the local node, in which case it can be scanned like a local table,
> or it may be absent, in which case it can be scanned like a foreign
> table by connecting to a node on which it is present. The set of
> nodes on which a table is present is metadata that is shared
> throughout the cluster. Multi-master logical replication propagates
> changes between all nodes on which the table is present. With a
> concept like this, there is a lot of opportunity to optimize queries
> by, for example, deciding to perform a join on a node on which both
> input tables are present, to minimize data movement.

I agree. XL, Oracle Sharding, and possibly MySQL Cluster do that, too. It seems like a must-do thing.

> But even if we create something like this, I see it as fundamentally
> an extension of the FDW approach that would hopefully manage to reuse
> a lot of what's already been built there. I don't think we need to
> (or should) throw away the work that's been done on FDW pushdown and
> start over -- we should find a way to build on top of it and add
> ease-of-use and management features.

Agreed. I also think we should not write much code from scratch. On the other hand, if we have to support sharding natively, without FDWs, I wonder if we can still reuse the FDW artifacts -- I mean, extract the necessary logic from the FDW code into common functions that the native sharding code also calls.

> In fact, even if we said that we want a fully distributed database,
> we'd probably still need some kind of distributed table concept.
> Unless every node has a full copy of everything in the database, you
> still need to reason about which data is present on which nodes and
> optimize queries accordingly.

Then, how about building the cluster membership management first, including node management and failure detection/failover? I think node management is necessary anyway, and other developers could experiment with other things on top of that cluster infrastructure. Do you think it would be helpful or wasteful? I'm trying to find what we can do for an early scaleout release.

> From the chart view, in February 2016, SQL Server was at 1150.227 and
> MongoDB was at 305.599. Generally it looks like the "big three" --

Thank you for looking at the chart and telling me the figures.
> I think it's pretty clear that we need to both continue to improve
> some of these major new features we've added and at the same time keep
> introducing even more new things if we want to continue to gain market
> share and mind share. I hope that features like scale-out and also
> zheap are going to help us continue to whittle away at the gap, and I
> look forward to seeing what else anyone may have in mind.

Definitely. I couldn't agree more.

Regards
MauMau
On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
> I anticipated a decision process at the unconference like this:
> "Do we want to build on a shared-everything architecture?"
> "No, because it limits scalability, requires expensive shared storage,
> and it won't run on many clouds."
> "Then do we want to employ a new architecture like AWS Aurora?"
> "That may be interesting. But AWS could do it because they have an
> infinitely scalable storage layer that has been built and used for a
> long time. This architecture may not be an option for us. But let's
> keep our eye on leveraging the services of the major cloud vendors,
> just as Vertica has done recently. Cloud services are now like
> traditional vendor-specific hardware. Maybe PostgreSQL should utilize
> them just as we use CPU-specific instructions now, and GPUs and
> persistent memory in the near future."
> "Then it seems that we should go with the shared-nothing architecture.
> Is that OK?"
> "Yes."

I think all of that except for the last two sentences accords with my view, at least. As for the last two sentences, I'm happy to see someone develop some kind of cloud-native storage if they want, but I have no immediate plans in that area myself.

>> The FDW approach, of which I have been a supporter for some years
>> now, is really aiming at a different target, which is to allow
>> efficient analytics queries across a multi-node cluster.
>
> Oh, I didn't know you support the FDW approach mainly for analytics.
> I guessed the first target was OLTP read-write scalability.

That seems like a harder target to me, because you will have an extra hop involved -- SQL from the client to the first server, then via SQL to a second server. The work of parsing and planning also has to be done twice, once for the foreign table and again for the underlying table (see the sketch at the end of this mail). For longer-running queries this overhead doesn't matter as much, but for short-running queries it is significant.

> Agreed. I also think we should not write much code from scratch. On
> the other hand, if we have to support sharding natively, without FDWs,
> I wonder if we can still reuse the FDW artifacts -- I mean, extract
> the necessary logic from the FDW code into common functions that the
> native sharding code also calls.

I wouldn't do it that way. I want to use postgres_fdw as a whole, not break it up into pieces and reuse the individual bits of code. We should also think about whether other FDWs could be part of it, although maybe that's not realistic.

>> In fact, even if we said that we want a fully distributed database,
>> we'd probably still need some kind of distributed table concept.
>> Unless every node has a full copy of everything in the database, you
>> still need to reason about which data is present on which nodes and
>> optimize queries accordingly.
>
> Then, how about building the cluster membership management first,
> including node management and failure detection/failover? I think
> node management is necessary anyway, and other developers could
> experiment with other things on top of that cluster infrastructure.
> Do you think it would be helpful or wasteful? I'm trying to find what
> we can do for an early scaleout release.

I don't know what "node management" and "failure detection/failover" mean specifically. I'd like to hear proposals, though.
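As to the double parsing/planning mentioned above, it is easy to see with postgres_fdw today: the local server parses and plans the statement, and the "Remote SQL" it ships is then parsed and planned again on the remote side. A rough sketch (hypothetical table, output abbreviated):

    EXPLAIN (VERBOSE, COSTS OFF)
    SELECT balance FROM accounts_p1 WHERE aid = 42;

                          QUERY PLAN
    ---------------------------------------------------------
     Foreign Scan on public.accounts_p1
       Output: balance
       Remote SQL: SELECT balance FROM public.accounts_p1
                   WHERE ((aid = 42))

For a single-row primary-key lookup, that second parse/plan plus the extra network hop can easily dominate the total cost.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company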
2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>> Oh, I didn't know you support the FDW approach mainly for analytics.
>> I guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server. The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the underlying
> table. For longer-running queries this overhead doesn't matter as
> much, but for short-running queries it is significant.

Yes, that extra hop and the double parsing/planning were the killer for our performance goal when we tried to meet our customer's scaleout needs with XL. The application executes 82 DML statements in one transaction. Those DMLs consist of INSERTs, UPDATEs and SELECTs that each access only one row by primary key. The target tables are few, so the application PREPAREs a few statements and EXECUTEs them repeatedly. We placed the coordinator node of XL on the same host as the application, and the data nodes and GTM on other individual nodes. XL's response time was 2.4 times that of PostgreSQL, and its throughput (tps) was 43%. Interestingly, perf showed that base_yyparse() was the top CPU consumer on both the coordinator and the data node, whereas on plain PostgreSQL base_yyparse() appeared near the bottom of the ranking. The SQL processor should be one layer, not two.

In the above benchmark, each transaction only accessed data on one data node. That's what sharding principles recommend. The FDW approach would be no problem as long as the application follows the sharding recommendation. But not all applications will, or can, follow it. The above application, which was migrated from a mainframe, uses INSERTs to load data, inserting rows onto various nodes. Considering your concern about double parsing/planning for a local foreign table and a remote real table, wouldn't the FDW approach hit the same wall?

> I don't know what "node management" and "failure detection/failover"
> mean specifically. I'd like to hear proposals, though.

That's nothing special or new. Things like:

* Define a set of nodes that can join the cluster.
* Initialize or configure a node according to its role in the cluster.
* Decommission a node from the cluster.
* Define a node group in which all member nodes have the same data set
  for redundancy.
* One command to start and shut down the entire cluster.
* System tables to display the member nodes and node groups.
* Each node's in-memory view of the current cluster state.
* How each node monitors the other nodes.
* Elect a new primary node within a node group when the current primary
  node fails.
* Whether each node group should be configured with a master-slave
  replication topology, or a multi-master topology like MySQL Group
  Replication.

Some of the above may end up with XL-like things: the pgxc_node/pgxc_group system tables, the pgxc_ctl command, CREATE/DROP NODE/NODE GROUP commands, etc., as sketched below.
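For illustration, XL's existing commands give a feel for the minimum surface area such node management needs (XL syntax as I understand it; host names are hypothetical):

    -- Register cluster members on each coordinator.
    CREATE NODE dn1 WITH (TYPE = 'datanode', HOST = 'node1.example.com', PORT = 15432);
    CREATE NODE dn2 WITH (TYPE = 'datanode', HOST = 'node2.example.com', PORT = 15432);

    -- Group nodes that hold the same data set for redundancy.
    CREATE NODE GROUP rw_group WITH (dn1, dn2);

    -- Inspect the membership this node currently knows about.
    SELECT node_name, node_type, node_host, node_port FROM pgxc_node;

An in-core equivalent would also need the dynamic pieces listed above: liveness monitoring, failover, and a consistent view of the cluster state.

Regards
MauMau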
On Thu, May 31, 2018 at 11:00 PM, MauMau <maumau307@gmail.com> wrote:
> 2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
>> That seems like a harder target to me, because you will have an extra
>> hop involved -- SQL from the client to the first server, then via SQL
>> to a second server. The work of parsing and planning also has to be
>> done twice, once for the foreign table and again for the underlying
>> table. For longer-running queries this overhead doesn't matter as
>> much, but for short-running queries it is significant.
>
> Yes, that extra hop and the double parsing/planning were the killer
> for our performance goal when we tried to meet our customer's scaleout
> needs with XL. The application executes 82 DML statements in one
> transaction. Those DMLs consist of INSERTs, UPDATEs and SELECTs that
> each access only one row by primary key. The target tables are few,
> so the application PREPAREs a few statements and EXECUTEs them
> repeatedly. We placed the coordinator node of XL on the same host as
> the application, and the data nodes and GTM on other individual nodes.

I agree that there's double parsing happening, but I am hesitant to agree with the double planning claim. We do plan, say, a join between two foreign tables on the local server, but only to decide whether it's efficient to join locally or on the foreign server. That means we create foreign paths for scans on the foreign tables (maybe as many parameterized paths as there are join conditions) and one path for the join pushdown; that's it. We then create local join paths, but we need those to decide whether it's efficient to join locally and, if so, which way. We don't create paths corresponding to how the foreign server would plan that join. That's not double planning, since we do not create the same paths both locally and on the foreign server.

In order to avoid double parsing, we might want to find a way to pass a "normalized" parse tree down to the foreign server. We need to normalize the OIDs in the parse tree, since those may be different across the nodes.

>> I don't know what "node management" and "failure detection/failover"
>> mean specifically. I'd like to hear proposals, though.
>
> That's nothing special or new. Things like:

That's a good summary of what we need here. Thanks for the summary.

> * Define a set of nodes that can join the cluster.
> * Initialize or configure a node according to its role in the cluster.
> * Decommission a node from the cluster.
> * Define a node group in which all member nodes have the same data set
>   for redundancy.
> * One command to start and shut down the entire cluster.

Right.

> * System tables to display the member nodes and node groups.

I think we need system tables on each node to store the cluster configuration as seen by that node, not just display it. But that's a bit of a detail.

> * Each node's in-memory view of the current cluster state.
> * How each node monitors the other nodes.

That's where we can take multiple approaches. Either each node monitors the other nodes, or we have a kind of watchdog or a central place (not necessarily a node; a file shared across the nodes might serve that purpose as well) from which we can fetch the "last known" state of the cluster. But that, along with the split-brain problem, is a much larger problem to solve.
I agree we need some way to know the "last known" status of the cluster, and to correct it when nodes have conflicting views, and to do all of that without much communication overhead.

> * Elect a new primary node within a node group when the current
>   primary node fails.

Well, some configurations might want more than one primary or coordinator node. Having a single primary in itself creates an SPOF, which should be avoided. But then there will be some workloads that will be happy with a single primary and the risks arising out of that.

> * Whether each node group should be configured with a master-slave
>   replication topology, or a multi-master topology like MySQL Group
>   Replication.

Instead of a master-slave configuration, we might want to use logical replication or some such method to create replicas of tables on multiple nodes in a cluster and let the optimizer take advantage of that for join push-down or load balancing; see the sketch below.

> Some of the above may end up with XL-like things: the
> pgxc_node/pgxc_group system tables, the pgxc_ctl command, CREATE/DROP
> NODE/NODE GROUP commands, etc.

I agree.
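Stock logical replication already covers the mechanics of keeping such table replicas in sync; what is missing is the catalog and planner awareness. A minimal sketch (table name and connection string are hypothetical):

    -- On the node that owns the table:
    CREATE PUBLICATION dim_pub FOR TABLE dim_customer;

    -- On each node that should hold a replica for local joins:
    CREATE SUBSCRIPTION dim_sub
        CONNECTION 'host=node1.example.com dbname=app'
        PUBLICATION dim_pub;

With a dimension table replicated on every node, a join against a locally stored shard of a fact table can always run without data movement.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company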
On 1 June 2018 at 15:44, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, May 31, 2018 at 11:00 PM, MauMau <maumau307@gmail.com> wrote:
>> Yes, that extra hop and the double parsing/planning were the killer
>> for our performance goal when we tried to meet our customer's
>> scaleout needs with XL. [...]
>
> I agree that there's double parsing happening, but I am hesitant to
> agree with the double planning claim. [...]
>
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree, since those may be different
> across the nodes.

Passing detailed info between servers is exactly what XL does.

It requires us to define a cluster, exactly as XL does.

And yes, it's a good idea to replicate some tables to all nodes, as XL does.

So it seems we have at last some agreement that some of the things XL does are the correct approaches.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1 June 2018 at 04:00, MauMau <maumau307@gmail.com> wrote:
> The SQL processor should be one layer, not two.

For OLTP, that would be best. But it would be restricted to single-node requests, leaving you the problem of how you know ahead of time whether an SQL statement is single-node or not.

Using a central coordinator node allows us to hide the single-node/multi-node decision from the user, which seems essential for general SQL. If you are able to restrict the types of requests users make, then we can do direct access to partitions -- so there is scope for a single-node API, as Mongo provides.

Using a central coordinator also allows multi-node transaction control, global deadlock detection, etc.

And that is why both XL and the "FDW approach" rely on a central coordinator.

FDWs alone are not enough. It is clear that some tighter coupling is required to get things to work well -- for example, supporting SQL query plans that allow for redistribution of data for joins.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree, since those may be different
> across the nodes.

I don't think this is a good idea at all. It breaks any hope of supporting remote servers that are not the identical version to the local one (since their parsetrees might be different). And "normalized OIDs" sounds like "pie in the sky". You might get away with assuming that built-in functions have stable OIDs, but you can't expect that for functions in extensions.

			regards, tom lane
On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection, etc.

But that becomes an SPOF, and then we have to configure a standby for it. I am not saying that that's a bad design, but it's not very good for many workloads. It would be good if we could avoid any "central server" in this configuration.

> And that is why both XL and the "FDW approach" rely on a central
> coordinator.

I don't think we ever specified that the "FDW approach" "relies" on a central coordinator. One could configure and set up a cluster with multiple coordinators using FDWs.

> FDWs alone are not enough. It is clear that some tighter coupling is
> required to get things to work well -- for example, supporting SQL
> query plans that allow for redistribution of data for joins.

I think partitioning + FDWs provide the basic infrastructure for distributing data and for planning queries that work with such data. We need more glue to support node management and cluster configuration. So I agree with your statement. But I think it was clear from the beginning that we need more than FDWs and partitioning.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Jun 1, 2018 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> In order to avoid double parsing, we might want to find a way to pass
>> a "normalized" parse tree down to the foreign server. We need to
>> normalize the OIDs in the parse tree, since those may be different
>> across the nodes.
>
> I don't think this is a good idea at all. It breaks any hope of
> supporting remote servers that are not the identical version to the
> local one (since their parsetrees might be different). And
> "normalized OIDs" sounds like "pie in the sky". You might get away
> with assuming that built-in functions have stable OIDs, but you can't
> expect that for functions in extensions.

Sorry for the confusing writeup. I didn't mean "normalized OIDs", as my last sentence might suggest; I meant a "normalized parse tree", as in the first sentence. In order to normalize parse trees, we need to at least replace the various OIDs in the parse tree with something that the foreign server will understand correctly, like the name of the remote table pointed to by a local foreign table, or (schema-qualified) function names, and so on. There might be more things to "normalize" in the parse tree than OIDs, but I can't think of anything else right now.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, May 30, 2018 at 9:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster. I think we're getting
> pretty close to being able to do that -- IMHO, the last fundamental
> building block that we need is asynchronous execution, which Andres is
> working on. After that, it's a matter of adding other features that
> people want (like cross-node MVCC) and improving the plans for queries
> that still don't perform well (like joins that involve redistributing
> one of the tables involved).

FWIW, distributed analytical queries are the right market to be in. This is the field in which I work, and this is where the action is. I am very, very sure about this. My view is that many of the existing solutions to this problem (in particular the Hadoop-class solutions) have major architectural downsides that make them inappropriate in use cases where Postgres really shines, such as direct hookups to low-latency applications. Postgres is fundamentally a more capable 'node', with its multiple man-millennia of engineering behind it. Unlimited vertical scaling (RAC etc.) is interesting too, but this is not the way the market is moving, as hardware advancements have reduced or eliminated the need for it in many spheres.

The direction of the project is sound, and we are on the cusp of the point where multiple independent coalescing features (FDWs, logical replication, parallel query, executor enhancements) will open new scaling avenues that do not require trading away the many other benefits of SQL that competing contemporary solutions might. The broader development market is starting to realize this, and that is a major driver of the recent upswing in popularity. This is benefiting me tremendously personally, due to having gone 'all-in' with Postgres almost 20 years ago :-D. (Time sure flies.) These are truly wonderful times for the community.

merlin
On 1 June 2018 at 16:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Using a central coordinator also allows multi-node transaction
>> control, global deadlock detection, etc.
>
> But that becomes an SPOF, and then we have to configure a standby for
> it. I am not saying that that's a bad design, but it's not very good
> for many workloads. It would be good if we could avoid any "central
> server" in this configuration.
>
>> And that is why both XL and the "FDW approach" rely on a central
>> coordinator.
>
> I don't think we ever specified that the "FDW approach" "relies" on a
> central coordinator. One could configure and set up a cluster with
> multiple coordinators using FDWs.

Yes, of course. You're just misunderstanding me. I'm talking about a query coordinator "role". There can be many coordinator components, and they can be spread out in various ways, but for any one SQL query there needs to be one coordinator node. Not an SPOF.

>> FDWs alone are not enough. It is clear that some tighter coupling is
>> required to get things to work well -- for example, supporting SQL
>> query plans that allow for redistribution of data for joins.
>
> I think partitioning + FDWs provide the basic infrastructure for
> distributing data and for planning queries that work with such data.
> We need more glue to support node management and cluster
> configuration. So I agree with your statement. But I think it was
> clear from the beginning that we need more than FDWs and partitioning.

No, it wasn't clear. But I'm glad to hear it. It might actually work then.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 2, 2018 at 4:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 1 June 2018 at 16:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> I don't think we ever specified that the "FDW approach" "relies" on a
>> central coordinator. One could configure and set up a cluster with
>> multiple coordinators using FDWs.
>
> Yes, of course. You're just misunderstanding me. I'm talking about a
> query coordinator "role". There can be many coordinator components,
> and they can be spread out in various ways, but for any one SQL query
> there needs to be one coordinator node. Not an SPOF.

In your earlier mail, which is quoted above, you mentioned a central coordinator for multi-node transaction control and global deadlock detection. That doesn't sound like a "query coordinator role". It sounds more like a GTM, which is an SPOF. Anyway, I am happy to clarify that the "FDW approach" relies on a query coordinator: the server which faces the client. But I don't think we have decided how transaction management and deadlock detection would work in a shared-nothing cluster of PostgreSQL servers. There was a discussion at the developer unconference this year, but I was not part of it, as I was holding another session at the same time. Maybe somebody who attended that session can post a summary here or provide a link to a summary written elsewhere.

>> I think partitioning + FDWs provide the basic infrastructure for
>> distributing data and for planning queries that work with such data.
>> We need more glue to support node management and cluster
>> configuration. So I agree with your statement. But I think it was
>> clear from the beginning that we need more than FDWs and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work
> then.

Good to see some agreement.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2 June 2018 at 22:46, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> In your earlier mail, which is quoted above, you mentioned a central
> coordinator for multi-node transaction control and global deadlock
> detection. That doesn't sound like a "query coordinator role". It
> sounds more like a GTM, which is an SPOF.

In XL, the GTM is a single component managing transaction ids. It has a standby, so it is not an SPOF.

But that is not what I mean. I don't believe that a GTM-style component is necessary in a future in-core scalability solution.

Each incoming query needs to be planned and executed from one coordinator component, and then the work is performed across many workers on different nodes (or just one). We could have coordinator components on each worker node, or we could have a set of coordinator nodes and a set of worker nodes.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 3, 2018 at 2:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> In XL, the GTM is a single component managing transaction ids. It has
> a standby, so it is not an SPOF.
>
> But that is not what I mean. I don't believe that a GTM-style
> component is necessary in a future in-core scalability solution.

I agree. I think a distributed algorithm, one which does not need a single GTM-style node, would be better. That automatically provides high availability without configuring a standby.

> Each incoming query needs to be planned and executed from one
> coordinator component, and then the work is performed across many
> workers on different nodes (or just one).

Each node needs to be configured and maintained, which requires effort, so we need to keep the number of nodes to a minimum. With a coordinator/worker segregation, we require at least two nodes in a cluster, and just that configuration doesn't provide much scalability. Having each node function as both coordinator (facing clients) and worker (facing other coordinators) keeps the number of nodes to a minimum, and it is good for HA.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From: Simon Riggs
> Passing detailed info between servers is exactly what XL does.
>
> It requires us to define a cluster, exactly as XL does.
>
> And yes, it's a good idea to replicate some tables to all nodes, as XL
> does.
>
> So it seems we have at last some agreement that some of the things XL
> does are the correct approaches.

Exactly. I was impressed when I read the paper on XC for the first time. I respect what Suzuki-san and the people from NTT, EnterpriseDB, and 2ndQuadrant have done for XC/XL. Just as some people believe we should leverage the FDW artifacts, I think we should leverage the code and ideas of XC/XL.

Regards
MauMau
From: Robert Haas
> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>> Oh, I didn't know you support the FDW approach mainly for analytics.
>> I guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server. The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the underlying
> table. For longer-running queries this overhead doesn't matter as
> much, but for short-running queries it is significant.

From: Simon Riggs
> On 1 June 2018 at 04:00, MauMau <maumau307@gmail.com> wrote:
>> The SQL processor should be one layer, not two.
>
> For OLTP, that would be best. But it would be restricted to
> single-node requests, leaving you the problem of how you know ahead of
> time whether an SQL statement is single-node or not.
>
> Using a central coordinator node allows us to hide the
> single-node/multi-node decision from the user, which seems essential
> for general SQL. If you are able to restrict the types of requests
> users make, then we can do direct access to partitions -- so there is
> scope for a single-node API, as Mongo provides.

I don't think an intermediary server like the coordinators in XL is necessary. That extra hop can be eliminated by putting both the coordinator and the data node roles in the same server process. That is, the node to which an application connects communicates with other nodes only when it does not have the necessary data locally.

Furthermore, an extra hop and double parsing/planning can matter for analytic queries, too. For example, SAP HANA boasts of scanning 1 billion rows in one second. In HANA's scaleout architecture, an application can connect to any worker node, and the node communicates with other nodes only when necessary (there's one special node called the "master", but it manages the catalog and transactions; it's not an extra hop like the coordinator in XL). Vertica is an MPP analytics database, but it doesn't have a node like the coordinator, either. To achieve maximum performance for real-time queries, the scaleout architecture should avoid an extra hop when possible.

> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection, etc.

VoltDB does not have an always-traversed hop like the coordinator in XL. Our proprietary RDBMS named Symfoware, which is not based on PostgreSQL, also doesn't have an extra hop, and it can handle distributed transactions and deadlock detection/resolution without any special node like a GTM.

Regards
MauMau
From: Ashutosh Bapat
> In order to normalize parse trees, we need to at least replace the
> various OIDs in the parse tree with something that the foreign server
> will understand correctly, like the name of the remote table pointed
> to by a local foreign table, or (schema-qualified) function names, and
> so on.

Yes, that's the drawback of each node in the cluster having different OIDs for the same object. That applies to XL, too. In XL, the data node returns the data type names of the columns in the result set to the coordinator. The coordinator then seemed to parse each data type name with base_yyparse() to convert the name to its OID on the coordinator. That's why base_yyparse() appeared at the top of the perf profile. That repeated typename-to-OID conversion could be reduced by caching the conversion results, as PostgreSQL's logical replication does. But managing the catalog in one place and using the same OID values everywhere seems more concise to me as a concept.

Regards
MauMau
From: Ashutosh Bapat
> Each node needs to be configured and maintained, which requires
> effort, so we need to keep the number of nodes to a minimum. With a
> coordinator/worker segregation, we require at least two nodes in a
> cluster, and just that configuration doesn't provide much scalability.
> Having each node function as both coordinator (facing clients) and
> worker (facing other coordinators) keeps the number of nodes to a
> minimum, and it is good for HA.

I think so, too. Every node should behave as both the coordinator and the data node, in XL parlance. But I don't preclude a central node. Some node needs to manage sequences, and it may as well manage the system catalog.

Regards
MauMau
From: Merlin Moncure
> FWIW, distributed analytical queries are the right market to be in.
> This is the field in which I work, and this is where the action is.
> I am very, very sure about this. My view is that many of the existing
> solutions to this problem (in particular the Hadoop-class solutions)
> have major architectural downsides that make them inappropriate in use
> cases where Postgres really shines, such as direct hookups to
> low-latency applications. Postgres is fundamentally a more capable
> 'node', with its multiple man-millennia of engineering behind it.
> Unlimited vertical scaling (RAC etc.) is interesting too, but this is
> not the way the market is moving, as hardware advancements have
> reduced or eliminated the need for it in many spheres.

I'm feeling the same. As Moore's Law ceases to hold, software needs to make the most of the processor power. Hadoop and Spark are written in Java and Scala. According to Google [1] (see Fig. 8), Java is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by 2.5x - 3.6x.

Won't PostgreSQL be able to cover the workloads of Hadoop and Spark someday, when PostgreSQL supports scaleout, an in-memory database, multi-model capability, and an in-database filesystem? That may be a pipe dream, but why should people have to tolerate the separation of the relational data warehouse and the Hadoop-based data lake?

[1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala". Proceedings of Scala Days 2011.

Regards
MauMau
From: Simon Riggs
> On 1 June 2018 at 16:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> I think partitioning + FDWs provide the basic infrastructure for
>> distributing data and for planning queries that work with such data.
>> We need more glue to support node management and cluster
>> configuration. So I agree with your statement. But I think it was
>> clear from the beginning that we need more than FDWs and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work
> then.

I found a possibly interesting description in the XL manual. Although XL performs various pushdowns like the FDW approach, XL seems to perform some kinds of joins with datanode-to-datanode communication. Doesn't this prove that the FDW approach can't handle those joins optimally? What kind of joins use the shared queues? See my guess after the excerpt below.

https://www.postgres-xl.org/documentation/pg-xc-specifics.html

--------------------------------------------------
shared_queues (integer)
Datanode Only

For some joins that occur in queries, data from one Datanode may need to be joined with data from another Datanode. Postgres-XL uses shared queues for this purpose. During execution, each Datanode knows if it needs to produce or consume tuples, or both.

Note that there may be multiple shared_queues used even for a single query. So the value should be set taking into account the number of connections it can accept and the expected number of such joins occurring simultaneously.
--------------------------------------------------
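My guess is that these are joins whose join keys match neither side's distribution. A sketch in XL syntax (hypothetical tables):

    CREATE TABLE orders   (order_id bigint, customer_id bigint)
        DISTRIBUTE BY HASH (order_id);
    CREATE TABLE payments (payment_id bigint, customer_id bigint)
        DISTRIBUTE BY HASH (payment_id);

    -- Neither table is distributed on customer_id, so matching rows may
    -- live on different datanodes; XL has to redistribute one side
    -- through the shared queues before the join can run.
    SELECT o.order_id, p.payment_id
    FROM orders o JOIN payments p ON o.customer_id = p.customer_id;

A pure FDW pushdown model has no equivalent datanode-to-datanode path; the redistributed rows would have to flow through the coordinator.

Regards
MauMau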
On Wed, Jun 06, 2018 at 01:14:04AM +0900, MauMau wrote:
> I don't think an intermediary server like the coordinators in XL is
> necessary. That extra hop can be eliminated by putting both the
> coordinator and the data node roles in the same server process. That
> is, the node to which an application connects communicates with other
> nodes only when it does not have the necessary data locally.

Yes, I agree with that. This was actually a concern I had over the original XC design after a couple of years working on it. The fewer the nodes, the easier the HA, even if applying any PITR logic across N nodes instead of N*2 nodes, with 2PC checks and cleanups, is far from trivial either. It happens that keeping the coordinator and datanode code split was simpler to maintain than merging both, at the cost of operational complexity in running the thing.

> Furthermore, an extra hop and double parsing/planning can matter for
> analytic queries, too. For example, SAP HANA boasts of scanning 1
> billion rows in one second. In HANA's scaleout architecture, an
> application can connect to any worker node, and the node communicates
> with other nodes only when necessary (there's one special node called
> the "master", but it manages the catalog and transactions; it's not an
> extra hop like the coordinator in XL). Vertica is an MPP analytics
> database, but it doesn't have a node like the coordinator, either. To
> achieve maximum performance for real-time queries, the scaleout
> architecture should avoid an extra hop when possible.

Greenplum's orca planner (and Citus?) has such facilities if I recall correctly; I just mention it because pushing compiled plans, ready for execution, directly down to remote nodes exists here and there (that's not the case in XC/XL). For queries whose planning time is way shorter than their actual execution time, like analytical work, that would not matter much; but it would for OLTP and short-transaction workloads.

>> Using a central coordinator also allows multi-node transaction
>> control, global deadlock detection, etc.
>
> VoltDB does not have an always-traversed hop like the coordinator in
> XL.

Greenplum also uses a single-coordinator, multi-datanode instance. That looks similar, right?

> Our proprietary RDBMS named Symfoware, which is not based on
> PostgreSQL, also doesn't have an extra hop, and it can handle
> distributed transactions and deadlock detection/resolution without any
> special node like a GTM.

Interesting to know. This is an area with difficult problems. The closer you stay merged with Postgres head, the more fun (?) you get trying to support new SQL features, and sometimes you finish with hard ERRORs or extra GUC switches to prevent any kind of inconsistent operations.
--
Michael
On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote: > From: Ashutosh Bapat >> In order to normalize parse trees, we need to at >> least replace various OIDs in the parse tree with something that the >> foreign server will understand correctly, like the table name of the >> foreign table pointed to by the local foreign table OR (schema qualified) >> function names and so on. > > Yes, that's the drawback of each node in the cluster having different > OIDs for the same object. That applies to XL, too. Keeping OIDs the same across the nodes would require extra communication between nodes to keep track of the next OID, dropped OIDs etc. We need to weigh the time spent in that communication against the time saved during parsing. > In XL, the data > node returns the data type names of the columns in the result set to > the coordinator. Then the coordinator seemed to parse each data type > name with base_yyparse() to convert the name to its OID on the > coordinator. That's why base_yyparse() appeared at the top in the > perf profile. I do not understand why we need base_yyparse() to parse a type name. We already have functions specifically for parsing object names. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 5 June 2018 at 17:14, MauMau <maumau307@gmail.com> wrote: > Furthermore, an extra hop and double parsing/planning could matter for > analytic queries, too. For example, SAP HANA boasts of scanning 1 > billion rows in one second. In HANA's scaleout architecture, an > application can connect to any worker node and the node communicates > with other nodes only when necessary (there's one special node called > "master", but it manages the catalog and transactions; it's not an > extra hop like the coordinator in XL). Vertica is an MPP analytics > database, but it doesn't have a node like the coordinator, either. To > achieve maximum performance for real-time queries, the scaleout > architecture should avoid an extra hop when possible. Agreed. When possible. When is it possible? Two ways I know of are: 1. have pre-knowledge in the client of where data is located (difficult to scale) 2. you must put data in all places the client can connect to (i.e. multimaster) Perhaps there are more? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05.06.2018 20:17, MauMau wrote: > From: Merlin Moncure >> FWIW, distributed analytical queries is the right market to be in. >> This is the field in which I work, and this is where the action is at. >> I am very, very, sure about this. My view is that many of the >> existing solutions to this problem (in particular hadoop class >> solutions) have major architectural downsides that make them >> inappropriate in use cases that postgres really shines at; direct >> hookups to low latency applications for example. postgres is >> fundamentally a more capable 'node' with its multiple man-millennia of >> engineering behind it. Unlimited vertical scaling (RAC etc) is >> interesting too, but this is not the way the market is moving as >> hardware advancements have reduced or eliminated the need for that in >> many spheres. > I'm feeling the same. As Moore's Law ceases to hold, software > needs to make the most of the processor power. Hadoop and Spark are > written in Java and Scala. According to Google [1] (see Fig. 8), Java > is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by > 2.5x - 3.6x. > > Won't PostgreSQL be able to cover the workloads of Hadoop and Spark > someday, when PostgreSQL supports scaleout, in-memory database, > multi-model capability, and in-database filesystem? That may be a > pipedream, but why do people have to tolerate the separation of the > relational-based data warehouse and Hadoop-based data lake? > > > [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala". > Proceedings of Scala Days 2011 > > Regards > MauMau > > I cannot completely agree with that. I have done a lot of benchmarking of PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H queries. The picture is not so obvious... All these systems provide different scalability and so show their best performance on different hardware configurations. Also, the Java JIT has made good progress since 2011. Calculation-intensive code (like matrix multiplication) implemented in Java is about 2 times slower than optimized C code. But DBMSes are rarely CPU-bound. Even if the whole database fits in memory (which is not so common a scenario for big data applications), the speed of modern CPUs is much higher than RAM access speed... Java applications are slower than C/C++ mostly because of garbage collection. This is why SparkSQL is moving to an off-heap approach where objects are allocated outside the Java heap and so do not affect Java GC. New versions of SparkSQL with off-heap memory and native code generation show very good performance. And high scalability has always been one of the major features of SparkSQL. So it is naive to expect that Postgres will be 4 times faster than SparkSQL on analytic queries just because it is written in C and SparkSQL in Scala. Postgres has made very good progress in OLAP support in recent releases: it now supports parallel query execution, JIT, partitioning... But its scalability is still very limited compared with SparkSQL. I am not sure about Greenplum with its sophisticated distributed query optimizer, but most other OLAP solutions for Postgres are not able to efficiently handle complex queries (with a lot of joins on non-partitioning keys). I do not want to say that it is not possible to implement a good analytic platform for OLAP on top of Postgres. But it is a very challenging task. And IMHO the choice of programming language is not so important. What is more important is the format in which data is stored. The best systems for data analytics -- Vertica, HyPer, KDB, ... -- use a vertical (columnar) data model. SparkSQL also uses the Parquet file format, which provides efficient extraction and processing of data. With the abstract storage API, Postgres is also given a chance to implement efficient storage for OLAP data processing. But a huge amount of work has to be done here. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
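P.S. To make the "vertical data model" point concrete, here is a tiny standalone C sketch (mine, not taken from Vertica or any system mentioned) contrasting a row layout with a column layout for a single aggregate. The columnar loop reads only the bytes the query needs, while the row loop drags every tuple's unused columns through the cache; that memory-bandwidth difference, not the implementation language, is the big lever for analytics.
--------------------------------------------------
#include <stdio.h>
#include <stdlib.h>

#define NROWS 1000000

/* Row (horizontal) layout: whole tuples stored together */
typedef struct
{
    int    id;
    double price;
    char   payload[48];    /* columns the query never touches */
} Row;

int main(void)
{
    Row    *rows  = malloc(NROWS * sizeof(Row));
    double *price = malloc(NROWS * sizeof(double)); /* column layout */
    double  sum_row = 0, sum_col = 0;

    for (int i = 0; i < NROWS; i++)
    {
        rows[i].id = i;
        rows[i].price = i * 0.01;
        price[i] = i * 0.01;
    }

    /* "SELECT sum(price)": the row scan touches ~64 bytes per tuple,
     * the column scan only 8 sequential bytes per value. */
    for (int i = 0; i < NROWS; i++)
        sum_row += rows[i].price;
    for (int i = 0; i < NROWS; i++)
        sum_col += price[i];

    printf("row: %.2f col: %.2f\n", sum_row, sum_col);
    free(rows);
    free(price);
    return 0;
}
--------------------------------------------------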
From: Ashutosh Bapat > Keeping OIDs the same across the nodes would require extra communication > between nodes to keep track of the next OID, dropped OIDs etc. We need to > weigh the time spent in that communication against the time saved during > parsing. If we manage the system catalog for cluster-wide objects (tables, indexes, databases, users, and almost all other objects) on one central node, and separate the OID space for cluster-wide objects from that for node-local objects (temporary tables, tablespaces, and what else?), we don't need to keep track of OIDs across nodes. >> In XL, the data >> node returns the data type names of the columns in the result set to >> the coordinator. Then the coordinator seemed to parse each data type >> name with base_yyparse() to convert the name to its OID on the >> coordinator. That's why base_yyparse() appeared at the top in the >> perf profile. > > I do not understand why we need base_yyparse() to parse a type name. > We already have functions specifically for parsing object names. Looking at the XL source code, the following sequence of functions is called when the coordinator handles the Row Description message ('T') from the data node. I guess the parsing is necessary to process type names combined with type modifiers, e.g. "char(100)". create_tuple_desc parseTypeString typeStringToTypeName raw_parser Regards MauMau
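P.S. For reference, here is roughly what that call chain does: a hypothetical backend-side snippet (PG 10-era signature of parseTypeString(); it has to run inside a transaction in a server backend, so treat this as a sketch, not working client code). This is the work the XL coordinator performs for every column of every Row Description message.
--------------------------------------------------
#include "postgres.h"
#include "parser/parse_type.h"

/* Resolve a type name such as "char(100)" to (OID, typmod), the way
 * the XL coordinator does per result-set column. */
static void
resolve_type_name(const char *typname)
{
    Oid   typid;
    int32 typmod;

    /* parseTypeString() -> typeStringToTypeName() -> raw_parser():
     * a full grammar pass just to turn "char(100)" into (OID, 100) */
    parseTypeString(typname, &typid, &typmod, false);

    elog(LOG, "type \"%s\" resolved to OID %u, typmod %d",
         typname, typid, typmod);
}
--------------------------------------------------
If cluster-wide objects lived in one central catalog with a shared OID space as suggested above, the coordinator could instead trust an (OID, typmod) pair sent by the data node and skip the grammar pass entirely.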
From: Michael Paquier > Greenplum's orca planner (and Citus?) have such facilities if I recall > correctly; just mentioning that pushing compiled plans ready for execution > directly down to remote nodes exists here and there (that's not the case > for XC/XL). For queries whose planning time is far shorter than their > actual execution, like analytical work, that would not matter much. But > it does for OLTP and short-transaction workloads. It seems that Greenplum does: https://greenplum.org/docs/580/admin_guide/query/topics/parallel-proc.html#topic1 "The master receives, parses, and optimizes the query. The resulting query plan is either parallel or targeted. The master dispatches parallel query plans to all segments,..." while Citus doesn't: https://docs.citusdata.com/en/v7.4/develop/reference_processing.html#citus-query-processing "Next, the planner breaks the query into two parts - the coordinator query which runs on the coordinator and the worker query fragments which run on individual shards on the workers. The planner then assigns these query fragments to the workers such that all their resources are used efficiently. After this step, the distributed query plan is passed on to the distributed executor for execution. ... Once the distributed executor sends the query fragments to the workers, they are processed like regular PostgreSQL queries. The PostgreSQL planner on that worker chooses the most optimal plan for executing that query locally on the corresponding shard table. The PostgreSQL executor then runs that query and returns the query results back to the distributed executor." "The task tracker executor is designed to efficiently handle complex queries which require repartitioning and shuffling intermediate data among workers." BTW, the above page states that worker nodes directly exchange data during query execution. Greenplum also does so among segment nodes to join tables which are distributed on different key columns. XL seems to do so, too. If this type of interaction is necessary, how would the FDW approach do it? The remote servers would need to interact with each other. > Greenplum also uses a single-coordinator, multi-datanode setup. That > looks similar, right? Greenplum uses a single master and multiple workers. That's similar to Citus. But Greenplum is not similar to VoltDB or Vertica, since those allow applications to connect to any node. >> Our proprietary RDBMS named Symfoware, which is not based on >> PostgreSQL, also doesn't have an extra hop, and can handle distributed >> transactions and deadlock detection/resolution without any special >> node like GTM. > > Interesting to know that. This is an area with difficult problems. The > closer you stay merged with Postgres head, the more fun (?) you get > trying to support new SQL features, and sometimes you finish with hard > ERRORs or extra GUC switches to prevent any kind of inconsistent > operation. Yes, I hope our deadlock detection/resolution can be ported to PostgreSQL. But I'm also concerned like you, because Symfoware is locking-based, not MVCC-based. Regards MauMau
From: Simon Riggs On 5 June 2018 at 17:14, MauMau <maumau307@gmail.com> wrote: >> Furthermore, an extra hop and double parsing/planning could matter for >> analytic queries, too. For example, SAP HANA boasts of scanning 1 >> billion rows in one second. In HANA's scaleout architecture, an >> application can connect to any worker node and the node communicates >> with other nodes only when necessary (there's one special node called >> "master", but it manages the catalog and transactions; it's not an >> extra hop like the coordinator in XL). Vertica is an MPP analytics >> database, but it doesn't have a node like the coordinator, either. To >> achieve maximum performance for real-time queries, the scaleout >> architecture should avoid an extra hop when possible. > Agreed. When possible. > > When is it possible? > > Two ways I know of are: > > 1. have pre-knowledge in the client of where data is located > (difficult to scale) > 2. you must put data in all places the client can connect to (i.e. multimaster) Regarding 1, I understood "difficult to scale" to mean that whenever the user adds/removes a node from a cluster and data placement changes, he has to change his application's connection string to point to the node where the necessary data resides. Oracle Sharding provides a solution for that problem. See "6.1 Direct Routing to a Shard" in the following manual page: https://docs.oracle.com/en/database/oracle/oracle-database/18/shard/sharding-data-routing.html#GUID-64CAD794-FAAA-406B-9E20-0C35E96D3FA8 [Excerpt] "Oracle clients and connection pools are able to recognize sharding keys specified in the connection string for high performance data dependent routing. A shard routing cache in the connection layer is used to route database requests directly to the shard where the data resides." > Perhaps there are more? Please see 6.1.1 below. The application can connect to multiple nodes and distribute processing without an extra hop. This is also an interesting idea, isn't it? https://docs.voltdb.com/UsingVoltDB/ChapAppDesign.php#DesignAppConnectMulti Regards MauMau
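P.S. To make option 1 concrete, a toy client-side router could look like the sketch below (the node list and hash function are made up for illustration; Oracle Sharding keeps a routing cache keyed by the sharding key rather than a fixed hash). It also shows why Simon calls this "difficult to scale": adding or removing a node changes the modulus and invalidates the whole mapping, unless something like consistent hashing or a refreshable routing table is used.
--------------------------------------------------
#include <stdio.h>
#include <stdint.h>

/* Hypothetical node list the client knows in advance */
static const char *nodes[] = {
    "host=node0 port=5432",
    "host=node1 port=5432",
    "host=node2 port=5432",
};
#define NNODES (sizeof(nodes) / sizeof(nodes[0]))

/* FNV-1a: any stable hash works, as long as every client and the
 * cluster's data placement agree on it */
static uint32_t hash_key(const char *key)
{
    uint32_t h = 2166136261u;
    for (; *key; key++)
    {
        h ^= (uint8_t) *key;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    const char *sharding_key = "customer_4711";

    /* Route directly to the shard owning this key: no extra hop */
    printf("connect to: %s\n", nodes[hash_key(sharding_key) % NNODES]);
    return 0;
}
--------------------------------------------------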
On 2018-Jun-06, Ashutosh Bapat wrote: > On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote: > > From: Ashutosh Bapat > >> In order to normalize parse trees, we need to at least replace > >> various OIDs in the parse tree with something that the foreign server > >> will understand correctly, like the table name of the foreign table > >> pointed to by the local foreign table OR (schema qualified) function > >> names and so on. > > > > Yes, that's the drawback of each node in the cluster having > > different OIDs for the same object. That applies to XL, too. > > Keeping OIDs the same across the nodes would require extra communication > between nodes to keep track of the next OID, dropped OIDs etc. We need to > weigh the time spent in that communication against the time saved during > parsing. We already have the ability to give objects some predetermined OID, for pg_upgrade. Maybe an easy (hah) thing to do is use 2PC for DDL, agree on an OID that's free on every node, then create the object on all servers at the same time. We currently use the system-wide OID generator to assign the OID, but that seems an easy thing to change (much harder is to prevent concurrent creation of objects using the arranged OID; maybe we can reuse speculative tokens in btrees for this). Doing this imposes a cost at DDL-execution time only, which seems much better than imposing the cost of translating name to OID on every server for every query. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
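P.S. A minimal sketch of the 2PC part of this idea using libpq (node names are made up, error handling is abbreviated, and each node needs max_prepared_transactions > 0). The "agree on a free OID" step is omitted because it has no SQL-level interface today -- only pg_upgrade's binary-upgrade mode can dictate OIDs -- so the sketch just runs one DDL on every node under two-phase commit.
--------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

#define NNODES 3

static int exec_ok(PGconn *c, const char *sql)
{
    PGresult *r = PQexec(c, sql);
    int ok = (PQresultStatus(r) == PGRES_COMMAND_OK);

    PQclear(r);
    return ok;
}

int main(void)
{
    const char *conninfo[NNODES] = {
        "host=node0", "host=node1", "host=node2"   /* hypothetical */
    };
    PGconn *conn[NNODES];
    int i, prepared = 1;

    for (i = 0; i < NNODES; i++)
    {
        conn[i] = PQconnectdb(conninfo[i]);
        if (PQstatus(conn[i]) != CONNECTION_OK)
            exit(1);        /* all nodes must be up for this to work */
    }

    /* Phase 1: run the DDL and PREPARE TRANSACTION on every node */
    for (i = 0; i < NNODES; i++)
        if (!exec_ok(conn[i], "BEGIN") ||
            !exec_ok(conn[i], "CREATE TABLE t (a int)") ||
            !exec_ok(conn[i], "PREPARE TRANSACTION 'ddl_t'"))
            prepared = 0;

    /* Phase 2: commit everywhere, or roll back everywhere (rolling
     * back a never-prepared gid just fails harmlessly in this sketch) */
    for (i = 0; i < NNODES; i++)
        exec_ok(conn[i], prepared ? "COMMIT PREPARED 'ddl_t'"
                                  : "ROLLBACK PREPARED 'ddl_t'");

    for (i = 0; i < NNODES; i++)
        PQfinish(conn[i]);
    return 0;
}
--------------------------------------------------
Note the exit when a node is down: that requirement is the obvious operational weakness of the scheme.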
From: Alvaro Herrera > Maybe an easy (hah) thing to do is use 2PC for DDL, agree on an OID > that's free on every node, then create the object on all servers at the > same time. We currently use the system-wide OID generator to assign the > OID, but that seems an easy thing to change (much harder is to prevent > concurrent creation of objects using the arranged OID; maybe we can reuse > speculative tokens in btrees for this). Would it mean that DDL execution requires all (primary) nodes to be running? I don't say it's an immediate problem yet, but FYI, Oracle Sharding records all DDL statements and re-sends them to the down nodes later. > Doing this imposes a cost at > DDL-execution time only, which seems much better than imposing the cost > of translating name to OID on every server for every query. Agreed. Regards MauMau
On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote: > From: Ashutosh Bapat >> Keeping OIDs the same across the nodes would require extra communication >> between nodes to keep track of the next OID, dropped OIDs etc. We need to >> weigh the time spent in that communication against the time saved during >> parsing. > > If we manage the system catalog for cluster-wide objects (tables, > indexes, databases, users, and almost all other objects) on one > central node, and separate the OID space for cluster-wide objects from > that for node-local objects (temporary tables, tablespaces, and > what else?), we don't need to keep track of OIDs across nodes. I think that will lead to many more network trips to the central node in order to fetch OIDs of various objects. We could reduce that by using node-local caches, but invalidation messages will still eat some bandwidth. The network trips in this case may be more than what are required for keeping track of the OID space across nodes. Do you see something like this in other similar products? >>> In XL, the data >>> node returns the data type names of the columns in the result set to >>> the coordinator. Then the coordinator seemed to parse each data type >>> name with base_yyparse() to convert the name to its OID on the >>> coordinator. That's why base_yyparse() appeared at the top in the >>> perf profile. >> >> I do not understand why we need base_yyparse() to parse a type name. >> We already have functions specifically for parsing object names. > > Looking at the XL source code, the following sequence of functions is > called when the coordinator handles the Row Description message ('T') > from the data node. I guess the parsing is necessary to process type > names combined with type modifiers, e.g. "char(100)". > > create_tuple_desc > parseTypeString > typeStringToTypeName > raw_parser Hmm, strange. I had not seen that in Postgres-XC. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Jun 6, 2018 at 11:46 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2018-Jun-06, Ashutosh Bapat wrote: > >> On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote: >> > From: Ashutosh Bapat >> >> In order to normalize parse trees, we need to at least replace >> >> various OIDs in the parse tree with something that the foreign server >> >> will understand correctly, like the table name of the foreign table >> >> pointed to by the local foreign table OR (schema qualified) function >> >> names and so on. >> > >> > Yes, that's the drawback of each node in the cluster having >> > different OIDs for the same object. That applies to XL, too. >> >> Keeping OIDs the same across the nodes would require extra communication >> between nodes to keep track of the next OID, dropped OIDs etc. We need to >> weigh the time spent in that communication against the time saved during >> parsing. > > We already have the ability to give objects some predetermined OID, for > pg_upgrade. True. But that's only for a database not in action. We are talking about a database in action. Assigning a predetermined OID is just one of, and possibly the smallest thing in, the bigger picture. > > Maybe an easy (hah) thing to do is use 2PC for DDL, agree on an OID > that's free on every node, then create the object on all servers at the > same time. We currently use the system-wide OID generator to assign the > OID, but that seems an easy thing to change (much harder is to prevent > concurrent creation of objects using the arranged OID; maybe we can reuse > speculative tokens in btrees for this). Doing this imposes a cost at > DDL-execution time only, which seems much better than imposing the cost > of translating name to OID on every server for every query. This works if we consider that all the nodes are always up. If a few nodes are down, the rest of the nodes need to determine the OID and communicate it to the failed nodes when they come up. That's easier said than done. The moment we design something like that, we have to deal with the split-brain problem. Two sets of nodes which each think the other set is down will keep assigning OIDs that they think are OK and later see the conflicts when communicating the assigned OIDs. Not that we cannot implement something like this, but it is a lot of work. We will need to be careful to identify the cases where the scheme will fail and plug all the holes. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Thu, Jun 07, 2018 at 10:28:15AM +0530, Ashutosh Bapat wrote: > On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote: >> Looking at the XL source code, the following sequence of functions is >> called when the coordinator handles the Row Description message ('T') >> from the data node. I guess the parsing is necessary to process type >> names combined with type modifiers, e.g. "char(100)". >> >> create_tuple_desc >> parseTypeString >> typeStringToTypeName >> raw_parser > > Hmm, strange. I had not seen that in Postgres-XC. Where is that located in the XC code? If you point out the location it may revive some past memories from either Ashutosh or me. -- Michael
On Thu, Jun 7, 2018 at 10:53 AM, Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Jun 07, 2018 at 10:28:15AM +0530, Ashutosh Bapat wrote: >> On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote: >>> Looking at the XL source code, the following sequence of functions is >>> called when the coordinator handles the Row Description message ('T') >>> from the data node. I guess the parsing is necessary to process type >>> names combined with type modifiers, e.g. "char(100)". >>> >>> create_tuple_desc >>> parseTypeString >>> typeStringToTypeName >>> raw_parser >> >> Hmm, strange. I had not seen that in Postgres-XC. > > Where is that located in the XC code? If you point out the location > it may revive some past memories from either Ashutosh or me. I think the path will exist in XC as well, since that's a regular PostgreSQL path, but I didn't see it being a hot path in XC. So, we will need to find the workloads/cases where it's a hot path. > -- > Michael -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
From: Michael Paquier [mailto:michael@paquier.xyz] > > On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote: > >> Looking at the XL source code, the following sequence of functions > >> is called when the coordinator handles the Row Description message > >> ('T') from the data node. I guess the parsing is necessary to > >> process type names combined with type modifiers, e.g. "char(100)". > >> > >> create_tuple_desc > >> parseTypeString > >> typeStringToTypeName > >> raw_parser > > > > Hmm, strange. I had not seen that in Postgres-XC. > > Where is that located in the XC code? If you point out the location > it may revive some past memories from either Ashutosh or me. The XL-specific code path is this in src/backend/pgxc/pool/execRemote.c (I don't know whether this also exists in XC): handle_response -> HandleRowDescription -> create_tuple_desc I think the rest exists in PostgreSQL: parseTypeString -> typeStringToTypeName -> raw_parser Regards Takayuki Tsunakawa
> From: Konstantin Knizhnik [mailto:k.knizhnik@postgrespro.ru] > I cannot completely agree with that. I have done a lot of benchmarking of > PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for > TPC-H queries. Wow, you have amazingly abundant experience. > I do not want to say that it is not possible to implement a good analytic > platform for OLAP on top of Postgres. But it is a very challenging task. > And IMHO the choice of programming language is not so important. What is > more important is the format in which data is stored. The best systems for > data analytics -- Vertica, HyPer, KDB, ... -- use a vertical (columnar) data > model. SparkSQL also uses the Parquet file format, which provides efficient > extraction and processing of data. > With the abstract storage API, Postgres is also given a chance to implement > efficient storage for OLAP data processing. But a huge amount of work has > to be done here. Agreed on the huge amount of work... And I must admit my dream is a pipedream now. Regards Takayuki Tsunakawa
On Fri, Jun 1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote: > FWIW, distributed analytical queries is the right market to be in. > This is the field in which I work, and this is where the action is at. > I am very, very, sure about this. My view is that many of the > existing solutions to this problem (in particular hadoop class > solutions) have major architectural downsides that make them > inappropriate in use cases that postgres really shines at; direct > hookups to low latency applications for example. postgres is > fundamentally a more capable 'node' with its multiple man-millennia of > engineering behind it. Unlimited vertical scaling (RAC etc) is > interesting too, but this is not the way the market is moving as > hardware advancements have reduced or eliminated the need for that in > many spheres. > > The direction of the project is sound and we are on the cusp of the > point where multiple independent coalescing features (FDW, logical > replication, parallel query, executor enhancements) will open new > scaling avenues that will not require trading off the many other > benefits of SQL that competing contemporary solutions might. The > broader development market is starting to realize this and that is a > major driver of the recent upswing in popularity. This is benefiting > me tremendously personally due to having gone 'all-in' with postgres > almost 20 years ago :-D. (Time sure flies) These are truly > wonderful times for the community. I am coming in late, but I am glad we are having this conversation. We have made great strides toward sharding while adding minimal sharding-specific code. We can now see a time when we will complete the minimal sharding-specific code tasks. Once we reach that point, we will need to decide what sharding-specific code to add, and to do that, we need to understand which direction to go in, and to do that, we need to know the trade-offs. While I am glad people know a lot about how other projects handle sharding, these can be only guides to how Postgres will handle such workloads. I think we need to get to a point where we have all of the minimal sharding-specific code features done, at least as proof-of-concept, and then test Postgres with various workloads like OLTP/OLAP and read-write/read-only. This will tell us where sharding-specific code will have the greatest impact. What we don't want to do is to add a bunch of sharding-specific code without knowing which workloads it benefits, and how many of our users will actually use sharding. Some projects have done that, and it didn't end well, since they then had a lot of product complexity with little user value. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Fri, Jun 1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote: > > FWIW, distributed analytical queries is the right market to be in. > > This is the field in which I work, and this is where the action is at. > > I am very, very, sure about this. My view is that many of the > > existing solutions to this problem (in particular hadoop class > > solutions) have major architectural downsides that make them > > inappropriate in use cases that postgres really shines at; direct > > hookups to low latency applications for example. postgres is > > fundamentally a more capable 'node' with its multiple man-millennia of > > engineering behind it. Unlimited vertical scaling (RAC etc) is > > interesting too, but this is not the way the market is moving as > > hardware advancements have reduced or eliminated the need for that in > > many spheres. > > > > The direction of the project is sound and we are on the cusp of the > > point where multiple independent coalescing features (FDW, logical > > replication, parallel query, executor enhancements) will open new > > scaling avenues that will not require trading off the many other > > benefits of SQL that competing contemporary solutions might. The > > broader development market is starting to realize this and that is a > > major driver of the recent upswing in popularity. This is benefiting > > me tremendously personally due to having gone 'all-in' with postgres > > almost 20 years ago :-D. (Time sure flies) These are truly > > wonderful times for the community. > > While I am glad people know a lot about how other projects handle > sharding, these can be only guides to how Postgres will handle such > workloads. I think we need to get to a point where we have all of the > minimal sharding-specific code features done, at least as > proof-of-concept, and then test Postgres with various workloads like > OLTP/OLAP and read-write/read-only. This will tell us where > sharding-specific code will have the greatest impact. > > What we don't want to do is to add a bunch of sharding-specific code > without knowing which workloads it benefits, and how many of our users > will actually use sharding. Some projects have done that, and it > didn't end well, since they then had a lot of product complexity with > little user value. Key features from my perspective: *) fdw in parallel. how do i do it today? ghetto implemented parallel queries with asynchronous dblink *) column store *) automatic partition management through shards probably some more, gotta run :-) merlin
On 06/22/2018 11:28 AM, Merlin Moncure wrote: > > Key features from my perspective: > *) fdw in parallel. how do i do it today? ghetto implemented parallel > queries with asynchronous dblink > > *) column store Although not in core, we do have this as an extension through Citus, don't we? JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org ***** Unless otherwise stated, opinions are my own. *****
On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote: > On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > What we don't want to do is to add a bunch of sharding-specific code > > without knowing which workloads it benefits, and how many of our users > > will actually use sharding. Some projects have done that, and it > > didn't end well, since they then had a lot of product complexity with > > little user value. > > Key features from my perspective: > *) fdw in parallel. how do i do it today? ghetto implemented parallel > queries with asynchronous dblink Andres has outlined what needs to be done here: https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de > *) column store This could be part of the pluggable storage engine. > *) automatic partition management through shards Yes, but I am afraid we need to know where we are going before we can implement management. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sat, Jun 23, 2018 at 12:41:00PM +1000, Haribabu Kommi wrote: > > On Sat, Jun 23, 2018 at 12:17 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote: > > On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > > > What we don't want to do is to add a bunch of sharding-specific code > > > without knowing which workloads it benefits, and how many of our users > > > will actually use sharding. Some projects have done that, and it > > > didn't end well, since they then had a lot of product complexity with > > > little user value. > > > > Key features from my perspective: > > *) fdw in parallel. how do i do it today? ghetto implemented parallel > > queries with asynchronous dblink > > Andres has outlined what needs to be done here: > > https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de > > Sorry if this has already been discussed up-thread. > > I would just like to bring out an idea: scale out by adding many instances > that share the lock manager and buffer pool across all instances with > the help of remote direct memory access (RDMA). > > By adding a pluggable buffer pool and lock manager, how about adding > many instances that all share the buffers using RDMA, to provide > better scaling with shared everything? > > Currently I don't have any idea whether it is possible or not, nor of > the problems in using RDMA. > > I just want to check whether this is an idea worth considering for > scale-out. Yes, Robert Haas did mention this. It might be something we consider much later. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hello, hackers I'm very sorry to have left this thread for a long time. I've come out of hibernation, pushed by the need. Please let me resume the discussion on the scale-out design. I'll read this thread and related ones again to refresh my memory, and I'd like to continue to assemble ideas and opinions in the following wiki page: Scaleout Design - PostgreSQL wiki https://wiki.postgresql.org/wiki/Scaleout_Design I know there are lots of topics to decide in order to formulate the architectural design and functional specification. To get the most with minimal waste of effort, I think we need to clarify what we want to achieve. Some of the most important topics are: * What workloads do we target? Does Postgres give up on data warehousing capability that's comparable with existing products and cloud services? * What architecture(s) do we want? Shared nothing, shared disk, a combination of them, or a totally new one. * FDW or non-FDW approach? (As Simon and I mentioned in this thread, I don't think FDW is suitable for the scale-out, although we should try to reuse the code in postgres_fdw.) From: Bruce Momjian <bruce@momjian.us> > On Sat, Jun 23, 2018 at 12:41:00PM +1000, Haribabu Kommi wrote: > > I would just like to bring out an idea: scale out by adding many instances > > that share the lock manager and buffer pool across all instances with > > the help of remote direct memory access (RDMA). > > > > By adding a pluggable buffer pool and lock manager, how about adding > > many instances that all share the buffers using RDMA, to provide > > better scaling with shared everything? > > > > Currently I don't have any idea whether it is possible or not, nor of > > the problems in using RDMA. > > > > I just want to check whether this is an idea worth considering for > > scale-out. > > Yes, Robert Haas did mention this. It might be something we consider > much later. I said that we wouldn't need shared disk scale-out at or around the PGCon 2018 developer unconference. But after that, I realized Oracle RAC users want shared disk scale-out for Postgres. So, I wrote about the comparison of shared nothing and shared disk, and the rough design of shared disk scale-out, as in the attached PDF file (I also attached it to the above wiki.) The attached file is what I used at PostgreSQL Conference Japan 2019 to ask how many users want shared disk scale-out. 25 out of 53 participants raised their hands to show their feeling of "want it" or "good to have." That was many more people than I had expected. On the other hand, as I questioned in the last slide of the presentation, I'm not sure if we really need shared disk scale-out in this era when really powerful machines are available even on public clouds. Should we skip shared disk scale-out and just pursue single-node scale-up and shared nothing scale-out? Amazon Aurora as an Alternative to Oracle RAC https://aws.amazon.com/jp/blogs/database/amazon-aurora-as-an-alternative-to-oracle-rac/ "Stepping back and looking at the bigger picture, Amazon Aurora introduces a simplified solution that can function as an Oracle RAC alternative for many typical OLTP applications that need high performance writes, scalable reads, very high levels of high availability with lower operational overhead."
Migration Complete -- Amazon’s Consumer Business Just Turned off its Final Oracle Database | AWS News Blog https://aws.amazon.com/jp/blogs/aws/migration-complete-amazons-consumer-business-just-turned-off-its-final-oracle-database/ Could you have a look at my presentation and give your opinion on whether we want shared disk scale-out for Postgres? Regards Takayuki Tsunakawa
Hello,
It seems you didn't include pgsql-hackers.
From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>
> I saw the presentation and it is great, except that it seems unclear, for both SD and SN, whether the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two works (ref below) have some information about the usefulness of this technique for scale out, so it would be an interesting addition to see if the SN architecture that is being proposed could be modified to take care of this and reap the gain.
Thanks. Separation of compute and storage surely needs to be considered. Unlike the old days when shared storage was considered a bottleneck with slow HDDs and FC-SAN, we can now expect high-speed shared storage thanks to flash memory, NVMe-oF, and RDMA.
> 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
> Transactional Record Manager for Shared Flash. In CIDR 2011.
This is interesting. I'll look into it. Do you know of any product based on Hyder? OTOH, Hyder seems to require drastic changes when adapted for Postgres -- OCC, a log-structured database, etc. I'd like to hear how feasible those are. However, its scale-out capability without the need for data or application partitioning appears appealing.
To explore another possibility that would have more affinity with the current Postgres, let me introduce our proprietary product called Symfoware. It's not based on Postgres.
It has shared nothing scale-out functionality with full ACID based on 2PC, conventional 2PL locking and distributed deadlock resolution. Despite being shared nothing, all the database files and transaction logs are stored on shared storage.
The database is divided into "log groups." Each log group has one transaction log and multiple tablespaces (it's called "database space" instead of tablespace.)
Each DB instance in the cluster owns multiple log groups, and handles reads/writes to the data in the log groups it owns. When a DB instance fails, the surviving DB instances take over the log groups of the failed DB instance, recover the data using the transaction log of each log group, and accept reads/writes to the data in the log group. The DBA configures which DB instance initially owns which log groups and which DB instances are candidates to take over which log groups.
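A minimal sketch of the ownership and takeover scheme just described (my illustration with made-up names; Symfoware's real implementation is of course far more involved):
--------------------------------------------------
#include <stdio.h>

#define NLOGGROUPS 4
#define NINSTANCES 2

typedef struct
{
    int id;
    int owner;                   /* DB instance owning this log group */
    int candidates[NINSTANCES];  /* DBA-configured takeover order */
} LogGroup;

/* On instance failure: every log group owned by the failed instance
 * is recovered from its own transaction log by the first surviving
 * candidate, which then accepts reads/writes for the group's data. */
static void take_over(LogGroup *groups, int n, int failed,
                      const int *alive)
{
    for (int i = 0; i < n; i++)
    {
        if (groups[i].owner != failed)
            continue;
        for (int c = 0; c < NINSTANCES; c++)
        {
            int cand = groups[i].candidates[c];

            if (alive[cand])
            {
                printf("log group %d: replay log, new owner %d\n",
                       groups[i].id, cand);
                groups[i].owner = cand;
                break;
            }
        }
    }
}

int main(void)
{
    LogGroup groups[NLOGGROUPS] = {
        {0, 0, {1, 0}}, {1, 0, {1, 0}},
        {2, 1, {0, 1}}, {3, 1, {0, 1}},
    };
    int alive[NINSTANCES] = {1, 0};    /* instance 1 has failed */

    take_over(groups, NLOGGROUPS, 1, alive);
    return 0;
}
--------------------------------------------------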
This way, no server is idle as a standby. All DB instances work hard to process read-write transactions. This "no idle server for HA" is one of the things Oracle RAC users want in terms of cost.
However, it still requires data and application partitioning, unlike Hyder. Can anyone think of a way to eliminate partitioning? Data and application partitioning is what Oracle RAC users want to avoid or cannot tolerate.
Ref: Introduction of the Symfoware shared nothing scale-out called "load share."
https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf
Regards
Takayuki Tsunakawa
--- below is Sumanta's original mail ---
From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>
Sent: Wednesday, June 17, 2020 5:34 PM
To: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com>
Cc: Bruce Momjian <bruce@momjian.us>; Merlin Moncure <mmoncure@gmail.com>; Robert Haas <robertmhaas@gmail.com>; maumau307@gmail.com
Subject: Re: I'd like to discuss scaleout at PGCon
Hello,
I saw the presentation and it is great, except that it seems unclear, for both SD and SN, whether the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two works (ref below) have some information about the usefulness of this technique for scale out, so it would be an interesting addition to see if the SN architecture that is being proposed could be modified to take care of this and reap the gain.
1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
Transactional Record Manager for Shared Flash. In CIDR 2011.
2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb.blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html
Hello all, I've overhauled the scaleout design wiki I presented at the PGCon 2018 developer unconference and assembled research on other DBMSs' scale-out features. Scaleout Design https://wiki.postgresql.org/wiki/Scaleout_Design I have intentionally put little conclusion on our specification and design. I'd like you to look at recent distributed databases, and then think about and discuss what we want to aim for together. I feel it's better to separate a thread per topic or group of topics. I'm sorry, but I'm not confident about the readability at all, because I cannot draw figures due to my visual impairment, and the page is full of text only. What workload do you think we should focus on first, OLTP or analytics? I think OLTP, because open source Postgres has so far been gaining popularity mainly with OLTP. Also, I expect many people will use existing popular SaaS for data warehousing like Amazon Redshift, Azure Synapse, Google BigQuery and Snowflake, rather than build their analytics databases on public IaaS or on-premises. Regards MauMau
From: MauMau <maumau307@gmail.com> > I have intentionally put little conclusion on our specification and > design. I'd like you to look at recent distributed databases, and > then think about and discuss what we want to aim for together. I feel > it's better to separate a thread per topic or group of topics. Finally, MariaDB is now equipped with a scale-out feature. How MariaDB achieves global scale with Xpand https://www.infoworld.com/article/3574077/how-mariadb-achieves-global-scale-with-xpand.html I haven't read its documentation, and am not planning to read it now, but the feature looks nice. I hope this will also be a good stimulus. I believe we should be aware of competitiveness when designing -- "If we adopt this architecture or design to simplify the first version, will it really evolve naturally into a competitive product in the future without distorting the concept, design, and interface?" Regards Takayuki Tsunakawa