Thread: Horizontal scalability/sharding
I have recently increased my public statements about the idea of adding horizontal scaling/sharding to Postgres. I wanted to share with hackers a timeline of how we got here, and where I think we are going in the short term:

2012-2013: As part of writing my scaling talk (http://momjian.us/main/presentations/overview.html#scaling), studying Oracle RAC, and talking to users, it became clear that an XC-like architecture (sharding) was the only architecture that was going to allow for write scaling. Users and conference attendees I talked to were increasingly concerned about the ability of Postgres to scale for high write volumes. They didn't necessarily need that scale now, but they needed to know they could get it if they wanted it, and wouldn't need to switch to a new database in the future. This is similar to wanting a car that can get you onto a highway on-ramp fast --- even if you don't need it, you want to know it is there.

2014: I started to shop around the idea that we could use FDWs, parallelism, and a transaction/snapshot manager to get XC features built into Postgres. (I don't remember where the original idea came from.) It was clear that having separate forks of the source code in XC and XL was never going to achieve critical mass --- there just aren't enough people who need high write scale right now, and the fork maintenance overhead is a huge burden. I realized that we would never get community acceptance to dump the XC (or XL) code needed for sharding into community Postgres, but with FDWs, we could add the features as _part_ of improving FDWs, which would benefit FDWs _and_ would be useful for sharding. (We already see some of those FDW features in 9.5.)

October, 2014: EDB and NTT started working together in the community to start improving FDWs as a basis for an FDW-based sharding solution. Many of the 9.5 FDW improvements that also benefit sharding were developed by a combined EDB/NTT team. The features improved FDWs independent of sharding, so they didn't need community buy-in on sharding to get them accepted.

June, 2015: I attended the PGCon sharding unconference session, and there was a huge discussion about where we should go with sharding. I think the big take-away was that most people liked the FDW approach, but had business/customer reasons for wanting to work on XC or XL because those would be production-ready faster.

July, 2015: Oleg Bartunov and his new company Postgres Professional (PP) started to think about joining the FDW approach, rather than working on XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in meeting is scheduled for September 1.

August, 2015: While speaking at SFPUG, Citus Data approached me about joining the FDW sharding team. They have been invited to the September 1 meeting, as have the XC and XL people.

October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding at the PG-EU conference in Vienna. Everyone is invited, but it is hoped most of the September 1 folks can attend.

February, 2016: Oleg is planning a similar meeting at their February Moscow conference.

Anyway, I wanted to explain the work that has been happening around sharding. As things move forward, I am increasingly convinced that write scaling will be needed soon, that the XC approach is the only reasonable way to do it, and that FDWs are the cleanest way to get it into community Postgres.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.
Glad to see it. Many people have been pushing such things for years, so it is good to finally see some debate about this on Hackers.
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
What other architectures were discussed? Where was that discussion?
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.
+1
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I personally support the view that we should put scalability features into Postgres core, rather than run separate forks.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres
How or why did you realize that? There has never been any such discussion, AFAIK. Surely it can be possible to move required subsystems across?
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
That is a huge presumption. Not discussed or technically analyzed in any way with the community.
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
Cough, cough. You must surely be joking that "most people liked the FDW approach"? How did we measure the acceptance of this approach?
What actually is the FDW approach? Since it's not been written down anywhere, or even explained verbally, how can anyone actually agree to it?
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards. Could invites please be sent to me, Pavan Deolasee, and Petr Jelinek as well? I'll pass them on to others also.
Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions. Presumably EDB is invited also? If Koichi is a leading organizer of this, why are there two meetings?
October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.
February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.
Anyway, I wanted to explain the work that has been happening around
sharding.
Thanks
As things move forward, I am increasingly convinced that write
scaling will be needed soon,
+1
that the XC approach is the only reasonable way to do it,
and that FDWs are the cleanest way to get it into community
Postgres.
Those two things aren't at all obvious to me.
Please don't presume my opposition. If the technical information were made public, I might understand and agree with "the FDW approach", perhaps others also. 2ndQuadrant is certainly happy to become involved in any team aiming to add features to Postgres core, as long as that makes sense. There may be areas we can all agree upon even if the full architecture remains in doubt.
Before the community commits to a long term venture together we should see the plan. Like all IT projects, expensive failure is possible and the lack of a design is a huge flashing red warning light for me at present. If that requires a meeting of all Developers, why are the meetings for this specifically not happening at the agreed Developer meetings?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Aug 30, 2015 at 5:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
A little correction about Postgres Professional. We are concentrating on the idea of having one distributed transaction manager --- originally called DTM, now given the better name XTM --- which is neutral with respect to the actual cluster implementation. For example, we are testing it with XL ported to 9.4, but we were planning to extend the tests to pg_shard and postgres_fdw. My idea was to have at least XTM committed for 9.6, so all parties could work on their implementations much more easily.
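To make the role of a cluster-neutral transaction manager concrete, here is a toy sketch of the idea: a coordinator that makes one logical transaction atomic across several shards via two-phase commit. This is purely illustrative Python with invented names (`Shard`, `ToyDTM`); the real XTM is a C-level hook API inside Postgres, not anything like this code.

```python
# Illustrative sketch only: a minimal distributed transaction coordinator
# in the spirit of a pluggable transaction manager.  All names are
# hypothetical; this is not XTM or any Postgres code.

class Shard:
    """A stand-in for one Postgres node holding part of the data."""
    def __init__(self):
        self.data = {}      # committed, visible rows
        self.pending = {}   # txid -> staged writes

    def prepare(self, txid, writes):
        # Phase 1: stage the writes; a real node would fsync a PREPARE record.
        self.pending[txid] = dict(writes)
        return True

    def commit(self, txid):
        # Phase 2: make the staged writes visible.
        self.data.update(self.pending.pop(txid))

    def abort(self, txid):
        self.pending.pop(txid, None)

class ToyDTM:
    """Coordinates one atomic transaction across several shards."""
    def __init__(self, shards):
        self.shards = shards
        self.next_txid = 0

    def run(self, writes_per_shard):
        txid = self.next_txid = self.next_txid + 1
        touched = list(writes_per_shard)
        # Commit everywhere only if every shard prepared successfully.
        if all(self.shards[i].prepare(txid, w)
               for i, w in writes_per_shard.items()):
            for i in touched:
                self.shards[i].commit(txid)
            return True
        for i in touched:
            self.shards[i].abort(txid)
        return False

shards = [Shard(), Shard()]
dtm = ToyDTM(shards)
dtm.run({0: {"a": 1}, 1: {"b": 2}})   # atomic across both shards
```

The point of keeping the manager behind an interface like this is exactly what Oleg describes: XL, pg_shard, or an FDW-based coordinator could each plug in the same commit protocol without caring how the shards are reached.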
At PGCon we agreed to have such a meeting in Vienna at least. But I think we should be prepared and try to clear up all our issues before then. It looks like we are already out of time, but perhaps we could meet in Hong Kong?
Honestly, I still don't know which approach is better. We have already played with XL (ported to 9.4) and identified some very strong inconsistency issues, which scared us, especially taking into account how easily we found them. The XC people have fixed them, but I'm not sure whether they were fundamental, or whether we could construct more sophisticated tests and find more issues in XC/XL. We are also a bit disappointed by Huawei's position on the CSN patch, which we hoped to use for our XTM. The FDW approach has been actively criticized by the pg_shard people, and that also made me a bit suspicious. It looks like we are doomed to continue several development forks, so we decided to work on a very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6. Now I see we were right, unfortunately.
Again, could we organize a meeting somewhere in September? The US is not good for us, but other places should be OK. I want to have an agreement at least on XTM. We are still testing various approaches, though. We could present the results of our experiments and are open to discussion. It's not an easy project, but it's something we could do for 9.6.
I'm very glad Bruce started this discussion in -hackers, since it's silly to me to participate in both threads :) Let's meet in September !
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
> On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
> I have recently increased my public statements about the idea of adding
> horizontal scaling/sharding to Postgres.
>
> Glad to see it. Many people have been pushing such things for years, so it
> is good to finally see some debate about this on Hackers.

Agreed. Right now, in our community, we are only seeing users who are happy
with what Postgres offers but think they might need massive horizontal
scalability in the future. I think there is a larger group that cares about
massive horizontal scalability, but those people are using other software
right now, so we don't see them yet. Without a roadmap for built-in massive
horizontal scalability, I think Postgres adoption will eventually suffer.

> I wanted to share with hackers a timeline of how we got here, and where I
> think we are going in the short term:
>
> 2012-2013: As part of writing my scaling talk
> (http://momjian.us/main/presentations/overview.html#scaling), studying
> Oracle RAC, and talking to users, it became clear that an XC-like
> architecture (sharding) was the only architecture that was going to allow
> for write scaling.
>
> What other architectures were discussed? Where was that discussion?

That was mostly my conclusion. I explained it to small groups at conferences
and Postgres user groups. No one said I was wrong, but that is about the
level of debate I had.

> 2014: I started to shop around the idea that we could use FDWs,
> parallelism, and a transaction/snapshot manager to get XC features
> built into Postgres. (I don't remember where the original idea came
> from.) It was clear that having separate forks of the source code in XC
> and XL was never going to achieve critical mass --- there just aren't
> enough people who need high write scale right now, and the fork
> maintenance overhead is a huge burden.
>
> I personally support the view that we should put scalability features into
> Postgres core, rather than run separate forks.

Good, I do think it is time, but as I stated above, there is limited
interest in our current community, so the tolerance for additional community
code to accomplish this is also limited. This is the big thing that had me
excited about using FDWs --- FDW improvements can get us closer to sharding
without requiring community acceptance of sharding-only features.

> I realized that we would never get community acceptance to dump the XC
> (or XL) code needed for sharding into community Postgres
>
> How or why did you realize that? There has never been any such discussion,
> AFAIK. Surely it can be possible to move required subsystems across?

Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.

> , but with FDWs, we could add the features as _part_ of improving FDWs,
> which would benefit FDWs _and_ would be useful for sharding. (We already
> see some of those FDW features in 9.5.)
>
> That is a huge presumption. Not discussed or technically analyzed in any
> way with the community.

True. It seemed pretty obvious to me.

> October, 2014: EDB and NTT started working together in the community
> to start improving FDWs as a basis for an FDW-based sharding solution.
> Many of the 9.5 FDW improvements that also benefit sharding were developed
> by a combined EDB/NTT team. The features improved FDWs independent of
> sharding, so they didn't need community buy-in on sharding to get them
> accepted.
>
> June, 2015: I attended the PGCon sharding unconference session and
> there was a huge discussion about where we should go with sharding.
> I think the big take-away was that most people liked the FDW approach,
> but had business/customer reasons for wanting to work on XC or XL because
> those would be production-ready faster.
>
> Cough, cough. You must surely be joking that "most people liked the FDW
> approach"? How did we measure the acceptance of this approach?

Well, I didn't have my audience-meter with me at the time. ;-) The
discussion was mostly in the hallway after the unconference session,
"Future of PostgreSQL shared-nothing cluster" by Konstantin Knizhnik,
Alexander Korotkov, and Oleg Bartunov. Again, when I explained the ability
to use FDWs to get sharding into Postgres with minimal additional code, no
one said the idea was crazy, which I took as a big thumbs-up! When I asked
why to continue with XC/XL, I was told those were more mature and more
customer-ready, which is true. I will not quote people from the hallway
discussion for privacy reasons.

> What actually is the FDW approach? Since it's not been written down
> anywhere, or even explained verbally, how can anyone actually agree to it?

Well, my sharding talk just has the outlines of an approach. I think there
are five broad segments:

* FDW push-down of joins, sorts, and aggregates
* ability to send FDW requests in parallel
* transaction/snapshot manager to allow ACID transactions on shards
* simpler user partitioning API
* infrastructure to manage shards, including replicated tables used for joins

> July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
> started to think about joining the FDW approach, rather than working on
> XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
> meeting is scheduled for September 1.
>
> August, 2015: While speaking at SFPUG, Citus Data approached me about
> joining the FDW sharding team. They have been invited to the September
> 1 meeting, as have the XC and XL people.
>
> 2ndQuadrant is working in this area, specifically bringing XL 9.5
> forwards.
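The first two segments above can be sketched conceptually: a coordinator hash-routes rows to shards and pushes an aggregate down to each shard in parallel, so only partial results cross the wire. This is a toy Python illustration of the idea, not Postgres code; `ToyShard` and `Coordinator` are invented names standing in for foreign servers and the FDW layer.

```python
# Toy sketch of FDW-style sharding: hash routing, aggregate push-down,
# and parallel dispatch.  Conceptual only; nothing here is real FDW code.
from concurrent.futures import ThreadPoolExecutor

class ToyShard:
    """Stands in for one foreign server reached via something like postgres_fdw."""
    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def partial_sum(self, column):
        # The "push-down": the shard computes its own aggregate locally,
        # so one number crosses the wire instead of every row.
        return sum(row[column] for row in self.rows)

class Coordinator:
    def __init__(self, nshards):
        self.shards = [ToyShard() for _ in range(nshards)]

    def insert(self, row, key):
        # Hash routing: the partitioning layer decides which shard owns the row.
        self.shards[hash(row[key]) % len(self.shards)].insert(row)

    def total(self, column):
        # Parallel dispatch: ask every shard at once, then combine partials.
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda s: s.partial_sum(column), self.shards)
        return sum(partials)

coord = Coordinator(nshards=4)
for i in range(100):
    coord.insert({"id": i, "amount": i}, key="id")
print(coord.total("amount"))  # 4950, regardless of how rows were routed
```

The remaining segments (snapshot management, the partitioning API, shard metadata) are what make this hard in practice; the sketch only shows why push-down plus parallel dispatch is the performance-critical core.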
Yes, I saw the blog post about that:

http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/

> Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek
> also? I'll pass on to others also.

OK, I will send you a separate email and you can then supply their email
addresses.

> Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
> Presumably EDB is invited also? If Koichi is a leading organizer of this,
> why are there two meetings?

I certainly have heard nothing about it, except third-hand people telling
me a meeting is happening. I assumed those meetings were XC/XL-specific.

> that the XC approach is the only reasonable way to do it,
> and that FDWs are the cleanest way to get it into community
> Postgres.
>
> Those two things aren't at all obvious to me.
>
> Please don't presume my opposition. If the technical information were made
> public, I might understand and agree with "the FDW approach", perhaps
> others also.

Well, the beauty of my approach is that we didn't need any technical
direction or buy-in on sharding from the community to improve FDWs. I think
now is the right time to try to get that buy-in, or adjust our approach.
There isn't really much more to my _analysis_ than I presented. There is
certainly a lot more work to do to even decide this is the right approach.
Some of the groups already involved have more experience in trying this,
e.g. Citus Data.

> 2ndQuadrant is certainly happy to become involved in any team aiming to
> add features to Postgres core, as long as that makes sense. There may be
> areas we can all agree upon even if the full architecture remains in
> doubt.

Right.

> Before the community commits to a long term venture together we should see
> the plan. Like all IT projects, expensive failure is possible and the lack
> of a design is a huge flashing red warning light for me at present. If
> that requires a meeting of all Developers, why are the meetings for this
> specifically not happening at the agreed Developer meetings?

Well, what meetings should it be at? I don't think there was clear enough
direction for the June 2015 PGCon meeting. Is there an unconference in
Vienna? One thing I saw at the last PGCon is that this is a big topic, so I
think having a dedicated room and a 3-hour slot for it is nice.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sun, Aug 30, 2015 at 10:36:23PM +0300, Oleg Bartunov wrote:
> Honestly, I still don't know which approach is better. We have already
> played with XL (ported to 9.4) and identified some very strong
> inconsistency issues, which scared us, especially taking into account how
> easily we found them. The XC people have fixed them, but I'm not sure
> whether they were fundamental, or whether we could construct more
> sophisticated tests and find more issues in XC/XL. We are also a bit
> disappointed by Huawei's position on the CSN patch, which we hoped to use
> for our XTM. The FDW approach has been actively criticized by the pg_shard
> people, and that also made me a bit suspicious.

Yep, that has me concerned too. The pg_shard people will be on the
September 1 call and are working on a Google document to explain their
concerns about FDWs for sharding.

> It looks like we are doomed to continue several development forks, so we
> decided to work on a very important common project, XTM, which we hoped
> could be accepted by all parties and eventually committed to 9.6. Now I
> see we were right, unfortunately.

Yes, the ability to add independent parts that can eventually be used for
sharding is a strong indication that doing this incrementally is a good
approach.

> Again, could we organize a meeting somewhere in September? The US is not
> good for us, but other places should be OK. I want to have an agreement at
> least on XTM. We are still testing various approaches, though. We could
> present the results of our experiments and are open to discussion. It's
> not an easy project, but it's something we could do for 9.6.

Good. XTM is a must-have for several use-cases, including sharding.

> I'm very glad Bruce started this discussion in -hackers, since it's silly
> to me to participate in both threads :) Let's meet in September !

In summary, I think we need to start working on built-in sharding, and FDWs
are the only way I can see to do it with minimal code changes, which I
think might be a community requirement. It might not work, but right now,
it is the only possible approach I can see.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 7:29 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
> I realized that we would never get community acceptance to dump the XC
> (or XL) code needed for sharding into community Postgres
>
>
> How or why did you realize that? There has never been any such discussion,
> AFAIK. Surely it can be possible to move required subsystems across?
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.
Or perhaps people invested in this area had other obligations, or lacked the motivation and/or time to push for things in core. That's not possible to know, and what is done is done.
> July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
> started to think about joining the FDW approach, rather than working on
> XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
> meeting is scheduled for September 1.
>
> August, 2015: While speaking at SFPUG, Citus Data approached me about
> joining the FDW sharding team. They have been invited to the September
> 1 meeting, as have the XC and XL people.
>
>
> 2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards.
Yes, I saw the blog post about that:
http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/
> Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also?
> I'll pass on to others also.
OK, I will send you a separate email and you can then supply their email
addresses.
FWIW, I would be interested in that as well. I worked in this area of things for a couple of years myself.
> Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
> Presumably EDB is invited also? If Koichi is a leading organizer of this, why
> are there two meetings?
I certainly have heard nothing about it, except third-hand people
telling me a meeting is happening. I assumed those meetings were
XC/XL-specific.
Yep, that's my understanding as well, and AFAIK things have been carried out this way until now, i.e. XC/XL and Postgres core are aimed to live as separate communities.
Michael
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> Well, I have had many such discussions with XC/XL folks, and that was my
> opinion. I have seen almost no public discussion about this because the
> idea had almost no chance of success. If it was possible, someone would
> have already suggested it on this list.
>
> Or perhaps people invested in this area had other obligations, or lacked
> the motivation and/or time to push for things in core. That's not possible
> to know, and what is done is done.

Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why we
would _not_ go in the FDW direction. Of course, people have concerns, and
FDWs might need to be improved, but it is something worth researching. We
might find out FDWs can't be used at all, and that we have to either add
much more code to Postgres to do sharding, do something like pg_shard, or
not implement built-in sharding at all, but at least it is time to research
this.

> OK, I will send you a separate email and you can then supply their email
> addresses.
>
> FWIW, I would be interested in that as well. I worked in this area of
> things for a couple of years myself.

OK, I will send you an email.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> > Well, I have had many such discussions with XC/XL folks, and that was
> > my opinion. I have seen almost no public discussion about this because
> > the idea had almost no chance of success. If it was possible, someone
> > would have already suggested it on this list.
> >
> > Or perhaps people invested in this area had other obligations or
> > lacked motivation and/or time to work to push up for things in core.
> > That's not possible to know, and what is done is done.
>
> Well, I have talked to everyone privately about this, and concluded
> that while horizontal scalability/sharding is useful, it is unlikely
> that the code volume of something like XC or XL would be accepted into
> the community, and frankly, now that we have FDWs, it is hard to
> imagine why we would _not_ go in the FDW direction.

Actually, there was hope that XC or XL would get popular enough that it would justify adding their code into community Postgres, but that never happened.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 11:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> > Well, I have had many such discussions with XC/XL folks, and that was
> > my opinion. I have seen almost no public discussion about this because
> > the idea had almost no chance of success. If it was possible, someone
> > would have already suggested it on this list.
> >
> > Or perhaps people invested in this area had other obligations or
> > lacked motivation and/or time to work to push up for things in core.
> > That's not possible to know, and what is done is done.
>
> Well, I have talked to everyone privately about this, and concluded
> that while horizontal scalability/sharding is useful, it is unlikely
> that the code volume of something like XC or XL would be accepted into
> the community, and frankly, now that we have FDWs, it is hard to
> imagine why we would _not_ go in the FDW direction.

If I recall correctly in terms of numbers, that's indeed 40k lines of code, the main areas of XC code being the GTM, the planner changes for expression and join push-down, and the connection pooler for parallel query execution. ISTM that FDW is one portion of the puzzle; there are other pieces that could be used toward an in-core integration, like the parallel stuff Amit Kapila is working on to allow remote query execution in parallel with local scans. Also, XC/XL performed well on OLTP thanks to the connection pooler: this should indeed be part of the FDW portion managing the foreign scans. This may sound like a minor issue compared to the others, but already-established connections help a lot when scaling out with foreign servers.

> Of course, people have concerns, and FDWs might need to be improved,
> but it is something worth researching. We might find out FDWs can't be
> used at all, and that we have to either add much more code to Postgres
> to do sharding, do something like pg_shard, or not implement built-in
> sharding at all, but at least it is time to research this.

I am really looking forward to hearing the arguments of the authors of pg_shard on the matter.

> > OK, I will send you a separate email and you can then supply their
> > email addresses.
> >
> > FWIW, I would be interested in that as well. I worked in this area of
> > things for a couple of years as well FWIW.
>
> OK, I will send you an email.

Thanks.
--
Michael
On Mon, Aug 31, 2015 at 11:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
>> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
>> > Well, I have had many such discussions with XC/XL folks, and that
>> > was my opinion. I have seen almost no public discussion about this
>> > because the idea had almost no chance of success. If it was
>> > possible, someone would have already suggested it on this list.
>> >
>> > Or perhaps people invested in this area had other obligations or
>> > lacked motivation and/or time to work to push up for things in core.
>> > That's not possible to know, and what is done is done.
>>
>> Well, I have talked to everyone privately about this, and concluded
>> that while horizontal scalability/sharding is useful, it is unlikely
>> that the code volume of something like XC or XL would be accepted into
>> the community, and frankly, now that we have FDWs, it is hard to
>> imagine why we would _not_ go in the FDW direction.
>
> Actually, there was hope that XC or XL would get popular enough that it
> would justify adding their code into community Postgres, but that never
> happened.

Forks are doomed to die without proper maintenance resources. Still, for XC/XL, what does not help is the complexity of the architecture and SPOF management, particularly the GTM, which was something completely new and not well understood (there is a GTM standby, but this model is weak IMO, does not scale the way ordinary standbys do, and impacts the overall performance of the cluster).
--
Michael
On Sun, Aug 30, 2015 at 7:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> I have recently increased my public statements about the idea of adding
> horizontal scaling/sharding to Postgres. I wanted to share with hackers
> a timeline of how we got here, and where I think we are going in the
> short term:
>
> 2012-2013: As part of writing my scaling talk
> (http://momjian.us/main/presentations/overview.html#scaling), studying
> Oracle RAC, and talking to users, it became clear that an XC-like
> architecture (sharding) was the only architecture that was going to allow
> for write scaling.
>
I think a sharding-like architecture is quite useful for certain kinds of
workloads, where users can arrange queries and data layout in an optimized
way (and, one hopes, are willing to change them if required). One thing to
consider here is what kind of scaling we expect from such a system, and
whether it is sufficient, given that we will stay focused on this
architecture for horizontal scalability.
Generally speaking, scaling in such systems is limited by the number of
profitable partitions a user can create based on the data, and cross-partition
transactions then hurt performance/scalability. I understand that there is
definitely a benefit in proceeding with a sharding-like architecture, as
there are already some PostgreSQL-based forks which use such an
architecture, so if we follow the same path we can save some effort rather
than inventing or adopting some other architecture; however, there is no
harm in discussing the pros and cons of other architectures like Oracle RAC,
Google F1, or others.
At PGCon we agreed to have such a meeting in Vienna, at least. But I think we should be prepared, and try to clear up all our issues beforehand. It looks like we are already out of time, but perhaps we could meet in Hong Kong?
Honestly, I still don't know which approach is better. We already played with XL (ported to 9.4) and identified some very serious consistency issues, which scared us, especially considering how easily we found them. The XC people have fixed them, but I'm not sure whether they were fundamental, or whether we could construct more sophisticated tests and find more issues in XC/XL. We were also a bit disappointed by Huawei's position on the CSN patch, which we hoped to use for our XTM. The FDW approach has been actively criticized by the pg_shard people, and that has also made me a bit suspicious. It looks like we are doomed to continue several development forks, so we decided to work on a very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6. Now I see we were right, unfortunately.
A distributed transaction manager should support at least three things:
1. Atomic commit
2. Atomic visibility
3. Consistent snapshots (e.g. required for repeatable reads and higher isolation levels).
I have submitted a patch implementing the first of these for FDWs. The patch adds infrastructure to be used by all FDWs, including postgres_fdw; it also adds postgres_fdw code to use this infrastructure. The same can be used to achieve atomic commit in postgres_fdw-based sharding. Please see if XTM can benefit from it. If there are things that XTM requires, please post the requirements on that thread and I will work on them. You can find the latest patch at http://www.postgresql.org/message-id/CAFjFpRfANWL53+x2HdM9TCNe5pup=oPkQSSJ-KGfr-d2efj+CQ@mail.gmail.com
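For readers less familiar with atomic commit across nodes, the idea behind two-phase commit is easy to sketch. The following is a toy illustration only (the `Shard` class and its methods are made up for this sketch and are not part of any PostgreSQL API); the comments note the SQL commands the steps would correspond to on a real remote server.

```python
# Toy sketch of two-phase commit across shards. All names here are
# illustrative; on real PostgreSQL nodes the steps would map to
# PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED.

class Shard:
    def __init__(self, name):
        self.name = name
        self.prepared = False
        self.committed = False

    def prepare(self):           # phase 1: like PREPARE TRANSACTION
        self.prepared = True
        return True

    def commit_prepared(self):   # phase 2: like COMMIT PREPARED
        assert self.prepared
        self.committed = True

    def rollback(self):          # abort: like ROLLBACK PREPARED
        self.prepared = False


def two_phase_commit(shards):
    """Commit atomically across all shards, or roll back everywhere."""
    prepared = []
    try:
        for shard in shards:
            if not shard.prepare():
                raise RuntimeError(f"prepare failed on {shard.name}")
            prepared.append(shard)
    except RuntimeError:
        # Phase 1 failed somewhere: undo every successful prepare.
        for shard in prepared:
            shard.rollback()
        return False
    # Phase 2: every shard prepared, so committing everywhere is safe.
    for shard in shards:
        shard.commit_prepared()
    return True
```

The key property is that once every participant has prepared, the commit decision is durable and can be replayed after a crash, which is what makes the commit atomic across nodes.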
Again, could we organize a meeting somewhere in September? The US is not good for us, but other places should be OK. I want to have an agreement at least on XTM. We are still testing various approaches, though. We could present the results of our experiments and are open to discussion. It's not an easy project, but it's something we could do for 9.6. I'm very glad Bruce started this discussion on -hackers, since it's silly to participate in both threads :) Let's meet in September!
--Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 31, 2015 at 5:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> > Well, I have had many such discussions with XC/XL folks, and that was my
> > opinion. I have seen almost no public discussion about this because the
> > idea had almost no chance of success. If it was possible, someone would
> > have already suggested it on this list.
> >
> >
> > Or perhaps people invested in this area had other obligations or lacked
> > motivation and/or time to work to push up for things in core. That's not
> > possible to know, and what is done is done.
>
> Well, I have talked to everyone privately about this, and concluded that
> while horizontal scalability/sharding is useful, it is unlikely that the
> code volume of something like XC or XL would be accepted into the
> community, and frankly, now that we have FDWs, it is hard to imagine why
> we would _not_ go in the FDW direction.
Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.
AFAIK, XC/XL already has some customers, and that is additional pressure on their development team, which is now called X2. I don't know exactly how Huawei's internal MPPDB is connected to XC/XL.
We need a community test suite for clusters, and our company is working on this. It's non-trivial work, but the community will never accept any cluster solution without thorough testing of functionality and performance. Our XC/XL experience was not good.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
We were also a bit disappointed by Huawei's position on the CSN patch, which we hoped to use for our XTM.
Disappointed in what way? Moving to some sort of CSN approach seems to open things up for different future ideas. In the short term, it would mean replacing potentially large snapshots and longer visibility checks. In the long term, perhaps CSN could help simplify the design of multi-master replication schemes.
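To make the snapshot-size trade-off concrete, here is a toy sketch (purely illustrative; not PostgreSQL internals, and all function names are made up) of why a CSN-based snapshot is lighter than a classic xid-list snapshot:

```python
# Illustrative comparison of two MVCC visibility schemes. With xid-list
# snapshots, a snapshot must carry the set of in-progress transaction
# ids; with commit sequence numbers (CSN), a snapshot is one number.

def visible_with_xid_snapshot(row_xid, xmin, xmax, running_xids):
    """Classic scheme: a row's xid is visible if it committed before
    the snapshot was taken, i.e. it is below xmax and was not still
    running when the snapshot was taken."""
    if row_xid < xmin:
        return True            # older than every running transaction
    if row_xid >= xmax:
        return False           # started after the snapshot
    return row_xid not in running_xids   # potentially a large set

def visible_with_csn_snapshot(row_commit_csn, snapshot_csn):
    """CSN scheme: a row version is visible if its transaction's
    commit sequence number is at or below the snapshot's CSN."""
    if row_commit_csn is None: # transaction not committed yet
        return False
    return row_commit_csn <= snapshot_csn
```

The CSN check needs no list of running transactions at all, which is what makes snapshots cheap to take and ship between nodes; the real cost moves to looking up a transaction's commit CSN.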
FDW approach has been actively criticized by pg_shard people and that's also made me a bit suspicious. It looks like we are doomed to continue several development forks, so we decided to work on very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6. Now I see we were right, unfortunately.
I think the original XC project probably would have taken the FDW approach as a basis if it had existed, with focus on push-down optimizations.
I assume that future work around PG sharding probably would be more likely to be accepted with the FDW approach. One could perhaps work on pushing down joins, aggregates and order by, then look at any optimizations gained if code is moved outside of FDW. It would make sense if some kind of generic optimization for foreign tables for SQL-based sources could be leveraged across all databases, rather than having to re-implement for each FDW.
There are different approaches and related features that may need to be improved.
Do we want multiple copies of shards, like the pg_shard approach? Or keep things simpler and leave it up to the DBA to add standbys?
Do we want to leverage table inheritance? If so, we may want to spend time improving performance for when the number of shards becomes large with what currently exists. If using table inheritance, we could add the ability to specify what node (er, foreign server) the subtable lives on. We could create top level sharding expressions that allow these to be implicitly created.
Should we allow arbitrary expressions for shards, not just range, list and hash?
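For concreteness, the three routing strategies named above (range, list, hash) can be sketched in a few lines. This is a hypothetical illustration only; the boundaries, shard counts, and function names are invented and reflect no actual PostgreSQL syntax or code.

```python
# Toy shard-routing sketch for range, list, and hash strategies.
import zlib

def route_by_range(value, boundaries):
    """Range sharding: shard i holds values below boundaries[i];
    the last shard catches everything at or above the top boundary."""
    for i, upper in enumerate(boundaries):
        if value < upper:
            return i
    return len(boundaries)

def route_by_list(value, value_lists):
    """List sharding: each shard owns an explicit set of values."""
    for i, values in enumerate(value_lists):
        if value in values:
            return i
    raise KeyError(f"no shard owns {value!r}")

def route_by_hash(value, num_shards):
    """Hash sharding: a stable hash of the key, modulo shard count."""
    return zlib.crc32(repr(value).encode()) % num_shards
```

An "arbitrary expression" strategy would generalize these to any user-supplied function from key to shard number, which is why it raises planner and validation questions the three built-in strategies avoid.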
Maybe the most community-acceptable approach would look something like
- Use FDWs, and continue to optimize push-down operations, also for non-PostgreSQL databases.
- Use table inheritance for defining the shards. Ideally allow for specifying that some shards may be replicated to other foreign servers (and itself) (for pushing down joins with lookup/static tables; at this point it should be decent for star schema based data warehouses).
- XTM/GTM hooks. Preferably we move to CSN for snapshots in core PostgreSQL though.
Longer term, efficient internode joins would require a lot more work.
The devil is in the details. There are things that have to be addressed; for example, if using global XIDs via a GTM, not every transaction is on every node, so we need to make sure that new clog pages get added properly. There is also the potential to require a lot more code, for things like cursor handling and stored functions. Perhaps some limitations when sharding to foreign servers are acceptable if we want to minimize code changes. The XC and XL code can help here.
Regards,
Mason
On Mon, Aug 31, 2015 at 02:48:31PM -0400, Mason S wrote:
> I assume that future work around PG sharding probably would be more
> likely to be accepted with the FDW approach. One could perhaps work on
> pushing down joins, aggregates and order by, then look at any
> optimizations gained if code is moved outside of FDW. It would make
> sense if some kind of generic optimization for foreign tables for
> SQL-based sources could be leveraged across all databases, rather than
> having to re-implement for each FDW.
>
> There are different approaches and related features that may need to be
> improved.
>
> Do we want multiple copies of shards, like the pg_shard approach? Or
> keep things simpler and leave it up to the DBA to add standbys?

I agree with all of the above.

> Do we want to leverage table inheritance? If so, we may want to spend
> time improving performance for when the number of shards becomes large
> with what currently exists. If using table inheritance, we could add
> the ability to specify what node (er, foreign server) the subtable
> lives on. We could create top level sharding expressions that allow
> these to be implicitly created.
>
> Should we allow arbitrary expressions for shards, not just range, list
> and hash?
>
> Maybe the most community-acceptable approach would look something like

I think everyone agrees that our current partitioning setup is just too verbose and error-prone for users, and needs a simpler interface, and one that can be better optimized internally. I assume FDW-based sharding will benefit from that work as well.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:
> > We were also a bit disappointed by Huawei's position on the CSN patch, which we hoped to use for our XTM.
>
> Disappointed in what way? Moving to some sort of CSN approach seems to open things up for different future ideas. In the short term, it would mean replacing potentially large snapshots and longer visibility checks. In the long term, perhaps CSN could help simplify the design of multi-master replication schemes.
We are disappointed because at their PGCon talk Huawei announced the publication of their CSN patch and further work in this direction together with the community. However, it has not even been published yet, despite all the promises, and nobody from Huawei answers the CSN thread on -hackers.
So, I think we got nothing from Huawei except teasing, and should rely only on ourselves. That is disappointing.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Aug 31, 2015 at 2:12 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> AFAIK, XC/XL has already some customers and that is an additional
> pressure on their development team, which is now called X2. I don't
> exactly know how internal Huawei's MPPDB is connected to XC/XL.

Huawei's MPPDB is based on PG-XC and tailored more toward OLAP scenarios. The basic idea is that OLAP needs a shared-nothing scale-out architecture for both reads and writes. It needs acceptable TP performance and a restricted set of functionality, and thus avoids some problems like the GTM being a central scaling bottleneck. I advocate merging scale-out features into PostgreSQL core, if we are ready to face some long-term functional discrepancies between the two deployment modes.

Regards,
Qingqing
All, Bruce:

First, let me put out there that I think the horizontal scaling project which has buy-in from the community and we're working on is infinitely better than the one we're not working on or is an underresourced fork. So we're in agreement on that. However, I think there's a lot of room for discussion; I feel like the FDW approach was decided in exclusive meetings involving a very small number of people. The FDW approach *may* be the right approach, but I'd like to see some rigorous questioning of that before it's final.

Particularly, I'm concerned that we already have two projects in process aimed at horizontal scalability, and it seems like we could bring either (or both) projects to production quality MUCH faster than we could make an FDW-based solution work. These are:

* pg_shard
* BDR

It seems worthwhile, just as a thought experiment, to ask whether we can get where we want using those, faster, or by combining those with new FDW features.

It's also important to recognize that there are three major use-cases for write-scalable clustering:

* OLTP: small-medium cluster, absolute ACID consistency, bottlenecked on small writes per second
* DW: small-large cluster, ACID optional, bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional, bottlenecked on # of connections

We cannot possibly solve all of the above at once, but to the extent that we recognize all 3 use cases, we can build core features which can be adapted to all of them.

I'm also going to pontificate that, for a future solution, we should not focus on write *IO*, but rather on CPU and RAM. The reason for this thinking is that, with the latest improvements in hardware and 9.5 improvements, it's increasingly rare for machines to be bottlenecked on writes to the transaction log (or the heap). This has some implications for system design. For example, solutions which require all connections to go through a single master node do not scale sufficiently to be worth bothering with.
On some other questions from Mason:

> Do we want multiple copies of shards, like the pg_shard approach? Or
> keep things simpler and leave it up to the DBA to add standbys?

We want multiple copies of shards created by the sharding system itself. Having a separate, and completely orthogonal, redundancy system to the sharding system is overly burdensome on the DBA and makes low-data-loss HA impossible.

> Do we want to leverage table inheritance? If so, we may want to spend
> time improving performance for when the number of shards becomes large
> with what currently exists. If using table inheritance, we could add the
> ability to specify what node (er, foreign server) the subtable lives on.
> We could create top level sharding expressions that allow these to be
> implicitly created.

IMHO, given that we're looking at replacing inheritance because of its many documented limitations, building sharding on top of inheritance seems unwise. For example, many sharding systems are hash-based; how would an inheritance system transparently use hash keys?

> Should we allow arbitrary expressions for shards, not just range, list
> and hash?

That seems like a 2.0 feature. It also doesn't seem necessary to support it for the moderately skilled user; that is, requiring a special C sharding function for this seems fine to me.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/31/2015 01:16 PM, Josh Berkus wrote:
> All, Bruce:
>
> I'm also going to pontificate that, for a future solution, we should
> not focus on write *IO*, but rather on CPU and RAM. The reason for this
> thinking is that, with the latest improvements in hardware and 9.5
> improvements, it's increasingly rare for machines to be bottlenecked on
> writes to the transaction log (or the heap). This has some implications
> for system design. For example, solutions which require all connections
> to go through a single master node do not scale sufficiently to be
> worth bothering with.

We see this already: under very high concurrency (lots of connections, many cores) we often see a significant drop in performance that is not related to IO in any meaningful way.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
> First, let me put out there that I think the horizontal scaling project
> which has buy-in from the community and we're working on is infinitely
> better than the one we're not working on or is an underresourced fork.
> So we're in agreement on that. However, I think there's a lot of room
> for discussion; I feel like the FDW approach was decided in exclusive
> meetings involving a very small number of people. The FDW approach
> *may* be the right approach, but I'd like to see some rigorous
> questioning of that before it's final.

It seems to me that sharding consists of (1) breaking your data set up into shards, (2) possibly replicating some of those shards onto multiple machines, and then (3) being able to access the remote data from local queries. As far as (1) is concerned, we need declarative partitioning, which is being worked on by Amit Langote. As far as (2) is concerned, I hope and expect BDR, or technology derived therefrom, to eventually fill that need. As far as (3) is concerned, why wouldn't we use the foreign data wrapper interface, and specifically postgres_fdw? That interface was designed for the explicit purpose of allowing access to remote data sources, and a lot of work has been put into it, so it would be highly surprising if we decided to throw that away and develop something completely new from the ground up.

It's true that postgres_fdw doesn't do everything we need yet. The new join pushdown hooks aren't used by postgres_fdw yet, and the API itself has some bugs with EvalPlanQual handling. Aggregate pushdown is waiting on upper planner path-ification. DML pushdown doesn't exist yet, and the hooks that would enable pushdown of ORDER BY clauses to the remote side aren't being used by postgres_fdw. But all of these things have been worked on. Patches for many of them have already been posted. They have suffered from a certain amount of neglect by senior hackers, and perhaps also from a shortage of time on the part of the authors. But an awful lot of the work that is needed here has already been done, if only we could get it committed. Aggregate pushdown is a notable exception, but abandoning the foreign data wrapper approach in favor of something else won't fix that.

Postgres-XC developed a purpose-built system for talking to other nodes instead of using the FDW interface, for the very good reason that the FDW interface did not yet exist at the time that Postgres-XC was created. But several people associated with the XC project have said, including one on this thread, that if it had existed, they probably would have used it. And it's hard to see why you wouldn't: with XC's approach, the remote data source is presumed to be PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the facility as part of a sharding solution. The FDW interface can talk to anything, and it can be used for stuff other than sharding, like making one remote table appear local because you just happen to want that for some reason. This makes the XC approach look rather brittle by comparison. I don't blame the XC folks for taking the shortest path between two points, but FDWs are better, and we ought to try to leverage that.

> Particularly, I'm concerned that we already have two projects in process
> aimed at horizontal scalability, and it seems like we could bring either
> (or both) projects to production quality MUCH faster than we could make
> an FDW-based solution work. These are:
>
> * pg_shard
> * BDR
>
> It seems worthwhile, just as a thought experiment, if we can get where
> we want using those, faster, or by combining those with new FDW features.

I think it's abundantly clear that we need a logical replication solution as part of any horizontal scalability story. People will want to do things like have 10 machines with each piece of data on 3 of them, and there won't be any reasonable way of doing that without logical replication. I assume that BDR, or some technology derived from it, will end up in core and solve that problem. I had actually hoped we were going to get that in 9.5, but it didn't happen that way. Still, I think that getting first single-master, and then eventually multi-master, logical replication in core is absolutely critical. And not just for sharding specifically: replicating your whole database to several nodes and load-balancing your clients across them isn't sharding, but it does give you read scalability and is a good fit for people with geographically dispersed data with good geographical locality. I think a lot of people will want that.

I'm not quite sure yet how we can marry declarative partitioning and better FDW-pushdown and logical replication into one seamless, easy to deploy solution that requires very low administrator effort. But I am sure that each of those things, taken individually, is very useful, and that being able to construct a solution from those building blocks would be a big improvement over what we have today. I can't imagine that trying to do one monolithic project that provides all of those things, but only if you combine them in the specific way that the designer had in mind, is ever going to be successful. People _will_ want access to each of those features in an unbundled fashion. And, trying to do them altogether leads to trying to solve too many problems at once. I think the history of Postgres-XC is a cautionary tale there.

I don't really understand how pg_shard fits into this equation. It looks to me like it does some interesting things but, for example, it doesn't support JOIN pushdown, and suggests that you use the proprietary CitusDB engine if you need that. But I think JOIN pushdown is something we want to have in core, not something where we want to point people to proprietary alternatives. And it has some restrictions on INSERT statements - they have to contain only values which are constants or which can be folded to constants. I'm just guessing, but I bet that's probably due to some limitation which pg_shard, being out of core, has difficulty overcoming, but we can do better in core. Basically I guess I expect much of what pg_shard does to be subsumed as we improve FDWs, but maybe not all of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Bruce,
Sumedh from Citus Data here.
> August, 2015: While speaking at SFPUG, Citus Data approached me about joining the FDW sharding team. They have been invited to the September 1 meeting, as have the XC and XL people.
I'd like to add a clarification. We already tried the FDW APIs for pg_shard two years ago and failed. We figured sharing our learnings could contribute to the technical discussion, and that's why we wanted to be on the call.
Ozgun summarized our technical learnings in this design document: https://goo.gl/vJWF85
In the document, we focused on one of the four learnings we had with the FDW APIs. We ended up switching to the hook-API-based approach, and things went smoothly from there.
Best,
Sumedh
On Sat, Aug 29, 2015 at 7:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.
February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.
Anyway, I wanted to explain the work that has been happening around
sharding. As things move forward, I am increasingly convinced that write
scaling will be needed soon, that the XC approach is the only reasonable
way to do it, and that FDWs are the cleanest way to get it into community
Postgres.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/31/2015 02:47 PM, Robert Haas wrote:
> On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> First, let me put out there that I think the horizontal scaling project
>> which has buy-in from the community and we're working on is infinitely
>> better than the one we're not working on or is an underresourced fork.
>> So we're in agreement on that. However, I think there's a lot of room
>> for discussion; I feel like the FDW approach was decided in exclusive
>> meetings involving a very small number of people. The FDW approach
>> *may* be the right approach, but I'd like to see some rigorous
>> questioning of that before it's final.
>
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries. As far as (1) is concerned, we need declarative
> partitioning, which is being worked on by Amit Langote. As far as (2)
> is concerned, I hope and expect BDR, or technology derived therefrom,
> to eventually fill that need.

Well, maybe. If you look at pg_shard, you'll see that it works by
multiplexing writes to all copies. There's a good reason to do that; it
allows you to have a tight feedback loop between the success of writes
and the availability of "good" nodes. If you're depending on a separate
replication system to handle getting row copies from one shard to
another, then you need a different way to deal with bad nodes and with
inconsistency between copies of shards. That's why the existing
multinode non-relational databases don't separate replication from
writes, either. For that matter, if what you want is transactional,
fully ACID sharding, I really don't see a way to do it via BDR, since
BDR is purely asynchronous replication, as far as I know.

Also, if we want BDR to do this, that's pretty far afield of what BDR is
currently capable of, so someone will need to put serious work into it
rather than just assuming the functionality will show up.

> As far as (3) is concerned, why
> wouldn't we use the foreign data wrapper interface, and specifically
> postgres_fdw? That interface was designed for the explicit purpose of
> allowing access to remote data sources, and a lot of work has been put
> into it, so it would be highly surprising if we decided to throw that
> away and develop something completely new from the ground up.

Well, query hooks are also a capability which we already have, and they
are mature. Citus has already posted about why they chose to use them
instead. As long as you recognize that the FDW API (not just the
existing FDWs) will need to expand to make this work, it's a viable
path. Also consider that (3) includes both reads and writes.

> I think it's abundantly clear that we need a logical replication
> solution as part of any horizontal scalability story. People will
> want to do things like have 10 machines with each piece of data on 3
> of them, and there won't be any reasonable way of doing that without
> logical replication. I assume that BDR, or some technology derived
> from it, will end up in core and solve that problem. I had actually
> hoped we were going to get that in 9.5, but it didn't happen that way.
> Still, I think that getting first single-master, and then eventually
> multi-master, logical replication in core is absolutely critical. And
> not just for sharding specifically: replicating your whole database to
> several nodes and load-balancing your clients across them isn't
> sharding, but it does give you read scalability and is a good fit for
> people with geographically dispersed data with good geographical
> locality. I think a lot of people will want that.

Well, the latter thing is something which BDR is designed for, so all
that needs to happen with that is getting the rest of the plumbing into
core. Also documentation, packaging, productization, etc. But the heavy
lifting has already been done. However, integrating BDR with sharding
has some major design issues which aren't trivial and may be
unresolvable, per above.

> I'm not quite sure yet how we can marry declarative partitioning and
> better FDW-pushdown and logical replication into one seamless, easy to
> deploy solution that requires very low administrator effort. But I am
> sure that each of those things, taken individually, is very useful,
> and that being able to construct a solution from those building blocks
> would be a big improvement over what we have today. I can't imagine
> that trying to do one monolithic project that provides all of those
> things, but only if you combine them in the specific way that the
> designer had in mind, is ever going to be successful. People _will_
> want access to each of those features in an unbundled fashion. And,
> trying to do them altogether leads to trying to solve too many
> problems at once. I think the history of Postgres-XC is a cautionary
> tale there.

Yes. It's also a cautionary tale about not skipping over major design
elements (like HA and DR) until after version 1.0, which is one of the
reasons I'm harping on certain things here. I don't want us to repeat
those mistakes.

> I don't really understand how pg_shard fits into this equation. It
> looks to me like it does some interesting things but, for example, it
> doesn't support JOIN pushdown, and suggests that you use the
> proprietary CitusDB engine if you need that. But I think JOIN
> pushdown is something we want to have in core, not something where we
> want to point people to proprietary alternatives. And it has some
> restrictions on INSERT statements - they have to contain only values
> which are constants or which can be folded to constants. I'm just
> guessing, but I bet that's probably due to some limitation which
> pg_shard, being out of core, has difficulty overcoming, but we can do
> better in core. Basically I guess I expect much of what pg_shard does
> to be subsumed as we improve FDWs, but maybe not all of it.

pg_shard provides an alternate implementation based on planner hooks
instead of FDWs. Even if you pursue an FDW-based design, you should look
at (a) why the Citus team found FDWs to be unworkable and (b) what got
implemented in planner hooks. Otherwise you're liable to repeat the
exact same "learning experience".

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, 2015-08-31 at 22:21 +0000, Robert Haas wrote:
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries. [...]

I believe there is another aspect to sharding that I have not yet seen
mentioned, which is connection routing. One use case that I have been
involved in is to simply partition the application into entirely, or
almost entirely, separate datasets running on separate databases, with
little or no need for queries to access remote data. This allows each
database to deal only with connections from clients that actually want
its local data, greatly reducing the number of connections on any
individual database. If this works for your application, your ability
to scale is great.

The pain point comes from trying to route queries to the correct
database. Inevitably, everyone taking this route builds custom
connection-selection layers into their apps. It seems to me that even
with the more sophisticated types of sharding being discussed here, the
ability to conditionally route a query/connection to a suitable
starting database could be quite beneficial. Although this is probably
a job for the pgbouncer/pgpool developers rather than the hackers on
this list, this thread seems to be a good place to mention it.

--
Marc
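Marc's custom connection-selection layer can be sketched in a few lines. Everything here (the shard map, the `route_for` helper) is a hypothetical illustration of what such app code tends to look like, not an existing pgbouncer/pgpool feature:

```python
from zlib import crc32

# Hypothetical shard map: each entry is the DSN of a database that
# holds one slice of the application's dataset.
SHARD_DSNS = [
    "host=db0 dbname=app",
    "host=db1 dbname=app",
    "host=db2 dbname=app",
]

def route_for(partition_key: str) -> str:
    """Choose the database that owns this key, so the client connects
    only to the node holding its data."""
    bucket = crc32(partition_key.encode("utf-8")) % len(SHARD_DSNS)
    return SHARD_DSNS[bucket]
```

The point is only that every client currently embeds some version of this logic; a router in front of the databases would centralize it.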
On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
> > As far as (3) is concerned, why
> > wouldn't we use the foreign data wrapper interface, and specifically
> > postgres_fdw? That interface was designed for the explicit purpose of
> > allowing access to remote data sources, and a lot of work has been put
> > into it, so it would be highly surprising if we decided to throw that
> > away and develop something completely new from the ground up.
>
> Well, query hooks are also a capability which we already have, and they
> are mature. Citus has already posted about why they chose to use them
> instead.
>
> As long as you recognize that the FDW API (not just the existing FDWs)
> will need to expand to make this work, it's a viable path.

Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want to confirm that everyone agrees that major FDW improvements are
a requirement for this.

My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific. I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start. :-)

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian wrote:
> My hope is that many FDW improvements will benefit sharding and
> non-sharding workloads, but I bet some improvements are going to be
> sharding-specific. I would say we are still in the exploratory stage,
> but based on the number of people who care about this feature and want
> to be involved, I think we are off to a very good start. :-)

Having lots of interested people doesn't help with some problems,
though. The Citus document says:

    And the issue with these four limitations wasn't with foreign
    data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
    quite happy with the contract FDWs provide. The problem was that
    we were trying to retrofit an API for something that it was
    fundamentally not designed to do.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015/09/01 9:54, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
>>> As far as (3) is concerned, why
>>> wouldn't we use the foreign data wrapper interface, and specifically
>>> postgres_fdw? That interface was designed for the explicit purpose of
>>> allowing access to remote data sources, and a lot of work has been put
>>> into it, so it would be highly surprising if we decided to throw that
>>> away and develop something completely new from the ground up.
>>
>> Well, query hooks are also a capability which we already have, and they
>> are mature. Citus has already posted about why they chose to use them
>> instead.
>>
>> As long as you recognize that the FDW API (not just the existing FDWs)
>> will need to expand to make this work, it's a viable path.
>
> Uh, we already have a list of things we need to add to FDWs to make them
> work, and Citus Data has provided a document of more things that are
> needed, https://goo.gl/vJWF85.

My top priority is postgres_fdw join pushdown, but I also plan to work
on update pushdown [1] for 9.6, which couldn't make it into 9.5. I think
that would resolve the following issue mentioned in the document:

    UPDATE and DELETE operations are performed by first fetching
    records from the table scanning functions, and then going over the
    fetched records. If the user wanted to update a single row, this
    involved first pulling rows and then updating related records.

Best regards,
Etsuro Fujita

[1] https://commitfest.postgresql.org/4/162/
On Tue, Sep 1, 2015 at 3:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries. As far as (1) is concerned, we need declarative
> partitioning, which is being worked on by Amit Langote. As far as (2)
> is concerned, I hope and expect BDR, or technology derived therefrom,
> to eventually fill that need. As far as (3) is concerned, why
> wouldn't we use the foreign data wrapper interface, and specifically
> postgres_fdw? That interface was designed for the explicit purpose of
> allowing access to remote data sources, and a lot of work has been put
> into it, so it would be highly surprising if we decided to throw that
> away and develop something completely new from the ground up.
>
> It's true that postgres_fdw doesn't do everything we need yet. The
> new join pushdown hooks aren't used by postgres_fdw yet, and the API
> itself has some bugs with EvalPlanQual handling. Aggregate pushdown
> is waiting on upper planner path-ification. DML pushdown doesn't
> exist yet, and the hooks that would enable pushdown of ORDER BY
> clauses to the remote side aren't being used by postgres_fdw. But all
> of these things have been worked on. Patches for many of them have
> already been posted. They have suffered from a certain amount of
> neglect by senior hackers, and perhaps also from a shortage of time on
> the part of the authors. But an awful lot of the work that is needed
> here has already been done, if only we could get it committed.
> Aggregate pushdown is a notable exception, but abandoning the foreign
> data wrapper approach in favor of something else won't fix that.
>
> Postgres-XC developed a purpose-built system for talking to other
> nodes instead of using the FDW interface, for the very good reason
> that the FDW interface did not yet exist at the time that Postgres-XC
> was created. But several people associated with the XC project have
> said, including one on this thread, that if it had existed, they
> probably would have used it. And it's hard to see why you wouldn't:
> with XC's approach, the remote data source is presumed to be
> PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
> facility as part of a sharding solution. The FDW interface can talk
> to anything, and it can be used for stuff other than sharding, like
> making one remote table appear local because you just happen to want
> that for some reason. This makes the XC approach look rather brittle
> by comparison. I don't blame the XC folks for taking the shortest
> path between two points, but FDWs are better, and we ought to try to
> leverage that.
In my discussions on this topic with various folks, including Robert, I've conceded that if the FDW interface had been available when XC was first written, in all likelihood we would have used and extended it. But that wasn't the case, and we did what we thought was the best solution at the time, given the resources and the schedule. To be honest, when the XC project was started, I was quite skeptical about the whole thing, given that the goal was to build something that could replace Oracle RAC with maybe less than 1% of the resources Oracle must have invested in building RAC. The lack of resources at the start of the project keeps showing up in the quality issues that users report from time to time. Having said that, I am quite satisfied with what we have been able to build within those constraints.
But FDWs are just one part of the story. There is the entire global consistency problem, which would require something like a GTM to give out XIDs and snapshots; atomicity, which would require managing transactions across multiple shards; join pushdown when all data is not available locally, something that XL is attempting to solve with datanode-to-datanode exchange of information; other global state such as sequences; replicating some part of the data to multiple shards for efficient operations; the ability to add/remove shards with the least disruption; and globally consistent backups/restores. XC/XL has attempted to solve each of these to some extent. I don't claim that they are completely solved and that there are no corner cases left, but we have made fairly good progress on each of them.
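The GTM role described above --- a single authority for XIDs and snapshots, so that every shard applies the same visibility test --- can be sketched as a toy model. This is an illustrative simplification, not the actual XC code:

```python
class ToyGTM:
    """Toy global transaction manager: hands out XIDs and snapshots
    centrally so all shards agree on which transactions are visible."""

    def __init__(self):
        self._next_xid = 1
        self._in_progress = set()

    def begin(self) -> int:
        xid = self._next_xid
        self._next_xid += 1
        self._in_progress.add(xid)
        return xid

    def commit(self, xid: int) -> None:
        self._in_progress.discard(xid)

    def snapshot(self):
        # xmax: first XID not yet assigned; xip: transactions in progress.
        return self._next_xid, frozenset(self._in_progress)

def row_visible(creating_xid: int, snapshot) -> bool:
    # The test every shard applies identically: a row is visible iff its
    # creating transaction began before the snapshot and is not still
    # running in it.
    xmax, xip = snapshot
    return creating_xid < xmax and creating_xid not in xip
```

Because every shard evaluates `row_visible` against the same centrally issued snapshot, a multi-shard query sees one consistent cut of the data; that central authority is also the scaling bottleneck the XC design has to manage.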
My worry is that if we start implementing them again from scratch, it will take a few years before we get them into a usable state. What XC/XL lacked was probably a Robert Haas or a Tom Lane who could look at the work and suggest major edits. If that had happened, the quality of the product could have been much better today. I don't mean to denigrate the developers who worked on XC/XL, but there is no harm in accepting that if someone with a much better understanding of the whole system had been part of the team, that would have positively impacted the project. Is that an angle worth exploring? Does it make sense to commit some more resources to, say, XC or XL and try to improve the quality of the product even further? To be honest, XL is in far, far better shape (I haven't really tried XC in a while), and some more QA/polishing could make it production-ready much sooner.
Yet another possibility is to rework the design such that only the coordinator needs to be a fork of PostgreSQL, while the shards are all plain PostgreSQL instances, queried using standard client APIs. That would reduce the code that needs to go into core to build the entire scalable system, and also shorten the timeline considerably.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> Uh, we already have a list of things we need to add to FDWs to make them
> work, and Citus Data has provided a document of more things that are
> needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
> you want to confirm that everyone agrees that major FDW improvements are
> a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out, it's going to save us some work
and make it more realistic to get there sometime not too far off. But
I'm afraid that the resulting system will feel like our current
partitioning implementation. Yes, it kinda works, but it's hard to get
started, it doesn't support too many features, and you're kind of afraid
your relatives will see what you've done.
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
> I'm also going to pontificate that, for a future solution, we should not
> focus on write *IO*, but rather on CPU and RAM. The reason for this
> thinking is that, with the latest improvements in hardware and 9.5
> improvements, it's increasingly rare for machines to be bottlenecked on
> writes to the transaction log (or the heap). This has some implications
> for system design. For example, solutions which require all connections
> to go through a single master node do not scale sufficiently to be worth
> bothering with.

Well, I highlighted write IO for sharding because sharding is the only
solution that allows write scaling. If we want to scale CPU, we are
better off using server parallelism, and to scale CPU and RAM, a
multi-master/BDR solution seems best. (Multi-master doesn't do write
scaling because you eventually have to write all the data to each
node.) For read scaling, having the read queries go to a streaming
slave seems best.

Only sharding allows for write scaling, but only in very limited
use-cases, where the queries use multiple shards and the
sharding/communication overhead doesn't overwhelm the benefits. For
these reasons, I think sharding has limited use, and hence I don't
think the community will be willing to add a lot of code just to enable
auto-sharding. I think it has to be done in a way that adding sharding
also gives other benefits, like better FDWs and cross-node ACID
control.

In summary, I don't think adding a ton of code just to do sharding will
be acceptable. A corollary of that is that if FDWs are unable to
provide useful sharding, I don't see an acceptable way of adding
built-in sharding to Postgres.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 10:15:27AM +0200, Andres Freund wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > Uh, we already have a list of things we need to add to FDWs to make them
> > work, and Citus Data has provided a document of more things that are
> > needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
> > you want to confirm that everyone agrees that major FDW improvements are
> > a requirement for this.
>
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
>
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implementation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind of afraid your
> relatives will see what you've done.

Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another. ;-)

See my post to Josh Berkus just now --- I think if we don't use FDWs,
sharding is such a limited use-case that we will not implement it
inside of Postgres.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
> My worry is that if we start implementing them again from scratch, it will take
> a few years before we get them into a usable state. What XC/XL lacked was probably
> a Robert Haas or a Tom Lane who could look at the work and suggest major edits.
> If that had happened, the quality of the product could have been much better
> today. I don't mean to denigrate the developers who worked on XC/XL, but there is
> no harm in accepting that if someone with a much better understanding of the
> whole system had been part of the team, that would have positively impacted the
> project. Is that an angle worth exploring? Does it make sense to commit some
> more resources to, say, XC or XL and try to improve the quality of the product
> even further? To be honest, XL is in far, far better shape (I haven't really tried
> XC in a while), and some more QA/polishing could make it production-ready much
> sooner.

There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it would
have been, I bet we would have considered it long ago.

I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities. (I think we
got a little too cute in enhancing existing Postgres capabilities to add
partitioning, and I think we need to fix that. Hopefully we don't do the
same thing with sharding.)

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 01/09/15 21:41, Bruce Momjian wrote:
> Well, reworking our partitioning system is one of the things required
> for sharding, so at least we will clean up one mess while we create
> another. ;-)
>
> See my post to Josh Berkus just now --- I think if we don't use FDWs,
> sharding is such a limited use-case that we will not implement it
> inside of Postgres.

I'm thinking that partitioning and sharding are two different things:

Partitioning is about reducing the amount of table data accessed, and
also perhaps easing admin activities (archiving/removing old stuff,
etc.).

Sharding is about parallelism and redundancy: copies of stuff in
different places, and concurrent access by virtue of it being on
different nodes!

Now *maybe* FDWs are a good way to approach this, but it really would be
nice to see a more rigorous analysis (I note that, like XC and XL,
Greenplum looked at the existing mechanisms around at the time and
ended up writing their own). Now I'm aware that things have moved on -
but I think there needs to be a proper discussion about design and what
we think distributed data/sharding etc. should provide *before*
grabbing hold of FDWs as *the answer*!

Regards

Mark
On Mon, Aug 31, 2015 at 3:08 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:
>>> We are also a bit disappointed by Huawei's position about the CSN
>>> patch; we hoped to use it for our XTM.
>>
>> Disappointed in what way? Moving to some sort of CSN approach seems to
>> open things up for different future ideas. In the short term, it would
>> mean replacing potentially large snapshots and longer visibility checks.
>> In the long term, perhaps CSN could help simplify the design of
>> multi-master replication schemes.
>
> We are disappointed because at the PGCon talk Huawei announced publishing
> of their CSN patch and further work in this direction together with the
> community. However, it's still not published, despite all the promises.
> Nobody from Huawei answers the CSN thread on hackers. So, I think we got
> nothing from Huawei except teasing, and we should rely only on ourselves.
> That is disappointing.
Oh, I see. I was in contact with some of the folks involved a few weeks ago. The intention is to do a bit more work on the patch and then post it. This should be happening "soon".
Thanks,
Mason
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > My hope is that many FDW improvements will benefit sharding and
> > non-sharding workloads, but I bet some improvements are going to be
> > sharding-specific. I would say we are still in the exploratory stage,
> > but based on the number of people who care about this feature and want
> > to be involved, I think we are off to a very good start. :-)
>
> Having lots of interested people doesn't help with some problems,
> though. The Citus document says:
>
>     And the issue with these four limitations wasn't with foreign
>     data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
>     quite happy with the contract FDWs provide. The problem was that
>     we were trying to retrofit an API for something that it was
>     fundamentally not designed to do.

I had a chance to review the Citus Data document just now:

https://goo.gl/vJWF85

Particularly, it links to this document, which is clearer about the
issues they are trying to solve:

https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

The document opens a big question --- when queries can't be processed
in a traditional top-down fashion, Citus has the goal of sending groups
of results up to the coordinator, reordering them, then sending them
back to the shards for further processing, basically using the shards
as compute engines, because the shards are no longer using only local
data to do their computations. The two examples they give are
COUNT(DISTINCT) and a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending data that is
as digested as possible to the coordinator, and having the coordinator
complete any remaining processing. I think we are going to need to
decide if such "sending data back to shards" is something we are ever
going to implement. I can see FDWs _not_ working well for that
use-case.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
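The COUNT(DISTINCT) case Bruce mentions is easy to see in miniature: per-shard distinct counts cannot simply be summed, so either each shard ships its distinct set to the coordinator (the "digested data" approach), or rows get redistributed by the aggregated column. A sketch of the first strategy, with made-up toy data:

```python
# Toy sharded table: each inner list is one shard's rows.
shards = [
    [{"country": "CANADA"}, {"country": "US"}],
    [{"country": "CANADA"}, {"country": "FRANCE"}],
]

def naive_count_distinct(column: str) -> int:
    # Wrong: values present on more than one shard are double-counted,
    # so per-shard COUNT(DISTINCT) results cannot just be summed.
    return sum(len({row[column] for row in shard}) for shard in shards)

def count_distinct(column: str) -> int:
    # Workable: each shard sends only its distinct set (digested data);
    # the coordinator unions the sets and counts once.
    distinct = set()
    for shard in shards:
        distinct |= {row[column] for row in shard}
    return len(distinct)
```

Here the naive sum gives 4 while the true answer is 3, because "CANADA" appears on both shards. Shipping distinct sets fixes this case; queries whose digests are nearly as large as the data itself are where re-shuffling rows back to the shards starts to look attractive.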
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > > My hope is that many FDW improvements will benefit sharding and
> > > non-sharding workloads, but I bet some improvements are going to be
> > > sharding-specific. I would say we are still in the exploratory stage,
> > > but based on the number of people who care about this feature and want
> > > to be involved, I think we are off to a very good start. :-)
> >
> > Having lots of interested people doesn't help with some problems,
> > though. The Citus document says:
> >
> >     And the issue with these four limitations wasn't with foreign
> >     data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
> >     quite happy with the contract FDWs provide. The problem was that
> >     we were trying to retrofit an API for something that it was
> >     fundamentally not designed to do.
>
> I had a chance to review the Citus Data document just now:
>
> https://goo.gl/vJWF85
>
> Particularly, it links to this document, which is clearer about the
> issues they are trying to solve:
>
> https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
>
> The document opens a big question --- when queries can't be processed in
> a traditional top-down fashion, Citus has the goal of sending groups of
> results up to the coordinator, reordering them, then sending them back
> to the shards for further processing, basically using the shards as
> compute engines because the shards are no longer using local data to do
> their computations. The two examples they give are COUNT(DISTINCT) and
> a join across two sharded tables ("CANADA").
>
> I assumed these queries were going to be solved by sending data that is
> as digested as possible to the coordinator, and having the coordinator
> complete any remaining processing. I think we are going to need to
> decide if such "sending data back to shards" is something we are ever
> going to implement. I can see FDWs _not_ working well for that use-case.
For efficient inter-node joins with row shipping, FDWs may also not be easy to use. Maybe it is possible if we optionally pass in lists of other nodes and information about how they are partitioned, so data knows where to get shipped.
A challenge for planning with arbitrary copies of different shards is that sometimes you may be able to push down joins, sometimes not. Planning and execution get ugly. Maybe this can be simplified by having parent-child tables follow the same partitioning scheme.
Mason
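Mason's point --- joins are pushable only sometimes --- reduces to a co-location test the planner would have to make. The names below are purely illustrative, not pg_shard's or postgres_fdw's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistTable:
    name: str
    dist_column: str   # column the rows are distributed on
    scheme: tuple      # e.g. ("hash", 32): method and bucket count

def join_pushable(a: DistTable, b: DistTable, join_on: set) -> bool:
    """A join can run shard-locally only if both tables use the same
    partitioning scheme and the join condition is on the distribution
    columns, so matching rows are guaranteed to be co-located."""
    return a.scheme == b.scheme and (a.dist_column, b.dist_column) in join_on

orders = DistTable("orders", "customer_id", ("hash", 32))
customers = DistTable("customers", "id", ("hash", 32))
```

Joining on orders.customer_id = customers.id passes the test; joining on any other column means shipping rows between nodes. Parent-child tables sharing one partitioning scheme, as suggested above, make this test succeed by construction.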
On 09/01/2015 02:48 AM, Bruce Momjian wrote:
> On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
>
> There is no question that using XC/XL will get us to a usable solution
> faster, but see my recent post to Josh Berkus --- the additional code
> will be so burdensome that I doubt it would be accepted. If it was, I
> bet we would have considered it long ago.
>
> I think the only way we are going to get sharding into Postgres is to do
> it in a way that enhances existing Postgres capabilities.

So that we have XL again? This reads like: we don't want to merge a
bunch of code that a non-.Org project wrote; instead, we want to write
a bunch of code that will all but duplicate what that other project
wrote, because...?

Don't get me wrong, I am all about this goal, and if XL truly doesn't
fit, that's cool, but I think the idea that our project will somehow
reinvent a better wheel is naive.

> (I think we got a little too cute in enhancing existing Postgres
> capabilities to add partitioning, and I think we need to fix that.
> Hopefully we don't do the same thing with sharding.)

Not cute, lazy. We implemented the barest minimum within our existing
infrastructure, and we have been kicking ourselves ever since.

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On 09/01/2015 02:58 AM, Mark Kirkwood wrote: > On 01/09/15 21:41, Bruce Momjian wrote: > I'm thinking that partitioning and sharding are two different things: > > Partitioning is about reducing the amount of table data accessed and > also perhaps easing admin activities (archiving/removing old stuff etc). > > Sharding is about parallelism and redundancy...copies of stuff in > different places and concurrent access by virtue of it being on > different nodes! In our world, they are complementary. Consider partitioning that uses FDW tables with proper plan push down etc.... > > Now *maybe* FDW is a good way to approach this, but really would be nice > to see a more rigorous analysis (I note that like XC and XL, Greenplum > looked at the existing mechanisms around at the time and ended up > writing their own). Now I'm aware that things have moved on - but I > think there needs to be a proper discussion about design and what we > think distributed data/sharding etc should provide *before* grabbing > hold of FDW as *the answer*! Agreed. Sincerely, Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On Tue, Sep 01, 2015 at 10:15:27AM +0200, Andres Freund wrote: > On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote: > > Uh, we already have a list of things we need to add to FDWs to make them > > work, and Citus Data has provided a document of more things that are > > needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag > > you want to confirm that everyone agrees that major FDW improvements are > > a requirement for this. > > Several people saying that the FDW infrastructure isn't sufficient right > now is pretty far from implying that all of them agree that the FDW API > is the way to go. > > I'm not sure myself. If it works out it's going to save us some work and > make it more realistic to get there sometime not too far off. But I'm > afraid that the resulting system will feel like our current partitioning > implementation. Yes, it kinda works, but it's hard to get started, it > doesn't support too many features and you're kind of afraid your relatives > will see what you've done. Whatever we decide on, we can only count on built-in multi-node being adopted if all the needed bits ship with every PostgreSQL installation. If we require people to Install More Software™ in order to get a feature, we're going to lose a majority of our potential base. If it turns out we need the PostgreSQL FDW, and I believe some consensus is starting to gel around that, I can see not installing it in template1 by default. Even that's a questionable decision, as merely having the software in place does not credibly increase the attack surface, while leaving it out adds a fallible step to installation procedures. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > My worry is that if we start implementing them again from scratch, it will > take a few years before we get them in a usable state. What XC/XL lacked is > probably a Robert Haas or a Tom Lane who could look at the work and suggest > major edits. If that had happened, the quality of the product could have > been much better today. I don't mean to denigrate the developers who worked on > XC/XL, but there is no harm in accepting that if someone with a much better > understanding of the whole system was part of the team, that would have > positively impacted the project. Is that an angle worth exploring? Does it > make sense to commit some more resources to say XC or XL and try to improve > the quality of the product even further? To be honest, XL is in far far > better shape (haven't really tried XC in a while) and some more QA/polishing > can make it production ready much sooner. From my point of view, and EnterpriseDB's point of view, anything that doesn't go into the core PostgreSQL distribution isn't really getting us where we need to be. If there's code in XL that would be valuable to merge into core PostgreSQL, then let's do it. If the code cannot be used but there are lessons we can learn that will make what does go into core PostgreSQL better, let's learn them. However, I don't think it's serving anybody very well that we have the XC fork, and multiple forks of the XC fork, floating around out there and people are working on those instead of working on core PostgreSQL. The reality is that we don't have enough brainpower to spread it across 2 or 3 or 4 or 5 different projects and have all of them be good. The reality is, also, that horizontal scalability isn't an optional feature. There was a point in time at which the PostgreSQL project's official policy on replication was that it did not belong in core.
That was a bad policy; thankfully, it was reversed, and the result was Hot Standby and Streaming Replication, incredibly important technologies without which we would not be where we are today. Horizontal scalability is just as essential. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote: >> Uh, we already have a list of things we need to add to FDWs to make them >> work, and Citus Data has provided a document of more things that are >> needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag >> you want to confirm that everyone agrees that major FDW improvements are >> a requirement for this. > > Several people saying that the FDW infrastructure isn't sufficient right > now is pretty far from implying that all of them agree that the FDW API > is the way to go. > > I'm not sure myself. If it works out it's going to save us some work and > make it more realistic to get there sometime not too far off. But I'm > afraid that the resulting system will feel like our current partitioning > implementation. Yes, it kinda works, but it's hard to get started, it > doesn't support too many features and you're kind of afraid your relatives > will see what you've done. I'm not averse to making the "connect to the remote nodes" part of this solution use something other than the FDW infrastructure at some point in time if somebody's prepared to build something better. On the other hand, I think it's extremely clear that the FDW infrastructure has a large amount of potential upon which we have thoroughly failed to capitalize. Patches have already been written for UPDATE/DELETE pushdown and for join pushdown. Those patches have been around for some time, but progress has been slow. Core infrastructure exists to allow sort pushdown, but nobody's done anything with it. Aggregate pushdown hasn't happened yet due to the dependency on upper planner path-ification, but it's not as if some alternative to the FDW interface is going to dodge that problem. It would be a bad idea to cling blindly to the FDW infrastructure if it's fundamentally inadequate to do what we want.
On the other hand, it would also be a bad idea to set about recreating it without a really good reason, and - just to take one example - the fact that it doesn't currently push down DML operations to the remote side is not a really good reason to rewrite the whole thing. On the contrary, it's a reason to put some energy into the already-written patch which implements that optimization. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/01/2015 09:08 AM, Robert Haas wrote: > On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee > From my point of view, and EnterpriseDB's point of view, anything that > doesn't go into the core PostgreSQL distribution isn't really getting > us where we need to be. If there's code in XL that would be valuable > to merge into core PostgreSQL, then let's do it. If the code cannot > be used but there are lessons we can learn that will make what does go > into core PostgreSQL better, let's learn them. I think that is pretty much anybody in the industry's point of view and I agree with it. Although there is definitely an argument to be had about whether it is a module or it is in core proper, no matter what it should be a .Org project. You are also correct on the replication statement. Replication should have never been outside of core and I say that as the company that wrote a replication solution that even Fortune 50 companies used. I think that we should consider that there are very smart, very talented engineers already devoting time to their respective communities (XL/XC) that may very much enjoy being part of our community so we don't have to reinvent the wheel, again. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote: > I assumed these queries were going to be solved by sending as digested > data as possible to the coordinator, and having the coordinator complete > any remaining processing. I think we are going to need to decide if > such "sending data back to shards" is something we are ever going to > implement. I can see FDWs _not_ working well for that use-case. I do think we are going to want to support that. All the people I've talked to about parallel and distributed query processing agree that you need to do that sort of thing to get really good and scalable performance. I think that we could make a lot of headway as compared with the status quo just by implementing more pushdown optimizations than we have today. Right now, SELECT COUNT(*) FROM table will suck back the whole remote table and count the rows locally, and that's stupid. We can fix that case with better pushdown logic. We can also fix the case of N-way join nests where the joins are either on the partitioning key or to replicated tables. But suppose you have a join between two tables which are sharded across the cluster but not on the partitioning key. There's no way to push the join down, so all the work comes back to the coordinator, which is possibly OK if such queries are rare, but not so hot if they are frequent. To leverage the full CPU power of the cluster in such a case, you need to be able to shuffle data around between the nodes. You pick one of the two tables being joined, and based on the partitioning key of that table, each node examines the other table and, for each row, sends it to the machine where it MIGHT have one or more join partners. Then each node can join its shard of the first table against the rows from the second table that were sent to it. Now the question is, where should the code that does all of this live? postgres_fdw? Some new, sharding-specific FDW? In core?
I don't know for sure, but what I do know is that we could make a lot of progress over where we are today by just improving postgres_fdw, and I don't think those improvements are even all that difficult. If we decide we need to implement something new, it's going to be a huge project that will take years to complete, with uncertain results. I'd rather have a postgres_fdw-based implementation that is imperfect and can't handle some kinds of queries in 9.6 than a promise that by 9.9 we'll have something really great that handles MPP perfectly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
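The shuffle step Robert describes can be sketched in a few lines of Python. This is only an illustration of the technique, not anything postgres_fdw does today; the node count, hash routing, and row layout are all hypothetical:

```python
# Sketch of the repartitioning ("shuffle") step for a join between two
# tables sharded on different keys. Table A is sharded on its join key;
# table B is not, so each node must ship its local B rows to the node
# that might hold join partners. Everything here is illustrative.
NUM_NODES = 4

def node_for(key):
    # Route a row by hashing the join key, the same way table A's rows
    # were distributed in the first place.
    return hash(key) % NUM_NODES

def shuffle(b_shards):
    """Each node scans its local shard of B and sends every row to the
    node where join partners from A might live. (Modeled here as one
    global set of outboxes for simplicity; in reality each node would
    send over the network.)"""
    outboxes = [[] for _ in range(NUM_NODES)]
    for shard in b_shards:
        for row in shard:
            outboxes[node_for(row["join_key"])].append(row)
    return outboxes

def local_join(a_shard, b_rows):
    # After the shuffle, each node joins its A shard against only the
    # B rows that were shipped to it.
    by_key = {}
    for b in b_rows:
        by_key.setdefault(b["join_key"], []).append(b)
    return [(a, b) for a in a_shard for b in by_key.get(a["join_key"], [])]
```

The point of the sketch is that the union of the per-node joins equals the full join, while each node only ever touches its own A shard plus the B rows routed to it.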
On Tue, Sep 1, 2015 at 7:01 AM, Mason S <masonlists@gmail.com> wrote: > For efficient inter-node joins with row shipping, FDWs may also not be easy > to do. Maybe it is possible if we optionally pass in lists of other nodes > and information about how they are partitioned so data knows where to get > shipped. > > A challenge for planning with arbitrary copies of different shards is that > sometimes you may be able to push down joins, sometimes not. Planning and > execution get ugly. Maybe this can be simplified by parent-child tables > following the same partitioning scheme. This gets at a problem which Ozgun also mentioned in his Google document, and which I also discussed with Etsuro Fujita at PGCon: good query planning requires good metadata, and we don't really have that today. I think that a big part of the goal of the declarative partitioning work that Amit Langote has recently been undertaking is to provide a catalog representation of the partitioning structure that is easy to work with, as opposed to just having a bunch of CHECK constraints that you have to try to reason about. That's one part of the solution. You also might need to know more about the remote table than is captured by the column and data type list. Most particularly, you might want to know what indexes exist on the remote side, but currently, to figure that out, you'd need to send queries to retrieve that information to the remote side every time you do planning, or maybe you could contrive a session-lifespan cache. That's pretty annoying. One idea for solving this problem is to allow CREATE INDEX on foreign tables, but I don't like that much. There's no guarantee that the remote side is a PostgreSQL instance, and if it isn't, the relevant details about the indexes that exist may not be convenient to represent in our catalogs. Heck, that can be true even if it is a PostgreSQL instance, if the remote side relies on an AM or a function that doesn't exist locally.
But even if both sides are PostgreSQL instances using only btree indexes on raw columns, now you've put the burden on the DBA to make sure that the index definitions on the local and remote sides match, and that's a pain in the neck. What seems better to me is to allow ANALYZE of a foreign table a place to record an arbitrary blob of metadata about the remote side that it can then get access to during planning. Then, you can record details about indexes, or statistics that don't fit into the mold of pg_statistic, or really, anything else you're going to need to figure out the best plan, and if the DBA changes the configuration on the remote side, they don't need to update the local configuration to match; a re-ANALYZE will do the trick. I'm open to other ideas as well. Repartitioning will also arise for an all-local parallel join, and it would be nice if the planner smarts could be shared between that case and the remote-table case. Therefore, while I suspect that some of the logic here will end up inside one or more FDWs, I'm pretty confident that a significant chunk of it needs to go into the core optimizer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/01/2015 02:39 AM, Bruce Momjian wrote: > On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote: >> I'm also going to pontificate that, for a future solution, we should not >> focus on write *IO*, but rather on CPU and RAM. The reason for this >> thinking is that, with the latest improvements in hardware and 9.5 >> improvements, it's increasingly rare for machines to be bottlenecked on >> writes to the transaction log (or the heap). This has some implications >> for system design. For example, solutions which require all connections >> to go through a single master node do not scale sufficiently to be worth >> bothering with. > > Well, I highlighted write IO for sharding because sharding is the only > solution that allows write scaling. If we want to scale CPU, we are > better off using server parallelism, and to scale CPU and RAM, a > multi-master/BDR solution seems best. (Multi-master doesn't do write > scaling because you eventually have to write all the data to each node.) You're assuming that our primary bottleneck for writes is IO. It's not at present for most users, and it certainly won't be in the future. You need to move your thinking on systems resources into the 21st century, instead of solving the resource problems from 15 years ago. Currently, CPU resources and locking are the primary bottlenecks on writing for the vast majority of the hundreds of servers I tune every year. This even includes AWS, with EBS's horrible latency; even in that environment, most users can outstrip PostgreSQL's ability to handle requests by getting 20K PRIOPs. Our real future bottlenecks are: * ability to handle more than a few hundred connections * locking limits on the scalability of writes * ability to manage large RAM and data caches The only place where IO becomes the bottleneck is for the batch-processing, high-throughput DW case ... and I would argue that existing forks already handle that case. 
Any sharding solution worth bothering with will solve some or all of the above by extending our ability to process requests across multiple nodes. Any solution which does not is merely an academic curiosity. > For these reasons, I think sharding has a limited use, and hence, I > don't think the community will be willing to add a lot of code just to > enable auto-sharding. I think it has to be done in a way that adding > sharding also gives other benefits, like better FDWs and cross-node ACID > control. > > In summary, I don't think adding a ton of code just to do sharding will > be acceptable. A corollary of that, is that if FDWs are unable to > provide useful sharding, I don't see an acceptable way of adding > built-in sharding to Postgres. So, while I am fully in agreement with you about having side benefits to our sharding tools, I think you're missing the big picture entirely. In a few years, clustered/sharded PostgreSQL will be the default installation, or we'll be a legacy database. Single-node and single-master databases are rapidly becoming history. From my perspective, we don't need an awkward, limited, bolt-on solution for write-scaling. We need something which will become core to how PostgreSQL works. I just don't see us getting there with the described FDW approach, which is why I keep raising issues with it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote: > You're assuming that our primary bottleneck for writes is IO. It's not > at present for most users, and it certainly won't be in the future. You > need to move your thinking on systems resources into the 21st century, > instead of solving the resource problems from 15 years ago. Your experience doesn't match mine. I find that it's frequently impossible to get the system to use all of the available CPU capacity, either because you're bottlenecked on locks or because you are bottlenecked on the I/O subsystem, and with the locking improvements in newer versions, the former is becoming less and less common. Amit's recent work on scalability demonstrates this trend: he goes looking for lock bottlenecks, and finds problems that only occur at 128+ concurrent connections running full tilt. The patches show limited benefit - a few percentage points - at lesser concurrency levels. Either there are other locking bottlenecks that limit performance at lower client counts but which mysteriously disappear as concurrency increases, which I would find surprising, or the limit is somewhere else. I haven't seen any convincing evidence that the I/O subsystem is the bottleneck, but I'm having a hard time figuring out what else it could be. > Our real future bottlenecks are: > > * ability to handle more than a few hundred connections > * locking limits on the scalability of writes > * ability to manage large RAM and data caches I do agree that all of those things are problems, FWIW. > Any sharding solution worth bothering with will solve some or all of the > above by extending our ability to process requests across multiple > nodes. Any solution which does not is merely an academic curiosity. I think the right solution to those problems is to attack them head-on. Sharding solutions should cater to use cases where using all the resources of one machine isn't sufficient no matter how efficiently we do it. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 09/01/2015 10:06 AM, Josh Berkus wrote: > On 09/01/2015 02:39 AM, Bruce Momjian wrote: >> On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote: > Our real future bottlenecks are: > > * ability to handle more than a few hundred connections This, 1000 times this. No, a connection pooler doesn't help; we are using 1000 connections WITH a connection pooler. The amount of data-centric, always-connected (even if stateless) apps that are out there is astounding. Our performance curve is better (especially since Robert's fantastic work on 9.2) but it isn't where a lot of the demanding users need to be. We are also seeing a vast uptick in always-connected apps that aren't web centric (or the client at least isn't). In other words we are seeing situations where 1000 connections is 1000 users, connected. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On 09/01/2015 10:17 AM, Robert Haas wrote: > On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Any sharding solution worth bothering with will solve some or all of the >> above by extending our ability to process requests across multiple >> nodes. Any solution which does not is merely an academic curiosity. > > I think the right solution to those problems is to attack them > head-on. Sharding solutions should cater to use cases where using all > the resources of one machine isn't sufficient no matter how > efficiently we do it. As long as "all the resources" != "just IO", I'm completely on board with that. The reason I raised this is that the initial FDW-based proposals pretty much scale IO and nothing else. pg_shard also currently only scales IO, but they're working on that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Hi, On 09/01/2015 07:17 PM, Robert Haas wrote: > On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote: >> You're assuming that our primary bottleneck for writes is IO. It's >> not at present for most users, and it certainly won't be in the >> future. You need to move your thinking on systems resources into >> the 21st century, instead of solving the resource problems from 15 >> years ago. > > Your experience doesn't match mine. I find that it's frequently > impossible to get the system to use all of the available CPU > capacity, either because you're bottlenecked on locks or because you > are bottlenecked on the I/O subsystem, and with the locking > improvements in newer versions, the former is becoming less and less > common. Amit's recent work on scalability demonstrates this trend: he > goes looking for lock bottlenecks, and finds problems that only occur > at 128+ concurrent connections running full tilt. The patches show > limited benefit - a few percentage points - at lesser concurrency > levels. Either there are other locking bottlenecks that limit > performance at lower client counts but which mysteriously disappear > as concurrency increases, which I would find surprising, or the limit > is somewhere else. I haven't seen any convincing evidence that the > I/O subsystem is the bottleneck, but I'm having a hard time figuring > out what else it could be. Memory bandwidth, for example. It's quite difficult to spot, because the intuition is that memory is fast, but thanks to improvements in storage (and stagnation in RAM bandwidth), this is becoming a significant issue. Process-management overhead is another thing we tend to ignore, but once you get to many processes all willing to work at the same time, you need to account for that. Of course, this applies differently to different sharding use cases.
For example, analytics workloads have serious issues with memory bandwidth, but not so much with process management overhead (because the number of connections is usually about the number of cores). Use cases with many clients (in web-scale use cases) tend to run into both (all the processes also have to share all the caches, killing them). I don't know if sharding can help solving (or at least improve) these issues. And if sharding in general can, I don't know if it still holds for FDW-based solution. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Memory bandwidth, for example. It's quite difficult to spot, because the > intuition is that memory is fast, but thanks to improvements in storage (and > stagnation in RAM bandwidth), this is becoming a significant issue. I'd appreciate any tips on how to spot problems of this type. But it's my impression that perf, top, vmstat, and other Linux performance tools will count time spent waiting for memory as CPU time, not idle time. If that's correct, that wouldn't explain workloads where CPU utilization doesn't reach 100%. Rather, it would show up as CPU time hitting 100% while tps remains low. > Process-management overhead is another thing we tend to ignore, but once you > get to many processes all willing to work at the same time, you need to > account for that. Any tips on spotting problems in that area? Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-09-01 14:11:21 -0400, Robert Haas wrote: > On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Memory bandwidth, for example. It's quite difficult to spot, because the > > intuition is that memory is fast, but thanks to improvements in storage (and > > stagnation in RAM bandwidth), this is becoming a significant issue. > > I'd appreciate any tips on how to spot problems of this type. But > it's my impression that perf, top, vmstat, and other Linux performance > tools will count time spent waiting for memory as CPU time, not idle > time. If that's correct, that wouldn't explain workloads where CPU > utilization doesn't reach 100%. Rather, it would show up as CPU time > hitting 100% while tps remains low. Yea. -e bus-cycles is a good start to measure where bus traffic is relevant. Depending on the individual cpu other events can be helpful. > > Process-management overhead is another thing we tend to ignore, but once you > > get to many processes all willing to work at the same time, you need to > > account for that. > > Any tips on spotting problems in that area? Not perfect, but -e context-switches (general context switches) and -e syscalls:sys_enter_semop (for postgres enforced context switches) is rather useful when combined with --call-graph dwarf ('fp' sometimes doesn't see through libc which is most of the time not compiled with -fno-omit-frame-pointer). Greetings, Andres Freund
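Assembled into concrete invocations, the events Andres mentions look roughly like the following. The PID and the 10-second sampling window are illustrative, and `--call-graph dwarf` needs binaries built with debug info:

```shell
# Bus traffic (a rough proxy for memory-bandwidth pressure) on one
# backend; 12345 is a placeholder backend PID.
perf stat -e bus-cycles -p 12345 -- sleep 10

# General context switches, with call graphs, to see where they originate.
perf record -e context-switches --call-graph dwarf -p 12345 -- sleep 10

# Postgres-enforced context switches (semop-based lock waits).
perf record -e syscalls:sys_enter_semop --call-graph dwarf -p 12345 -- sleep 10

perf report
```

These are command-line fragments, not a script; run them as root (or with suitable perf_event_paranoid settings) against a live backend.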
Hi, On 08/31/2015 10:16 PM, Josh Berkus wrote: > It's also important to recognize that there are three major use-cases > for write-scalable clustering: > > * OLTP: small-medium cluster, absolute ACID consistency, > bottlenecked on small writes per second > * DW: small-large cluster, ACID optional, > bottlenecked on bulk reads/writes > * Web: medium to very large cluster, ACID optional, > bottlenecked on # of connections > > We cannot possibly solve all of the above at once, but to the extent > that we recognize all 3 use cases, we can build core features which > can be adapted to all of them. It would be good to have a discussion about use-cases first - each of us is mostly concerned about the use cases they're dealing with, with bottlenecks specific to their environment. These three basic use-cases seem like a good start, but some of the details certainly don't match my experience ... For example I can't see how ACID can be optional for the DWH use-case, but maybe there's a good explanation - I can imagine sacrificing various ACID properties at the node level, but I can't really imagine sacrificing any of the ACID properties for the cluster as a whole. So this would deserve some explanation. I also don't share the view that write scalability is the only (or even main) issue that we should aim to solve. For the business-intelligence use cases I've been working on recently, handling complex read-only ad-hoc queries is often much more important. And in those cases the bottleneck is often CPU and/or RAM. > > I'm also going to pontificate that, for a future solution, we should > not focus on write *IO*, but rather on CPU and RAM. The reason for > this thinking is that, with the latest improvements in hardware and > 9.5 improvements, it's increasingly rare for machines to be > bottlenecked on writes to the transaction log (or the heap). This has > some implications for system design.
For example, solutions which > require all connections to go through a single master node do not > scale sufficiently to be worth bothering with. +1 > On some other questions from Mason: > >> Do we want multiple copies of shards, like the pg_shard approach? Or >> keep things simpler and leave it up to the DBA to add standbys? > > We want multiple copies of shards created by the sharding system itself. > Having a separate, and completely orthogonal, redundancy system to the > sharding system is overly burdensome on the DBA and makes low-data-loss > HA impossible. IMHO it'd be quite unfortunate if the design would make it impossible to combine those two features (e.g. creating standbys for shards and failing over to them). It's true that solving HA at the sharding level (by keeping multiple copies of each shard) may be simpler than combining sharding and standbys, but I don't see why it makes low-data-loss HA impossible. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 09/01/2015 08:22 PM, Andres Freund wrote: > On 2015-09-01 14:11:21 -0400, Robert Haas wrote: >> On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >>> Memory bandwidth, for example. It's quite difficult to spot, because the >>> intuition is that memory is fast, but thanks to improvements in storage (and >>> stagnation in RAM bandwidth), this is becoming a significant issue. >> >> I'd appreciate any tips on how to spot problems of this type. But >> it's my impression that perf, top, vmstat, and other Linux performance >> tools will count time spent waiting for memory as CPU time, not idle >> time. If that's correct, that wouldn't explain workloads where CPU >> utilization doesn't reach 100%. Rather, it would show up as CPU time >> hitting 100% while tps remains low. > > Yea. > > -e bus-cycles is a good start to measure where bus traffic is > relevant. Depending on the individual cpu other events can be helpful. long-story: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf It's from 2007 and only explains oprofile (chapter 7), which is mostly abandoned in favor of perf nowadays. Perf can produce similar stats, so the discussion is still valid. But it also shows cachegrind (valgrind module). perf examples: http://www.brendangregg.com/perf.html Most of the examples with "CPU" in the comment are relevant. Usually "perf stat" and "perf stat -d" are good starting points - once you get a lot of LLC misses or too few instructions per cycle, it's a sign of memory bandwidth problems. Sadly, this is partially caused by our volcano-style executor and sharding alone can do nothing about that. > >>> Process-management overhead is another thing we tend to ignore, but once you >>> get to many processes all willing to work at the same time, you need to >>> account for that. >> >> Any tips on spotting problems in that area?
> > Not perfect, but -e context-switches (general context switches) and -e > syscalls:sys_enter_semop (for postgres enforced context switches) is > rather useful when combined with --call-graph dwarf ('fp' sometimes > doesn't see through libc which is most of the time not compiled with > -fno-omit-frame-pointer). Right, this is about the best I'm aware of. The problem often is not in the number of context switches, but in the fact that all the processes share the same (very limited) CPU caches. Each process dirties the caches for the other processes, lowering the hit ratios, which can be spotted using the commands above. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/01/2015 11:36 AM, Tomas Vondra wrote: >> We want multiple copies of shards created by the sharding system itself. >> Having a separate, and completely orthogonal, redundancy system to the >> sharding system is overly burdensome on the DBA and makes low-data-loss >> HA impossible. > > IMHO it'd be quite unfortunate if the design would make it impossible to > combine those two features (e.g. creating standbys for shards and > failing over to them). > > It's true that solving HA at the sharding level (by keeping multiple > copies of each shard) may be simpler than combining sharding and > standbys, but I don't see why it makes low-data-loss HA impossible. Other way around, that is, having replication standbys as the only method of redundancy requires either high data loss or high latency for all writes. In the case of async rep, every time we fail over a node, the entire cluster would need to roll back to the last common known-good replay point, hence high data loss. In the case of sync rep, we are required to wait for at least double the network lag time in order to do a single write ... making write-scalability quite difficult. Further, if using replication the sharding system would have no way to (a) find out immediately if a copy was bad and (b) fail over quickly to a copy of the shard if the first requested copy was not responding. With async replication, we also can't use multiple copies of the same shard as a way to balance read workloads. If we write to multiple copies as a part of the sharding feature, then that can be parallelized, so that we are waiting only as long as the slowest write (or in failure cases, as long as the shard timeout). Further, we can check for shard-copy health and update shard availability data with each user request, so that the ability to see stale/bad data is minimized. There are obvious problems with multiplexing writes, which you can figure out if you knock pg_shard around a bit.
But I really think that solving those problems is the only way to go. Mind you, I see a strong place for binary replication and BDR for multi-region redundancy; you really don't want that to be part of the sharding system if you're aiming for write scalability. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
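As a sketch of the multiplexed-write idea described above (issue the write to every copy of a shard in parallel, wait only as long as the slowest write or the shard timeout, and record which copies confirmed so availability metadata can be updated). The function names and the SHARD_TIMEOUT value are hypothetical, not any existing pg_shard API:

```python
import concurrent.futures

SHARD_TIMEOUT = 2.0  # seconds; hypothetical per-shard write timeout


def write_to_copies(write_fn, copies, payload, timeout=SHARD_TIMEOUT):
    """Send the same write to every copy of a shard in parallel.

    Returns (confirmed, failed): the copies that acknowledged the write
    within the timeout, and the copies that errored or timed out, which
    the caller can mark as suspect in shard-availability metadata.
    """
    confirmed, failed = set(), set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(copies)) as pool:
        futures = {pool.submit(write_fn, c, payload): c for c in copies}
        done, not_done = concurrent.futures.wait(futures, timeout=timeout)
        for fut in done:
            if fut.exception() is None:
                confirmed.add(futures[fut])
            else:
                failed.add(futures[fut])  # write raised: copy is bad
        for fut in not_done:
            fut.cancel()
            failed.add(futures[fut])      # write timed out: copy suspect
    return confirmed, failed
```

The caller decides the success criterion, e.g. requiring a quorum of confirmed copies before acknowledging the client.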
On Tue, Sep 1, 2015 at 3:19 PM, Josh Berkus <josh@agliodbs.com> wrote: > Mind you, I see a strong place for binary replication and BDR for > multi-region redundancy; you really don't want that to be part of the > sharding system if you're aiming for write scalability. I mostly agree, but keep in mind that you don't have to replicate everything. A lot of people might find it useful to replicate the data that is common to all nodes even if they don't replicate the sharded data. Admittedly, there are some problems with snapshots here: if you don't do anything special about snapshots, then what you have here will be "eventually consistent" behavior. But that might be suitable for some environments, such as very loosely coupled system where not all nodes are connected all the time. And, for those environments where you do need consistent snapshots, we can imagine ways to get that behavior, like having the GTM consider the transaction uncommitted until it's been logically replicated to every node. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 09/01/2015 09:19 PM, Josh Berkus wrote: > On 09/01/2015 11:36 AM, Tomas Vondra wrote: >>> We want multiple copies of shards created by the sharding system >>> itself. Having a separate, and completely orthogonal, redundancy >>> system to the sharding system is overly burdensome on the DBA and >>> makes low-data-loss HA impossible. >> >> IMHO it'd be quite unfortunate if the design would make it >> impossible to combine those two features (e.g. creating standbys >> for shards and failing over to them). >> >> It's true that solving HA at the sharding level (by keeping >> multiple copies of each shard) may be simpler than combining >> sharding and standbys, but I don't see why it makes low-data-loss >> HA impossible. > > Other way around, that is, having replication standbys as the only > method of redundancy requires either high data loss or high latency > for all writes. I haven't said that. I said that we should allow that topology, not that it should be the only method of redundancy. > > In the case of async rep, every time we fail over a node, the entire > cluster would need to roll back to the last common known-good replay > point, hence high data loss. > > In the case of sync rep, we are required to wait for at least double > network lag time in order to do a single write ... making > write-scalability quite difficult. Which assumes that latency (or rather the increase due to syncrep) is a problem for the use case. Which may be the case for many use cases, but certainly is not a problem for many BI/DWH use cases performing mostly large batch loads. In those cases the network bandwidth may be quite an important resource. For example, assume that there are just two shards in two separate data centers, connected by a link with limited bandwidth. Now, let's assume you always keep a local replica for failover. So you have A1+A2 in DC1, B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also have to write data to B2 and wait for it.
So either you send the data to each node separately (consuming 2x the bandwidth), or send it to B1 and let it propagate to B2 e.g. through sync rep. So while you may be right in single-DC deployments, with multi-DC deployments the situation is quite different - not only that the network bandwidth is not unlimited, but because latencies within DC may be a fraction of latencies between the locations (to the extent that the increase due to syncrep may be just noise). So the local replication may be actually way faster. I can imagine forwarding the data between B1 and B2 even with a purely sharding solution, but at that point you effectively re-implemented syncrep. IMHO the design has to address the multi-DC setups somehow. I think that many of the customers who are so concerned about scaling to many shards are also concerned about availability in case of DC outages, no? We should also consider support for custom topologies (not just a full mesh, or whatever we choose as the default/initial topology), which is somehow related. > > Futher, if using replication the sharding system would have no way > to (a) find out immediately if a copy was bad and (b) fail over > quickly to a copy of the shard if the first requested copy was not > responding. With async replication, we also can't use multiple copies > of the same shard as a way to balance read workloads. I don't follow. With sync rep we do know whether the copy is OK or not, because the node either confirms writes or not. The failover certainly is more complicated and is not immediate (to the extent of keeping a copy at the sharding level), but it's a question of trade-offs. It's true we don't have auto-failover solution at the moment, but as I said - I can easily imagine most people using just sharding, while some deployments use syncrep with manual failover. 
> > If we write to multiple copies as a part of the sharding feature, > then that can be parallelized, so that we are waiting only as long as > the slowest write (or in failure cases, as long as the shard > timeout). Further, we can check for shard-copy health and update > shard availability data with each user request, so that the ability > to see stale/bad data is minimized. Again, this assumes infinite network bandwidth. > > There are obvious problems with multiplexing writes, which you can > figure out if you knock pg_shard around a bit. But I really think > that solving those problems is the only way to go. > > Mind you, I see a strong place for binary replication and BDR for > multi-region redundancy; you really don't want that to be part of > the sharding system if you're aiming for write scalability. I haven't mentioned BDR at all, and given the async nature I don't have a clear idea of how it fits into the sharding world at this point. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2015 at 08:18:38AM -0700, Joshua Drake wrote: > On 09/01/2015 02:48 AM, Bruce Momjian wrote: > >On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote: > > >There is no question that using XC/XL will get us to a usable solution > >faster, but see my recent post to Josh Berkus --- the additional code > >will be so burdensome that I doubt it would be accepted. If it was, I > >bet we would have considered it long ago. > > > >I think the only way we are going to get sharding into Postgres is to do > >it in a way that enhances existing Postgres capabilities. > > So that we have XL again? Kind of. If XC/XL had used FDWs, I think we would try to use their code first. The issue is that FDWs didn't exist at the time. I would say our first approach might be doing XC/XL again with FDWs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Sep 1, 2015, at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Admittedly, there are some problems with snapshots here: if you don't > do anything special about snapshots, then what you have here will be > "eventually consistent" behavior. But that might be suitable for some > environments, such as very loosely coupled system where not all nodes > are connected all the time. Given that we’re discussing multi-node architectures here, you should expect that not all nodes will be connected at any time. Nodes fail, but the cluster should not. > And, for those environments where you do > need consistent snapshots, we can imagine ways to get that behavior, > like having the GTM consider the transaction uncommitted until it's > been logically replicated to every node. Again, you need a way to deal with nodes going down. I can envision building a cluster with twelve nodes replicated to each of three geographically-distributed data centers. Each replication/sync model needs to be able to handle nodes going up and down, data centers or racks going up or down, and nodes being added and removed. But even with smaller clusters, there’s no way around the fact that no system can guarantee that all nodes will be available at all times. Best, David
On 09/01/2015 02:29 PM, Tomas Vondra wrote: > Hi, > > On 09/01/2015 09:19 PM, Josh Berkus wrote: >> Other way around, that is, having replication standbys as the only >> method of redundancy requires either high data loss or high latency >> for all writes. > > I haven't said that. I said that we should allow that topology, not that > it should be the only method of redundancy. Ah, OK, I didn't understand you. Of course I'm in favor of supporting both methods of redundancy if we can. >> In the case of sync rep, we are required to wait for at least double >> network lag time in order to do a single write ... making >> write-scalability quite difficult. > > Which assumes that latency (or rather the increase due to syncrep) is a > problem for the use case. Which may be the case for many use cases, but > certainly is not a problem for many BI/DWH use cases performing mostly > large batch loads. In those cases the network bandwidth may be quite > important resource. I'll argue that BI/DW is the least interesting use case for mainstream PostgreSQL because there are production-quality forks which do this (mostly proprietary, but we can work on that). We really need a solution which works for OLTP. > For example assume that there are just two shards in two separate data > centers, connected by a link with limited bandwidth. Now, let's assume > you always keep a local replica for failover. So you have A1+A2 in DC1, > B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also > have to write data to B2 and wait for it. So either you send the data to > each node separately (consuming 2x the bandwidth), or send it to B1 and > let it propagate to B2 e.g. through sync rep.
> > So while you may be right in single-DC deployments, with multi-DC > deployments the situation is quite different - not only that the network > bandwidth is not unlimited, but because latencies within DC may be a > fraction of latencies between the locations (to the extent that the > increase due to syncrep may be just noise). So the local replication may > be actually way faster. I'm not seeing how the above is better using syncrep than using shard copying? > I can imagine forwarding the data between B1 and B2 even with a purely > sharding solution, but at that point you effectively re-implemented > syncrep. Not really, the mechanism is different and the behavior is different. One critical deficiency in using binary syncrep is that you can't do round-robin redundancy at all; every redundant node has to be an exact mirror of another node. In a good HA distributed system, you want multiple shards per node, and you want each shard to be replicated to a different node, so that in the event of node failure you're not dumping the full load on one other server. > IMHO the design has to address the multi-DC setups somehow. I think that > many of the customers who are so concerned about scaling to many shards > are also concerned about availability in case of DC outages, no? Certainly. But users located in a single DC shouldn't pay the same overhead as users who are geographically spread. > I don't follow. With sync rep we do know whether the copy is OK or not, > because the node either confirms writes or not. The failover certainly > is more complicated and is not immediate (to the extent of keeping a > copy at the sharding level), but it's a question of trade-offs. > > It's true we don't have auto-failover solution at the moment, but as I > said - I can easily imagine most people using just sharding, while some > deployments use syncrep with manual failover. As long as direct shard copying is available, I'm happy. I have no complaints about additional mechanisms. 
I'm bringing this up because the FDW proposal made at pgCon did not include *any* mechanism for HA/redundancy, just some handwaving about replication and/or BDR. This was one of the critical design failures of PostgresXC. A multinode system without automated node failover and replacement is a low-availability system. >> If we write to multiple copies as a part of the sharding feature, >> then that can be parallelized, so that we are waiting only as long as >> the slowest write (or in failure cases, as long as the shard >> timeout). Further, we can check for shard-copy health and update >> shard availability data with each user request, so that the ability >> to see stale/bad data is minimized. > > Again, this assumes infinite network bandwidth. In what way is the total network bandwidth used in the system different for shard copying than for sync replication? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Sep 1, 2015 at 12:40:40PM -0400, Robert Haas wrote: > On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote: > > I assumed these queries were going to be solved by sending as digested > > data as possible to the coordinator, and having the coordinator complete > > any remaining processing. I think we are going to need to decide if > > such "sending data back to shards" is something we are ever going to > > implement. I can see FDWs _not_ working well for that use-case. > > I do think we are going to want to support that. All the people I've > talked to about parallel and distributed query processing agree that > you need to do that sort of thing to get really good and scalable > performance. I think that we could make a lot of headway as compared > with the status quo just by implementing more pushdown optimizations > than we have today. Right now, SELECT COUNT(*) FROM table will suck > back the whole remote table and count the rows locally, and that's > stupid. We can fix that case with better pushdown logic. We can also > fix the case of N-way join nests where the joins are either on the > partitioning key or to replicated tables. But suppose you have a join > between two tables which are sharded across the cluster but not on the > partitioning key. There's no way to push the join down, so all the > work comes back to the coordinator, which is possibly OK if such > queries are rare, but not so hot if they are frequent. Let me be clearer about what the Citus Data paper shows. I said originally that the data was sent to the coordinator, sorted, then resent to the shards, but the document: https://goo.gl/vJWF85 (https://www.citusdata.com/blog/114-how-to-build-your-distributed-database) has the shards create the groups and the groups are sent to the other shards.
For example, to do COUNT(DISTINCT) if you have three shards, then each shard breaks its data into 3 buckets (1B in size), then the first bucket from each of the three shards goes to the first shard, the second bucket goes to the second shard, etc. Basically, they are doing map-reduce, and the shards are creating additional batches that get shipped to other shards. I can see FDWs not working well in that case as you are really creating a new data layout just for the query. This explains why the XC/XL people are saying they would use FDWs if they had existed at the time they started development, while the Citus Data people are saying they couldn't use FDWs as they currently exist. They probably both needed FDW improvements, but I think the Citus Data features would need a lot more. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
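To make the bucket exchange concrete, here is a toy sketch of that repartitioning step: each shard hashes its values into one bucket per shard, bucket i from every shard is shipped to shard i, each shard counts distinct values in its own bucket range, and the coordinator only sums the per-shard counts. The function names are mine, not Citus code:

```python
from collections import defaultdict


def repartition(shard_rows, num_shards):
    """Phase 1: each source shard hashes its values into num_shards
    buckets; bucket i from every shard is "shipped" to shard i."""
    inboxes = defaultdict(list)  # destination shard -> values received
    for rows in shard_rows:      # one entry per source shard
        buckets = defaultdict(list)
        for v in rows:
            buckets[hash(v) % num_shards].append(v)
        for dest, vals in buckets.items():
            inboxes[dest].extend(vals)  # stands in for a network transfer
    return inboxes


def count_distinct(shard_rows, num_shards=3):
    """Phase 2: each destination shard counts distinct values in its
    bucket; the coordinator just sums the counts. Hash partitioning
    guarantees equal values land on the same destination, so no value
    is double-counted."""
    inboxes = repartition(shard_rows, num_shards)
    return sum(len(set(vals)) for vals in inboxes.values())
```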
On Tue, Sep 1, 2015 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote: >> You're assuming that our primary bottleneck for writes is IO. It's not >> at present for most users, and it certainly won't be in the future. You >> need to move your thinking on systems resources into the 21st century, >> instead of solving the resource problems from 15 years ago. > > Your experience doesn't match mine. I find that it's frequently > impossible to get the system to use all of the available CPU capacity, > either because you're bottlenecked on locks or because you are > bottlenecked on the I/O subsystem, and with the locking improvements > in newer versions, the former is becoming less and less common. I think you're both right. I think that we need to fix the buffer manager, to make its caching algorithm smarter. Since we're mostly using the filesystem cache, this is particularly important for PostgreSQL. We need to remember usage information for evicted blocks for some period of time afterwards. This problem is largely a problem with Postgres in particular, I suspect. At the same time, I agree with Josh's assessment that long-term, we are going to have the biggest problem with memory latency and memory bandwidth, which are usually considered facets of CPU performance, and with internal lock contention. Addressing the memory access bottleneck dovetails with parallelism, in that it must be considered alongside parallelism. Josh's "21st century" remark seems quite justified to me. For a further example of this, check out my latest progress with external sorting, which I plan to post later in the week. I/O isn't the big problem there at all, and I now think we can make external sorts close to internal sorts in performance across the board. I imagine that Josh's experience is based on workloads that mostly fit in shared_buffers, so I can see why you'd disagree if that was something you've seen less of. 
I'll quote Hellerstein and Stonebraker in 2007 [1]: "Copying data in memory can be a serious bottleneck. Copies contribute latency, consume CPU cycles, and can flood the CPU data cache. This fact is often a surprise to people who have not operated or implemented a database system, and assume that main-memory operations are “free” compared to disk I/O. But in practice, throughput in a well-tuned transaction processing DBMS is typically not I/O-bound. This is achieved in high-end installations by purchasing sufficient disks and RAM so that repeated page requests are absorbed by the buffer pool, and disk I/Os are shared across the disk arms at a rate that can feed the data appetite of all the processors in the system." >> Our real future bottlenecks are: >> >> * ability to handle more than a few hundred connections >> * locking limits on the scalability of writes >> * ability to manage large RAM and data caches > > I do agree that all of those things are problems, FWIW. These seem like our long term problems, to me. [1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf, page 213 -- Peter Geoghegan
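A crude way to see the point in that quote — that bulk in-memory copies are far from free — is to time copies of a buffer much larger than the CPU caches. This is only a rough illustration, not a substitute for hardware counters like perf stat:

```python
import time


def copy_bandwidth_gbps(size_mb=256, reps=5):
    """Estimate single-core memory-copy bandwidth by timing copies of a
    buffer far larger than the last-level cache, so the traffic has to
    go through main memory. Returns GB/s, counting bytes read + written."""
    buf = bytes(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        dst = bytearray(buf)  # bytearray() forces a real copy of the bytes
        best = min(best, time.perf_counter() - start)
        del dst
    return (2 * size_mb / 1024.0) / best
```

On typical hardware this lands at a few GB/s per core, orders of magnitude below what "memory is fast" intuition suggests once many backends compete for the same memory bus.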
On 2015-09-02 00:09, Josh Berkus wrote: > On 09/01/2015 02:29 PM, Tomas Vondra wrote: >> For example assume that there are just two shards in two separate data >> centers, connected by a link with limited bandwidth. Now, let's assume >> you always keep a local replica for failover. So you have A1+A2 in DC1, >> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also >> have to write data to B2 and wait for it. So either you send the data to >> each node separately (consuming 2x the bandwidth), or send it to B1 and >> let it propagate to B2 e.g. through sync rep. >> >> So while you may be right in single-DC deployments, with multi-DC >> deployments the situation is quite different - not only that the network >> bandwidth is not unlimited, but because latencies within DC may be a >> fraction of latencies between the locations (to the extent that the >> increase due to syncrep may be just noise). So the local replication may >> be actually way faster. > > I'm not seeing how the above is better using syncrep than using shard > copying? Shard copying usually assumes that the origin node does the copy - the data has to go twice through the slow connection. With replication you can replicate locally over fast connection. > >> I can imagine forwarding the data between B1 and B2 even with a purely >> sharding solution, but at that point you effectively re-implemented >> syncrep. > > Not really, the mechanism is different and the behavior is different. > One critical deficiency in using binary syncrep is that you can't do > round-robin redundancy at all; every redundant node has to be an exact > mirror of another node. In a good HA distributed system, you want > multiple shards per node, and you want each shard to be replicated to a > different node, so that in the event of node failure you're not dumping > the full load on one other server. 
> This assumes that we use binary replication, but we can reasonably use logical replication which can quite easily do filtering of what's replicated where. >> IMHO the design has to address the multi-DC setups somehow. I think that >> many of the customers who are so concerned about scaling to many shards >> are also concerned about availability in case of DC outages, no? > > Certainly. But users located in a single DC shouldn't pay the same > overhead as users who are geographically spread. > Agreed, so we should support both ways, but I don't think it's necessary to support both ways in version 0.1. It's just important to not paint ourselves into a corner with design decisions that would make one of the ways impossible. >>> If we write to multiple copies as a part of the sharding feature, >>> then that can be parallelized, so that we are waiting only as long as >>> the slowest write (or in failure cases, as long as the shard >>> timeout). Further, we can check for shard-copy health and update >>> shard availability data with each user request, so that the ability >>> to see stale/bad data is minimized. >> >> Again, this assumes infinite network bandwidth. > > In what way is the total network bandwidth used in the system different > for shard copying than for sync replication? > Again, when shards are distributed over multiple DCs (or actually even multiple racks) the bandwidth and latency of the local copy will be much better than those of the remote copy, so local replication can have much lower impact on cluster performance than a remote shard copy will. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
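The bandwidth side of the two-DC example reduces to simple arithmetic; a sketch with made-up strategy names, counting only the bytes that cross the slow inter-DC link:

```python
def cross_dc_bytes(payload_bytes, strategy):
    """Bytes crossing the slow inter-DC link for one write originating
    in DC1 against a shard whose two copies (B1, B2) both live in DC2."""
    if strategy == "direct-copy":
        # The origin node in DC1 writes B1 and B2 itself:
        # the payload crosses the WAN twice.
        return 2 * payload_bytes
    if strategy == "local-rep":
        # The origin writes B1 only; B1 sync-replicates to B2
        # over the fast intra-DC network, so the WAN sees it once.
        return payload_bytes
    raise ValueError("unknown strategy: %s" % strategy)
```

This is why shard-level copying costs double the WAN bandwidth of letting the remote copy propagate locally, even before latency differences are considered.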
> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > >> Uh, we already have a list of things we need to add to FDWs to make them
> > >> work, and Citus Data has provided a document of more things that are
> > >> needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
> > >> you want to confirm that everyone agrees that major FDW improvements are
> > >> a requirement for this.
> > >
> > > Several people saying that the FDW infrastructure isn't sufficient right
> > > now is pretty far from implying that all of them agree that the FDW API
> > > is the way to go.
> > >
> > > I'm not sure myself. If it works out it's going to save us some work and
> > > make it more realistic to get there sometime not too far off. But I'm
> > > afraid that the resulting system will feel like our current partitioning
> > > implemenentation. Yes, it kinda works, but it's hard to get started, it
> > > doesn't support too many features and you're kind afraid your relatives
> > > will see what you've done.
> >
> > I'm not averse to making the "connect to the remote nodes" part of
> > this solution use something other than the FDW infrastructure at some
> > point in time if somebody's prepared to build something better. On
> > the other hand, I think it's extremely clear that the FDW
> > infrastructure has a large amount of potential upon which we have
> > thoroughly failed to capitalize. Patches have already been written
> > for UPDATE/DELETE pushdown and for join pushdown.
Will pushing down writes (Update/Delete) be sufficient to maintain sane locking
behaviour and deadlock detection for writes that span multiple shards?
For example, it could easily be the case that a single Update statement
affects multiple shards and causes a deadlock due to waits across the
nodes. Unless we have some distributed lock manager or some other way to
know about the locks taken across shards, it could be difficult to
detect deadlocks.
On Tue, Sep 1, 2015 at 4:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
>
> The document opens a big question --- when queries can't be processed in
> a traditional top/down fashion, Citus has the goal of sending groups of
> results up the the coordinator, reordering them, then sending them back
> to the shards for further processing, basically using the shards as
> compute engines because the shards are no longer using local data to do
> their computations. The two examples they give are COUNT(DISTINCT) and
> a join across two sharded tables ("CANADA").
>
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing. I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement. I can see FDWs _not_ working well for that use-case.
>
One related point to think about is how we envision handling statement
requests: do we want a centralized coordinator which will process all
requests, or could requests be received by any node?
I think both kinds of systems have their own pros and cons. If we have
a centralized-coordinator kind of system, it might be limited by the
number of simultaneous requests it can handle; if we go the other way
and allow requests to be processed by each individual node, then we
have to think about replicating all the metadata to all nodes.
Collecting statistics about different objects is another thing which
can differ depending on the strategy we choose for handling requests.
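A tiny sketch of the "any node can accept requests" model: if the shard map (the metadata) is replicated to every node, each node can route a statement independently, with no central coordinator on the request path. The map contents and names here are made up for illustration:

```python
import hashlib

# Hypothetical replicated metadata: every node holds an identical copy
# of this shard map, so routing needs no central coordinator.
SHARD_MAP = {0: "node-a", 1: "node-b", 2: "node-c"}


def route(key, shard_map=SHARD_MAP):
    """Deterministically route a request to the node owning the key's
    shard. Any node running this against the same replicated map makes
    the same decision."""
    h = int(hashlib.sha256(str(key).encode("utf-8")).hexdigest(), 16)
    return shard_map[h % len(shard_map)]
```

The trade-off Amit raises is visible here: the routing itself is trivial, but every metadata change (adding a node, moving a shard) must now be propagated consistently to all nodes.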
On 2015-09-02 PM 01:28, Amit Kapila wrote: >> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> >>> I'm not averse to making the "connect to the remote nodes" part of >>> this solution use something other than the FDW infrastructure at some >>> point in time if somebody's prepared to build something better. On >>> the other hand, I think it's extremely clear that the FDW >>> infrastructure has a large amount of potential upon which we have >>> thoroughly failed to capitalize. Patches have already been written >>> for UPDATE/DELETE pushdown and for join pushdown. > > Will pushing down writes (Update/Delete) sufficient to maintain sane locking > behaviour and deadlock detection that can occur during writes on multiple > shards? For example it could easily be the case where a single Update > statement could effect multiple shards and cause deadlock due to waits > across the nodes. Now unless we have some distributed lock manager or > some other way to know the information of locks that happens across > shards, it could be difficult to detect deadlocks. > I wonder if Ashutosh's atomic foreign transactions patch would address any issues inherent in such cases... Thanks, Amit
On 2015/09/02 14:28, Amit Langote wrote:
> On 2015-09-02 PM 01:28, Amit Kapila wrote:
>>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I'm not averse to making the "connect to the remote nodes" part of
>>>> this solution use something other than the FDW infrastructure at some
>>>> point in time if somebody's prepared to build something better. On
>>>> the other hand, I think it's extremely clear that the FDW
>>>> infrastructure has a large amount of potential upon which we have
>>>> thoroughly failed to capitalize. Patches have already been written
>>>> for UPDATE/DELETE pushdown and for join pushdown.
>>
>> Will pushing down writes (Update/Delete) be sufficient to maintain sane
>> locking behaviour and deadlock detection during writes on multiple
>> shards? For example, it could easily be the case that a single Update
>> statement affects multiple shards and causes a deadlock due to waits
>> across the nodes. Unless we have some distributed lock manager or some
>> other way to know about the locks taken across shards, it could be
>> difficult to detect deadlocks.
>
> I wonder if Ashutosh's atomic foreign transactions patch would address any
> issues inherent in such cases...

The UPDATE/DELETE pushdown, which I've proposed, would ensure sane
behaviour for inherited UPDATEs/DELETEs, as the existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.

Best regards,
Etsuro Fujita
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
> On 2015/09/02 14:28, Amit Langote wrote:
>> On 2015-09-02 PM 01:28, Amit Kapila wrote:
>>>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> I'm not averse to making the "connect to the remote nodes" part of
>>>>> this solution use something other than the FDW infrastructure at some
>>>>> point in time if somebody's prepared to build something better. On
>>>>> the other hand, I think it's extremely clear that the FDW
>>>>> infrastructure has a large amount of potential upon which we have
>>>>> thoroughly failed to capitalize. Patches have already been written
>>>>> for UPDATE/DELETE pushdown and for join pushdown.
>>>
>>> Will pushing down writes (Update/Delete) be sufficient to maintain sane
>>> locking behaviour and deadlock detection during writes on multiple
>>> shards? For example, it could easily be the case that a single Update
>>> statement affects multiple shards and causes a deadlock due to waits
>>> across the nodes. Unless we have some distributed lock manager or some
>>> other way to know about the locks taken across shards, it could be
>>> difficult to detect deadlocks.
>>
>> I wonder if Ashutosh's atomic foreign transactions patch would address any
>> issues inherent in such cases...
>
> The UPDATE/DELETE pushdown, which I've proposed, would ensure sane
> behaviour for inherited UPDATEs/DELETEs, as the existing non-pushed-down
> UPDATE/DELETE does, because inheritance_planner guarantees that all
> backends lock inheritance children in the same order to avoid needless
> deadlocks.
Will it be able to do that for row-level locks? Row-level locking occurs
while a row is being updated, so will it be possible to ensure the order
of locks on rows?

Will it handle deadlocks across different table partitions? Consider
a case such as the following:
T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2
T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
On 2015-09-02 PM 03:25, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
>>
>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>> backends lock inheritance children in the same order to avoid needless
>> deadlocks.
>
> Will it be able to do it for row level locks, row level locking occurs
> during updation of a row, so will it be possible to ensure the order of
> locks on rows?
>
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
>
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
>
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it
becoming a concern if parallel shard processing enters the scene. Am I
missing something?

Thanks,
Amit
On 2015/09/02 15:40, Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
>>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>>> backends lock inheritance children in the same order to avoid needless
>>> deadlocks.
>>
>> Will it be able to do it for row level locks, row level locking occurs
>> during updation of a row, so will it be possible to ensure the order of
>> locks on rows?
>>
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1
>
> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

Yeah, I think so, too. Sorry, maybe my explanation above was not enough,
but in inherited UPDATEs/DELETEs, the table modification is also ensured
to be done in the same order. So, as Amit Langote said, both transactions
will do the updates in the same order.

Best regards,
Etsuro Fujita
Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1
>
> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same
order. So T1 and T2 could deadlock.

Yours,
Laurenz Albe
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
> Amit Langote wrote:
>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>> Will it handle deadlocks across different table partitions. Consider
>>> a case as below:
>>>
>>> T1
>>> 1. Updates row R1 of T1 on shard S1
>>> 2. Updates row R2 of T2 on shard S2
>>>
>>> T2
>>> 1. Updates row R2 of T2 on shard S2
>>> 2. Updates row R1 of T1 on shard S1
>
>> As long as shards are processed in the same order in different
>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>> a concern if parallel shard processing enters the scene. Am I missing
>> something?
>
> That would only hold for a single query, right?
>
> If 1. and 2. in the above example come from different queries within one
> transaction, you cannot guarantee that shards are processed in the same order.
>
> So T1 and T2 could deadlock.

Sorry, I failed to see why that would be the case. Could you elaborate?

Thanks,
Amit
On 2015/09/02 16:40, Amit Langote wrote:
> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>> Amit Langote wrote:
>>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>>> Will it handle deadlocks across different table partitions. Consider
>>>> a case as below:
>>>>
>>>> T1
>>>> 1. Updates row R1 of T1 on shard S1
>>>> 2. Updates row R2 of T2 on shard S2
>>>>
>>>> T2
>>>> 1. Updates row R2 of T2 on shard S2
>>>> 2. Updates row R1 of T1 on shard S1
>>
>>> As long as shards are processed in the same order in different
>>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>>> a concern if parallel shard processing enters the scene. Am I missing
>>> something?
>>
>> That would only hold for a single query, right?
>>
>> If 1. and 2. in the above example come from different queries within one
>> transaction, you cannot guarantee that shards are processed in the same order.
>>
>> So T1 and T2 could deadlock.
>
> Sorry, I failed to see why that would be the case. Could you elaborate?

I think Laurenz assumes that the updates 1. and 2. in the above
transactions are performed *in a non-inherited manner*. If that's right,
T1 and T2 could deadlock, but I think we assume here that transactions
run over shards *in an inherited manner*.

Best regards,
Etsuro Fujita
Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same order.
>>>
>>> So T1 and T2 could deadlock.
>
>> Sorry, I failed to see why that would be the case. Could you elaborate?
>
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*. If that's
> right, T1 and T2 could deadlock, but I think we assume here to run
> transactions over shards *in an inherited manner*.

Yes, but does every update affect all shards?

If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?

Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?

But maybe I'm missing something fundamental.

Yours,
Laurenz Albe
On 2015-09-02 PM 05:07, Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same
>>> order.
>>>
>>> So T1 and T2 could deadlock.
>
>> Sorry, I failed to see why that would be the case. Could you elaborate?
>
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*. If that's right,
> T1 and T2 could deadlock, but I think we assume here to run transactions
> over shards *in an inherited manner*.

I think Albe may have a point here...

Even the inherited updates case appears to cause a deadlock if the
updates are in different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);

INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;
<ok>

-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;
<ok>

-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;
<waits>

-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;
<deadlock is detected>

Thanks,
Amit
On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>> We want multiple copies of shards created by the sharding system itself.
>> Having a separate, and completely orthogonal, redundancy system to the
>> sharding system is overly burdensome on the DBA and makes low-data-loss
>> HA impossible.
>
> IMHO it'd be quite unfortunate if the design would make it impossible to
> combine those two features (e.g. creating standbys for shards and
> failing over to them).
>
> It's true that solving HA at the sharding level (by keeping multiple
> copies of a each shard) may be simpler than combining sharding and
> standbys, but I don't see why it makes low-data-loss HA impossible.
Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.
In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.
In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.
Further, if using replication, the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.
If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.
XC (and I guess XL and pgPool II as well) did this by firing the same DML
statement at all the copies after resolving any volatile references
(e.g. now()) in the DML, so that all the copies get the same values. That
method, however, needs some row identifier that can identify the same row
on all the replicas. The primary key is usually used as the row
identifier, but not all use cases that require replicated shards have a
primary key in their sharded tables.
There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit. But I really think that
solving those problems is the only way to go.
Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
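Josh's argument above, that writing to multiple copies of a shard as part of the sharding feature can be parallelized so total latency is bounded by the slowest copy (or a shard timeout), can be sketched as below. This is purely illustrative, not from any patch; `write_copy` is a hypothetical stand-in for a per-copy network write.

```python
# Sketch: fan a write out to all copies of a shard in parallel, wait at
# most `timeout` seconds, and mark copies that missed the deadline as
# bad (the "shard-copy health" bookkeeping Josh mentions).
import concurrent.futures

def write_copy(copy, row):
    # Stand-in for a network write to one copy of the shard.
    return (copy, "ok")

def write_shard(copies, row, timeout=2.0):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(copies)) as ex:
        futs = {ex.submit(write_copy, c, row): c for c in copies}
        done, not_done = concurrent.futures.wait(futs, timeout=timeout)
        for f in done:
            copy, status = f.result()
            results[copy] = status
        # Copies that did not answer in time: candidates for removal
        # from the set of readable copies until they catch up.
        failed = [futs[f] for f in not_done]
    return results, failed

results, failed = write_shard(["copy-a", "copy-b"], {"id": 1})
```

The point of the sketch is only the shape of the latency: the caller waits for the slowest responder or the timeout, never for the sum of all copies.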
On 2015-09-02 PM 06:41, Amit Langote wrote:
>
> I think Albe may have a point here...
>
> Even inherited updates case appears to cause a deadlock if they are in
> different queries. Demonstrated below:
>
> -- setup
> CREATE TABLE t(a int);
> CREATE TABLE t1() INHERITS(t);
> CREATE TABLE t2() INHERITS(t);
>
> INSERT INTO t1 VALUES (1);
> INSERT INTO t2 VALUES (2);
>
> -- in session 1
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 1;
> <ok>
>
> -- in session 2
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 2;
> <ok>
>
> -- back in session 1
> UPDATE t SET a = a + 1 WHERE a = 2;
> <waits>
>
> -- back in session 2
> UPDATE t SET a = a + 1 WHERE a = 1;
> <deadlock is detected>

Which, I now realize, is not the worry Amit Kapila expresses.

The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. A DLM (distributed lock manager/deadlock detector)
seems indeed necessary, as Amit K. suspects.

Thanks,
Amit
On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>
>> I think Albe may have a point here...
>>
>> Even inherited updates case appears to cause a deadlock if they are in
>> different queries. Demonstrated below:
>>
>> -- setup
>> CREATE TABLE t(a int);
>> CREATE TABLE t1() INHERITS(t);
>> CREATE TABLE t2() INHERITS(t);
>>
>> INSERT INTO t1 VALUES (1);
>> INSERT INTO t2 VALUES (2);
>>
>> -- in session 1
>> BEGIN;
>> UPDATE t SET a = a + 1 WHERE a = 1;
>> <ok>
>>
>> -- in session 2
>> BEGIN;
>> UPDATE t SET a = a + 1 WHERE a = 2;
>> <ok>
>>
>> -- back in session 1
>> UPDATE t SET a = a + 1 WHERE a = 2;
>> <waits>
>>
>> -- back in session 2
>> UPDATE t SET a = a + 1 WHERE a = 1;
>> <deadlock is detected>
>
> Which, I now realize, is not the worry Amit Kapila expresses.
>
> The deadlock was *indeed detected* in this case, with all the locks in the
> same PG instance. In a sharded environment with multiple PG instances,
> that becomes tricky. DLM (distributed lock manager/deadlock detector)
> seems indeed necessary as Amit K. suspects.

Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.

Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>
>> Which, I now realize, is not the worry Amit Kapila's expresses.
>>
>> The deadlock was *indeed detected* in this case, with all the locks in the
>> same PG instance. In a sharded environment with multiple PG instances,
>> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>> seems indeed necessary as Amit K. suspects.
>>
>
> Right. XC/XL did not address this issue and they rely on statement timeouts to break distributed deadlocks.
>
I think that will be difficult for the application to decide, and it
would need to decide the same timeout for all statements, which is tricky
because different statements can take different amounts of time. I think
it is better to have a real solution for this problem, so that deadlocks
are detected.
On 2015/09/02 20:42, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>>
>>> Which, I now realize, is not the worry Amit Kapila's expresses.
>>>
>>> The deadlock was *indeed detected* in this case, with all the locks in the
>>> same PG instance. In a sharded environment with multiple PG instances,
>>> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>>> seems indeed necessary as Amit K. suspects.

Ah, you are right.

>> Right. XC/XL did not address this issue and they rely on statement
>> timeouts to break distributed deadlocks.
>
> I think that will be difficult for application to decide and then it
> needs to decide the same for all statements which is tricky because
> different statements could take different time. I think it is better to
> have solution for this problem and deadlock should be detected.

+1

Best regards,
Etsuro Fujita
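The core task of the distributed deadlock detector (the "DLM") the subthread converges on can be sketched as follows. This is not from any patch; it only illustrates why per-node detection is insufficient: each node's local wait-for graph is acyclic, but the union of the graphs contains a cycle.

```python
# Illustrative sketch: merge per-node wait-for edges into one global
# graph and look for a cycle with DFS. Edge (waiter, holder) means
# transaction `waiter` is blocked on a lock held by `holder`.
def find_cycle(edges):
    """Return True if the wait-for graph built from `edges` has a cycle."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)
    visiting, visited = set(), set()

    def dfs(txn):
        visiting.add(txn)
        for nxt in graph.get(txn, []):
            if nxt in visiting:
                return True          # back-edge: cycle (deadlock)
            if nxt not in visited and dfs(nxt):
                return True
        visiting.discard(txn)
        visited.add(txn)
        return False

    return any(dfs(t) for t in graph if t not in visited)

# Neither shard sees a cycle locally, but the union has one:
shard_s1 = [("T1", "T2")]   # on S1, T1 waits for T2
shard_s2 = [("T2", "T1")]   # on S2, T2 waits for T1
assert not find_cycle(shard_s1) and not find_cycle(shard_s2)
assert find_cycle(shard_s1 + shard_s2)
```

The hard part in a real system is not the cycle search but collecting a consistent snapshot of the edges from all nodes, which is exactly what the local-only detector cannot do.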
On Tue, Sep 1, 2015 at 06:11:45PM -0400, Bruce Momjian wrote:
> Let me be clearer about what the Citus Data paper shows. I said originally
> that the data was sent to the coordinator, sorted, then resent to the
> shards, but the document:
>
> https://goo.gl/vJWF85
> https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
>
> has the shards create the groups, and the groups are sent to the other
> shards. For example, to do COUNT(DISTINCT) if you have three shards,
> each shard breaks its data into 3 buckets (1B in size), then the first
> bucket from each of the three shards goes to the first shard, the
> second bucket goes to the second shard, etc.
>
> Basically, they are doing map-reduce, and the shards are creating
> additional batches that get shipped to other shards. I can see FDWs not
> working well in that case, as you are really creating a new data layout
> just for the query. This explains why the XC/XL people are saying they
> would use FDWs if they existed at the time they started development,
> while the Citus Data people are saying they couldn't use FDWs as they
> currently exist. They probably both needed FDW improvements, but I
> think the Citus Data features would need a lot more.

To expand on this: using FDWs, each shard would create a temporary table
on the other shards and send some of its data to those shards. Once a
shard gets all its data from the other shards, it processes the data and
sends the result to the collector. That certainly seems like something
FDWs would not do well. Frankly, I am unclear how Citus Data was able to
do this with only backend hooks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
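The repartitioning scheme Bruce describes can be sketched as below. This is an illustrative reconstruction, not Citus code: each shard hash-buckets its local values, bucket i from every shard is "shipped" to shard i, and each receiving shard counts its distinct values locally; because a given value always hashes to the same bucket, the per-shard counts sum to the global distinct count.

```python
# Sketch of distributed COUNT(DISTINCT) via repartitioning (map-reduce
# style, as in the thread's three-shard example). All names illustrative.
NSHARDS = 3

def bucketize(values, nshards=NSHARDS):
    """Phase 1 (per shard): split local values into one bucket per shard."""
    buckets = [set() for _ in range(nshards)]
    for v in values:
        buckets[hash(v) % nshards].add(v)
    return buckets

def count_distinct(shard_data):
    # Phase 2: "ship" bucket i from every shard to shard i.
    received = [[bucketize(vals)[i] for vals in shard_data]
                for i in range(NSHARDS)]
    # Phase 3: each shard counts distinct values it received;
    # the coordinator just sums the counts.
    return sum(len(set().union(*bkts)) for bkts in received)

shards = [[1, 2, 3], [2, 3, 4], [4, 5]]
total = count_distinct(shards)   # distinct values overall: {1, 2, 3, 4, 5}
```

Note how each shard creates new batches destined for other shards, which is the data layout an FDW, as a per-table access method, has no natural way to express.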
On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> My worry is that if we start implementing them again from scratch, it will
> take a few years before we get them in a usable state. What XC/XL lacked is
> probably a Robert Haas or a Tom Lane who could look at the work and suggest
> major edits. If that had happened, the quality of the product could have
> been much better today. I don't mean to denigrate the developers who worked on
> XC/XL, but there is no harm in accepting that if someone with a much better
> understanding of the whole system was part of the team, that would have
> positively impacted the project. Is that an angle worth exploring? Does it
> make sense to commit some more resources to say XC or XL and try to improve
> the quality of the product even further? To be honest, XL is in far far
> better shape (haven't really tried XC in a while) and some more QA/polishing
> can make it production ready much sooner.
From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be. If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it. If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them. However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL. The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good. The reality is,
also, that horizontal scalability isn't an optional feature. There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core. That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.
Agree with you, Robert.
One lesson we got from XL is that we need a testing framework for
clusters, so that any cluster project at least passes functional and
performance testing. XL was very easy to break, and I wonder how many
corner cases still exist. We tried several other approaches, and while
reading the papers was fun, in practice we found many devilish details
that made a paper be just a paper.
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
>
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication. We don't have that at
present (sync rep is whole-node, and BDR is asynchronous). There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node. In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication? I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly. But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly! Let me explain why I'm so vocal on this point.

PostgresXC didn't deal with redundancy/node replacement at all until
after version 1.0. Then, when they tried to address it, they discovered
that the code was chock full of assumptions that "1 node == 1 shard",
and breaking that assumption would require a total refactor of the code
(which never happened). I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
2. a. How do we allow an out-of-sync node to catch up?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems. Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want. On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing. On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes, as I see it, is that too much
intelligence is relegated to the implementer of the API. There needs to
be a mechanism so that the planner can rewrite the remote query and then
do some after-the-fact processing. This is exactly what citus does: if
you send out AVG(foo), it rewrites that to SUM(foo) and COUNT(foo) so
that the aggregation can be properly weighted in the result. To do this
(distributed OLAP-type processing) right, the planner needs to *know*
that the table is in fact distributed, and also know that it can make
SQL-compatible adjustments to the query.

This strikes me as a bit of a conflict of interest with FDW, which seems
to want to hide the fact that the table is foreign; the FDW
implementation makes its own optimization decisions, which might make
sense for single-table queries but break down in the face of joins.

merlin
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want. On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing. On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API. There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after the fact processing. This exactly what citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that aggregation can be properly weighted to the result. To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes it's own optimization decisions which might make
> sense for single table queries but breaks down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW. In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both parallel
and distributed queries get the benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
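The AVG rewrite Merlin and Robert discuss can be made concrete with a small sketch (illustrative only, not from citus or any patch): each shard returns the partial aggregates SUM and COUNT, and the coordinator combines them, which weights the shards correctly where a naive average-of-averages would not.

```python
# Sketch of rewriting AVG(foo) into per-shard SUM/COUNT partials plus
# a combining step on the coordinator. Data and names are illustrative.
def shard_partial(rows):
    """Per-shard partial aggregate for AVG: (sum, count)."""
    return (sum(rows), len(rows))

def combine_avg(partials):
    """Coordinator: weighted combination of the per-shard partials."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shard1, shard2 = [10.0, 20.0], [30.0]
partials = [shard_partial(shard1), shard_partial(shard2)]
assert combine_avg(partials) == 20.0   # (10 + 20 + 30) / 3
# Naive "average of the shard averages" would give (15.0 + 30.0) / 2 == 22.5,
# which over-weights the smaller shard.
```

As Robert notes, the same SUM/COUNT decomposition is exactly what parallel aggregation needs, which is the argument for putting the rewrite in core rather than in each FDW.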
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote: > Even if it's only on paper, any new sharding design needs to address > these questions: > > 1. How do we ensure no/minimal data is lost if we lose a node? > 2. How do we replace a lost node (without taking the cluster down)? > 2. a. how do we allow an out-of-sync node to "catch up"? > 3. How do we maintain metadata about good/bad nodes (and shard locations)? > 4. How do we add nodes to expand the cluster? > > There doesn't need to be code for all of the above from version 0.1, but > there needs to be a plan to tackle those problems. Otherwise, we'll > just end up with another dead-end, not-useful-in-production technology. This is a good point, and I think I agree with it. Let me make a few observations: 1. None of this stuff matters very much when the data is strictly read-only. You don't lose any data because you made enough copies at some point in the distant past to ensure that you wouldn't. You replace a lost node by taking a new copy. Nodes never need to catch up because there are no changes happening. To bring up a new node, you make a copy of an existing node (which doesn't change in the meantime). So most of these concerns are about how to handle writes. 2. None of this stuff matters when you only have one copy of the data. Your system is low-availability, but you just don't care for whatever reason. The issue arises when you have multiple copies of the data, and the data is being changed. Now, you have to worry about the copies getting out of sync with each other, especially when failures happen. 3. IIUC, Postgres-XC handles this problem by reducing at least volatile functions, maybe all functions, to constants. Then it generates an SQL statement to be sent to the data node to make the appropriate change. If there's more than one copy of the data, we send a separate copy of the SQL statement to every node.
I'm not sure exactly what happens if some of those nodes are not available, but I don't think it's anything good. Fundamentally, this model doesn't allow for many good options in that case. 4. Therefore, I think that we should instead use logical replication, which might be either synchronous or asynchronous. When you modify one copy of the data, that change will then be replicated to all other nodes. If you are OK with eventual consistency, this replication can be asynchronous, and nodes that are off-line will catch up when they are on-line. If you are not OK with that, then you must replicate synchronously to every node before transaction commit; or at least you must replicate synchronously to every node that is currently on-line. This presents some challenges: logical decoding currently can't replicate transactions that are still in process - replication starts when the transaction commits. Also, we don't have any way for synchronous replication to wait for multiple nodes. But in theory those seem like limitations that can be lifted. Also, the GTM needs to be aware that this stuff is happening, or it will DTWT. That too seems like a problem that can be solved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
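Robert's point 4 can be modeled as a toy (hypothetical Python, not a design for the actual mechanism): a commit applies the change to every on-line copy, an off-line copy queues it, and the queued changes are replayed when the copy rejoins:

```python
# Toy model of replicate-to-online-copies commit plus async catch-up.
class Copy:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.applied = []   # changes the copy has applied
        self.backlog = []   # changes queued while off-line

def commit(change, copies):
    """Apply to every on-line copy before acknowledging; off-line
    copies accumulate a backlog to replay later (eventual consistency)."""
    for c in copies:
        if c.online:
            c.applied.append(change)
        else:
            c.backlog.append(change)

def rejoin(copy):
    """An off-line copy catches up by replaying its backlog in order."""
    while copy.backlog:
        copy.applied.append(copy.backlog.pop(0))
    copy.online = True

a, b = Copy("A"), Copy("B")
b.online = False
commit("row1", [a, b])   # B misses the change but it is queued
rejoin(b)                # B catches up
print(a.applied == b.applied)   # True
```

In the synchronous variant Robert describes, `commit` would instead block (or fail) until every copy that must stay consistent has applied the change.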
On 09/02/2015 11:41 AM, Robert Haas wrote: > On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Even if it's only on paper, any new sharding design needs to address >> these questions: >> >> 1. How do we ensure no/minimal data is lost if we lose a node? >> 2. How do we replace a lost node (without taking the cluster down)? >> 2. a. how do we allow an out-of-sync node to "catch up"? >> 3. How do we maintain metadata about good/bad nodes (and shard locations)? >> 4. How do we add nodes to expand the cluster? >> >> There doesn't need to be code for all of the above from version 0.1, but >> there needs to be a plan to tackle those problems. Otherwise, we'll >> just end up with another dead-end, not-useful-in-production technology. > > This is a good point, and I think I agree with it. Let me make a few > observations: > > 1. None of this stuff matters very much when the data is strictly > read-only. Yep. > 2. None of this stuff matters when you only have one copy of the data. > Your system is low-availability, but you just don't care for whatever > reason. Uh-huh. > 3. IIUC, Postgres-XC handles this problem by reducing at least > volatile functions, maybe all functions, to constants. Then it > generates an SQL statement to be sent to the data node to make the > appropriate change. If there's more than one copy of the data, we > send a separate copy of the SQL statement to every node. I'm not sure > exactly what happens if some of those nodes are not available, but I > don't think it's anything good. Fundamentally, this model doesn't > allow for many good options in that case. pg_shard also sends the data to each node, and automatically notices which nodes are not responding and takes them out of availability. There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce volatile functions. For that matter, last I worked on it Greenplum also did multiplexing via the writing node (or via the data loader). 
So this is a popular approach; it has a number of drawbacks, though, of which volatile functions are a major one. > 4. Therefore, I think that we should instead use logical replication, > which might be either synchronous or asynchronous. When you modify > one copy of the data, that change will then be replicated to all other > nodes. If you are OK with eventual consistency, this replication can > be asynchronous, and nodes that are off-line will catch up when they > are on-line. If you are not OK with that, then you must replicate > synchronously to every node before transaction commit; or at least you > must replicate synchronously to every node that is currently on-line. > This presents some challenges: logical decoding currently can't > replicate transactions that are still in process - replication starts > when the transaction commits. Also, we don't have any way for > synchronous replication to wait for multiple nodes. Well, there is a WIP patch for that, which IMHO would be much improved by having a concrete use-case like this one. What nobody is working on -- and we've vetoed in the past -- is a way of automatically failing and removing from replication any node which repeatedly fails to sync, which would be a requirement for this model. You'd also need a way to let the connection nodes know when a replica has fallen behind so that they can be taken out of load-balancing/sharding for read queries. For the synchronous model, that would be "fallen behind at all"; for asynchronous it would be "fallen more than ### behind". > But in theory > those seem like limitations that can be lifted. Also, the GTM needs > to be aware that this stuff is happening, or it will DTWT. That too > seems like a problem that can be solved. Yeah? I'd assume that a GTM would be antithetical to two-stage copying. I'm not a big fan of a GTM at all, frankly; it makes clusters much harder to set up, and becomes a SPOF. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote: >> 4. Therefore, I think that we should instead use logical replication, >> which might be either synchronous or asynchronous. When you modify >> one copy of the data, that change will then be replicated to all other >> nodes. If you are OK with eventual consistency, this replication can >> be asynchronous, and nodes that are off-line will catch up when they >> are on-line. If you are not OK with that, then you must replicate >> synchronously to every node before transaction commit; or at least you >> must replicate synchronously to every node that is currently on-line. >> This presents some challenges: logical decoding currently can't >> replicate transactions that are still in process - replication starts >> when the transaction commits. Also, we don't have any way for >> synchronous replication to wait for multiple nodes. > > Well, there is a WIP patch for that, which IMHO would be much improved > by having a concrete use-case like this one. What nobody is working on > -- and we've vetoed in the past -- is a way of automatically failing and > removing from replication any node which repeatedly fails to sync, which > would be a requirement for this model. Yep. It's clear to me we need that in general, not just for sharding. To me, the key is to make sure there's a way for the cluster-ware to know about the state transitions. Currently, when the synchronous standby changes, PostgreSQL doesn't tell anyone. That's a problem. > You'd also need a way to let the connection nodes know when a replica > has fallen behind so that they can be taken out of > load-balancing/sharding for read queries. For the synchronous model, > that would be "fallen behind at all"; for asynchronous it would be > "fallen more than ### behind". How is that different from the previous thing? Just that we'd treat "lagging" as "down" beyond some threshold? That doesn't seem like a mandatory feature. 
>> But in theory >> those seem like limitations that can be lifted. Also, the GTM needs >> to be aware that this stuff is happening, or it will DTWT. That too >> seems like a problem that can be solved. > > Yeah? I'd assume that a GTM would be antithetical to two-stage copying. I don't think so. If transaction A writes data on X which is replicated to Y and then commits, a new snapshot which shows A as committed can't be used on Y until A's changes have been replicated there. That could be enforced by having the commit of A wait for replication, or by having an attempt by a later transaction to use the snapshot on Y wait until replication completes, or some even more sophisticated strategy that considers whether the replication backlog touches the same data that the new transaction will read. It's complicated, but it doesn't seem intractable. > I'm not a big fan of a GTM at all, frankly; it makes clusters much > harder to set up, and becomes a SPOF. I partially agree. I think it's very important that the GTM is an optional feature of whatever we end up with, rather than an indispensable component. People who don't want it shouldn't have to pay the price in performance and administrative complexity. But at the same time, I think a lot of people will want it, because without it, the fact that sharding is in use is much less transparent to the application. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
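The snapshot rule Robert sketches can be written down as a tiny invariant (illustrative Python with invented names, not PostgreSQL internals): a snapshot that shows transaction A as committed is usable on node Y only after A's changes have been replicated to Y:

```python
# Toy model: a snapshot is safe on a node only when every transaction
# it considers committed has already been replicated to that node.
class Node:
    def __init__(self):
        self.replicated = set()   # ids of transactions whose changes arrived

class Snapshot:
    def __init__(self, committed):
        self.committed = set(committed)

def can_use_snapshot(snapshot, node):
    return snapshot.committed <= node.replicated

y = Node()
snap = Snapshot({"A"})                 # A committed on X, replicating to Y
assert not can_use_snapshot(snap, y)   # must wait, or block the reader
y.replicated.add("A")                  # replication of A's changes completes
assert can_use_snapshot(snap, y)
```

Whether the wait happens at commit time on X or at snapshot-use time on Y is exactly the policy choice Robert describes.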
On 09/02/2015 08:27 PM, Robert Haas wrote: > On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com> > wrote: >> >> This strikes me as a bit of a conflict of interest with FDW which >> seems to want to hide the fact that it's foreign; the FDW >> implementation makes it's own optimization decisions which might >> make sense for single table queries but breaks down in the face of >> joins. +1 to these concerns > Well, I don't think that ALL of the logic should go into the FDW. Then maybe we shouldn't call this "FDW-based sharding" (or "FDW approach" or whatever was used in this thread so far), because that kinda implies that the proposal is to build on FDW. In my mind, FDW is a wonderful tool to integrate PostgreSQL with external data sources, and it's nicely shaped for this purpose, which implies the abstractions and assumptions in the code. The truth however is that many current uses of the FDW API are actually using it for different purposes because there's no other way to do that, not because FDWs are the "right way". And this includes the attempts to build sharding on FDW, I think. Situations like this result in "improvements" of the API that seem to improve the API for the second group, but make life harder for the original FDW API audience by making the API needlessly complex. And I say "seem to improve" because the second group eventually runs into the fundamental abstractions and assumptions the API is based on anyway. And based on the discussions at pgcon, I think this is the main reason why people cringe when they hear "FDW" and "sharding" in the same sentence. I'm not opposed to reusing the FDW infrastructure, of course. > In particular, in this example, parallel aggregate needs the same > query rewrite, so the logic for that should live in core so that > both parallel and distributed queries get the benefit.
I'm not sure the parallel query is a great example here - maybe I'm wrong, but I think it's a fairly isolated piece of code, and we have a pretty clear idea of the two use cases. I'm sure it's non-trivial to design it well for both cases, but I think the questions for FDW/sharding will be much more about abstract concepts than particular technical solutions. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/02/2015 12:30 PM, Robert Haas wrote: > On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> 4. Therefore, I think that we should instead use logical replication, >>> which might be either synchronous or asynchronous. When you modify >>> one copy of the data, that change will then be replicated to all other >>> nodes. If you are OK with eventual consistency, this replication can >>> be asynchronous, and nodes that are off-line will catch up when they >>> are on-line. If you are not OK with that, then you must replicate >>> synchronously to every node before transaction commit; or at least you >>> must replicate synchronously to every node that is currently on-line. >>> This presents some challenges: logical decoding currently can't >>> replicate transactions that are still in process - replication starts >>> when the transaction commits. Also, we don't have any way for >>> synchronous replication to wait for multiple nodes. >> >> Well, there is a WIP patch for that, which IMHO would be much improved >> by having a concrete use-case like this one. What nobody is working on >> -- and we've vetoed in the past -- is a way of automatically failing and >> removing from replication any node which repeatedly fails to sync, which >> would be a requirement for this model. > > Yep. It's clear to me we need that in general, not just for sharding. > To me, the key is to make sure there's a way for the cluster-ware to > know about the state transitions. Currently, when the synchronous > standby changes, PostgreSQL doesn't tell anyone. That's a problem. There are many parts of our replication which are still effectively unmonitorable. For example, there's still no way to tell from the replica that it's lost contact with the master except by tailing the log. If we try to build bigger systems on top of these components, we'll find that we need to add a lot of instrumentation. 
> >> You'd also need a way to let the connection nodes know when a replica >> has fallen behind so that they can be taken out of >> load-balancing/sharding for read queries. For the synchronous model, >> that would be "fallen behind at all"; for asynchronous it would be >> "fallen more than ### behind". > > How is that different from the previous thing? Just that we'd treat > "lagging" as "down" beyond some threshold? That doesn't seem like a > mandatory feature. It's a mandatory feature if you want to load-balance reads. We have to know which nodes not to send reads to because they are out of sync. >> Yeah? I'd assume that a GTM would be antithetical to two-stage copying. > > I don't think so. If transaction A writes data on X which is > replicated to Y and then commits, a new snapshot which shows A as > committed can't be used on Y until A's changes have been replicated > there. That could be enforced by having the commit of A wait for > replication, or by having an attempt by a later transaction to use the > snapshot on Y wait until replication completes, or some even more > sophisticated strategy that considers whether the replication backlog > touches the same data that the new transaction will read. It's > complicated, but it doesn't seem intractable. I need to see this on a chalkboard to understand it. >> I'm not a big fan of a GTM at all, frankly; it makes clusters much >> harder to set up, and becomes a SPOF. > > I partially agree. I think it's very important that the GTM is an > optional feature of whatever we end up with, rather than an > indispensable component. People who don't want it shouldn't have to > pay the price in performance and administrative complexity. But at > the same time, I think a lot of people will want it, because without > it, the fact that sharding is in use is much less transparent to the > application. If it can be optional, then we're pretty close to covering most use cases with one general infrastructure. That would be nice. 
-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote: > 4. Therefore, I think that we should instead use logical replication, > which might be either synchronous or asynchronous. When you modify > one copy of the data, that change will then be replicated to all other > nodes. If you are OK with eventual consistency, this replication can > be asynchronous, and nodes that are off-line will catch up when they > are on-line. If you are not OK with that, then you must replicate > synchronously to every node before transaction commit; or at least you > must replicate synchronously to every node that is currently on-line. > This presents some challenges: logical decoding currently can't > replicate transactions that are still in process - replication starts > when the transaction commits. Also, we don't have any way for > synchronous replication to wait for multiple nodes. But in theory > those seem like limitations that can be lifted. Also, the GTM needs > to be aware that this stuff is happening, or it will DTWT. That too > seems like a problem that can be solved. Can you explain why logical replication is better than binary replication for this use-case? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Sep 2, 2015 at 12:03:36PM -0700, Josh Berkus wrote: > Well, there is a WIP patch for that, which IMHO would be much improved > by having a concrete use-case like this one. What nobody is working on > -- and we've vetoed in the past -- is a way of automatically failing and > removing from replication any node which repeatedly fails to sync, which > would be a requirement for this model. > > You'd also need a way to let the connection nodes know when a replica > has fallen behind so that they can be taken out of > load-balancing/sharding for read queries. For the synchronous model, > that would be "fallen behind at all"; for asynchronous it would be > "fallen more than ### behind". I think this gets back to the idea of running an administrative alert command when we switch to using a different server for synchronous_standby_names. We can't just keep requiring external tooling to identify things that the database knows easily and can send an alert. Removing failed nodes is also something we should do and notify users about. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote: >> 4. Therefore, I think that we should instead use logical replication, >> which might be either synchronous or asynchronous. When you modify >> one copy of the data, that change will then be replicated to all other >> nodes. If you are OK with eventual consistency, this replication can >> be asynchronous, and nodes that are off-line will catch up when they >> are on-line. If you are not OK with that, then you must replicate >> synchronously to every node before transaction commit; or at least you >> must replicate synchronously to every node that is currently on-line. >> This presents some challenges: logical decoding currently can't >> replicate transactions that are still in process - replication starts >> when the transaction commits. Also, we don't have any way for >> synchronous replication to wait for multiple nodes. But in theory >> those seem like limitations that can be lifted. Also, the GTM needs >> to be aware that this stuff is happening, or it will DTWT. That too >> seems like a problem that can be solved. > > Can you explain why logical replication is better than binary > replication for this use-case? Uh, well, for the same reasons it is better in many other cases. Particularly, you probably don't want to replicate all the data on machine A to machine B, just some of it. Typically, sharding solutions store multiple copies of each piece of data. So let's say you have 4 machines. You divide the data into 12 chunks. Each machine is the write-master for 3 of those chunks, but has secondary copies of 3 others.
So maybe things start out like this: machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10 machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11 machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12 machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9 If machine #1 is run over by a rabid triceratops, you can make machine #2 the master for chunk 1, machine #3 the master for chunk 2, and machine #4 the master for chunk 3. The write load therefore remains evenly divided. If you can only copy entire machines, you can't achieve that in this situation. I'm not saying that the above is exactly what we're going to end up with, or even necessarily close. But a big part of the point of sharding is that not all the machines have the same data - otherwise you are not write scaling. But it will frequently be the case, for various reasons, that they have *overlapping* sets of data. Logical replication can handle that; physical replication can't. In Postgres-XC, all tables are either sharded (part of the table is present on each node) or replicated (all of the table is present on every node). Clearly, there's no way to use physical replication in that scenario except if you are OK with having two copies of every node. But that's not a very good solution. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
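Robert's 4-machine, 12-chunk layout and the failover he describes can be written out as a small sketch (hypothetical Python, purely illustrative):

```python
# Chunk placement from the example: each machine masters 3 chunks and
# holds secondary copies of 3 others.
masters = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9], 4: [10, 11, 12]}
copies  = {1: [4, 7, 10], 2: [1, 8, 11], 3: [2, 5, 12], 4: [3, 6, 9]}

def fail_over(dead, masters, copies):
    """For each chunk the dead machine mastered, promote a surviving
    machine that already holds a secondary copy of that chunk."""
    for chunk in masters.pop(dead):
        survivor = next(m for m, cs in copies.items()
                        if m != dead and chunk in cs)
        copies[survivor].remove(chunk)
        masters[survivor].append(chunk)
    copies.pop(dead)

fail_over(1, masters, copies)   # the rabid-triceratops scenario
# Machine 2 picks up chunk 1, machine 3 chunk 2, machine 4 chunk 3,
# so the write load stays evenly divided:
print(masters)   # {2: [4, 5, 6, 1], 3: [7, 8, 9, 2], 4: [10, 11, 12, 3]}
```

This is exactly what physical replication cannot express: the surviving machines each take over a different slice of the dead machine's write load.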
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote: > 3. IIUC, Postgres-XC handles this problem by reducing at least > volatile functions, maybe all functions, to constants. Then it > generates an SQL statement to be sent to the data node to make the > appropriate change. If there's more than one copy of the data, we > send a separate copy of the SQL statement to every node. I'm not sure > exactly what happens if some of those nodes are not available, but I > don't think it's anything good. Fundamentally, this model doesn't > allow for many good options in that case. I don't recall that. Immutable functions are switched to constants in the query sent to datanodes. Volatile and stable functions are evaluated locally after fetching the results from the remote node. Not that efficient for warehouse loads. My 2c. > 4. Therefore, I think that we should instead use logical replication, > which might be either synchronous or asynchronous. When you modify > one copy of the data, that change will then be replicated to all other > nodes. If you are OK with eventual consistency, this replication can > be asynchronous, and nodes that are off-line will catch up when they > are on-line. If you are not OK with that, then you must replicate > synchronously to every node before transaction commit; or at least you > must replicate synchronously to every node that is currently on-line. > This presents some challenges: logical decoding currently can't > replicate transactions that are still in process - replication starts > when the transaction commits. Also, we don't have any way for > synchronous replication to wait for multiple nodes. That's something that the quorum synchronous patch would address. Still, having the possibility to be synchronous across multiple nodes does not seem to be something at the top of the list. > Also, the GTM needs to be aware that this stuff is happening, or it will DTWT. That too seems like a problem that can be solved.
If I understood correctly, yes: with its centralized transaction facility, each node is aware of the transaction status via the global snapshot. -- Michael
On 09/02/2015 03:56 PM, Bruce Momjian wrote: > On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote: >> 4. Therefore, I think that we should instead use logical replication, >> which might be either synchronous or asynchronous. When you modify >> one copy of the data, that change will then be replicated to all other >> nodes. If you are OK with eventual consistency, this replication can >> be asynchronous, and nodes that are off-line will catch up when they >> are on-line. If you are not OK with that, then you must replicate >> synchronously to every node before transaction commit; or at least you >> must replicate synchronously to every node that is currently on-line. >> This presents some challenges: logical decoding currently can't >> replicate transactions that are still in process - replication starts >> when the transaction commits. Also, we don't have any way for >> synchronous replication to wait for multiple nodes. But in theory >> those seem like limitations that can be lifted. Also, the GTM needs >> to be aware that this stuff is happening, or it will DTWT. That too >> seems like a problem that can be solved. > > Can you explain why logical replication is better than binary > replication for this use-case? > Selectivity? JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote: > >Can you explain why logical replication is better than binary > >replication for this use-case? > > > > Selectivity? I was assuming you would just create identical slaves to handle failure, rather than moving selected data around. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Sep 2, 2015 at 09:03:25PM -0400, Robert Haas wrote: > > Can you explain why logical replication is better than binary > > replication for this use-case? > > Uh, well, for the same reasons it is better in many other cases. > Particularly, you probably don't want to replicate all the data on > machine A to machine B, just some of it. > > Typically, sharding solutions store multiple copies of each piece of > data. So let's say you have 4 machines. You divide the data into 12 > chunks. Each machine is the write-master for 2 of those chunks, but > has secondary copies of 3 others. So maybe things start out like > this: > > machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10 > machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11 > machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12 > machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9 > > If machine #1 is run over by a rabid triceratops, you can make machine > #2 the master for chunk 1, machine #3 the master for chunk 2, and > machine #4 the master for chunk 3. The write load therefore remains > evenly divided. If you can only copy entire machines, you can't > achieve that in this situation. I see the advantage of this now. My original idea was that each shard would have its own standby for disaster recovery, but your approach above, which I know is typical, allows the shards to back each other up. You could say shard 2 is the backup for shard 1, but then if shard 1 goes bad, the entire workload of shard 1 goes to shard 2. With the above approach, the load of shard 1 is shared by all the shards. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> > >Can you explain why logical replication is better than binary
> > >replication for this use-case?
> > >
> >
> > Selectivity?
>
> I was assuming you would just create identical slaves to handle failure,
> rather than moving selected data around.
>
Yes, I also think so; otherwise, when the shard goes down and its replica has to take the place of the shard, it will take more time to make the replica available, as it won't have all the data the original shard had.
On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
One lesson from XL we got is that we need testing framework for cluster, so any cluster project should at least pass functional and performance testing.
+1. In early XC days, we focused a lot on adding newer features and supporting as many PG features as possible. That took its toll on the testing and QA. It was a mistake, though my feeling is we tried to correct that to some extent with XL. We did a 9.5 merge, which of course was a big deal, but other than that, more time is being spent on improving stability and performance.
XL was very easy to break and I'm wondering how many corner cases still exist.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> One lesson from XL we got is that we need testing framework for cluster, >> so any cluster project should at least pass functional and performance >> testing. >> > > +1. In early XC days, we focused a lot on adding newer features and > supporting as many PG features as possible. That took its toll on the > testing and QA. It was a mistake though my feeling was we tried to correct > that to some extend with XL. We did a 9.5 merge, which of course was a big > deal, but other than more time is being spent on improving stability and > performance Agreed. Any cluster project needs a cluster testing framework. The pgpool-II project runs a "build farm" which runs cluster regression tests every day. The tests include several versions of pgpool-II and PostgreSQL in combination, using Docker. Still, it needs more tests, but even with the limited test cases it is pretty useful for detecting bugs. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 2015-09-02 19:57, Josh Berkus wrote: > On 09/01/2015 04:14 PM, Petr Jelinek wrote: >> On 2015-09-02 00:09, Josh Berkus wrote: >>> Not really, the mechanism is different and the behavior is different. >>> One critical deficiency in using binary syncrep is that you can't do >>> round-robin redundancy at all; every redundant node has to be an exact >>> mirror of another node. In a good HA distributed system, you want >>> multiple shards per node, and you want each shard to be replicated to a >>> different node, so that in the event of node failure you're not dumping >>> the full load on one other server. >>> >> >> This assumes that we use binary replication, but we can reasonably use >> logical replication which can quite easily do filtering of what's >> replicated where. > > Is there a way to do logical synchronous replication? I didn't think > there was. > Yes, the logical replication has similar syncrep properties as the binary one (feedback works same way). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 3, 2015 at 10:33:12AM +0200, Petr Jelinek wrote:
> On 2015-09-02 19:57, Josh Berkus wrote:
>> On 09/01/2015 04:14 PM, Petr Jelinek wrote:
>>> On 2015-09-02 00:09, Josh Berkus wrote:
>>>> Not really, the mechanism is different and the behavior is
>>>> different. One critical deficiency in using binary syncrep is that
>>>> you can't do round-robin redundancy at all; every redundant node
>>>> has to be an exact mirror of another node. In a good HA distributed
>>>> system, you want multiple shards per node, and you want each shard
>>>> to be replicated to a different node, so that in the event of node
>>>> failure you're not dumping the full load on one other server.
>>>
>>> This assumes that we use binary replication, but we can reasonably
>>> use logical replication which can quite easily do filtering of
>>> what's replicated where.
>>
>> Is there a way to do logical synchronous replication? I didn't think
>> there was.
>
> Yes, the logical replication has similar syncrep properties as the
> binary one (feedback works same way).

Yes, I assumed that. Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too. However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL. This is true of binary and logical replication.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
Hi,

On 09/03/2015 05:02 AM, Amit Kapila wrote:
> On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
>>>> Can you explain why logical replication is better than binary
>>>> replication for this use-case?
>>>
>>> Selectivity?
>>
>> I was assuming you would just create identical slaves to handle
>> failure, rather than moving selected data around.
>
> Yes, I also think so; otherwise, when the shard goes down and its
> replica has to take the place of the shard, it will take more time to
> make the replica available, as it won't have all the data the
> original shard had.

Not really; the idea is that you don't need to create the replica
immediately. The system recognizes that the primary shard location is
unavailable and redirects the tasks to the "replicas," so the time to
recreate the failed node is not that critical.

It needs to be done in a smart way to prevent some typical issues, like
suddenly doubling the load on the replicas due to failure of the
primary location. By using a different group of nodes for each "data
segment" you can eliminate this, because the group of nodes handling
the additional load will be larger.

The other issue then, of course, is that the groups of nodes must not
be entirely random; otherwise the cluster would suffer data loss in
case of an outage of an arbitrary group of K nodes (where K is the
number of replicas for each piece of data).

It's also non-trivial to do this when you have to consider racks, data
centers, etc.

With regular slaves you can't do any of this -- no matter what you do,
you have to load-balance the additional load only on the slaves.

regards

--
Tomas Vondra                    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
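Tomas's placement argument can be illustrated with a small sketch
(hypothetical node names and shard counts, not a real placement
algorithm): strict pairwise mirroring dumps a failed node's whole load
on one peer, while giving each shard its own small copy set fans the
load out over several peers.

```python
from itertools import combinations

nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
K = 2  # copies kept of each shard

# Naive mirroring: nodes are paired off, so n1 is mirrored only by n2.
mirrored = {f"shard{i}": [nodes[2 * (i % 3)], nodes[2 * (i % 3) + 1]]
            for i in range(12)}

# Spread placement: each shard's copy set is a distinct pair of nodes.
pairs = list(combinations(nodes, K))
spread = {f"shard{i}": list(pairs[i % len(pairs)]) for i in range(12)}

def failover_peers(placement, failed):
    """Nodes that pick up load for shards that had a copy on `failed`."""
    peers = set()
    for copies in placement.values():
        if failed in copies:
            peers.update(c for c in copies if c != failed)
    return peers

print(len(failover_peers(mirrored, "n1")))  # 1: n2 absorbs everything
print(len(failover_peers(spread, "n1")))    # 5: load fans out
```

Note the trade-off Tomas flags: fully random copy sets maximize fan-out
but mean that losing *any* K nodes probably loses some shard entirely,
which is why the groups must be chosen deliberately (and with racks and
data centers in mind).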
Josh Berkus <josh@agliodbs.com> wrote:

>>> You'd also need a way to let the connection nodes know when a
>>> replica has fallen behind so that they can be taken out of
>>> load-balancing/sharding for read queries. For the synchronous
>>> model, that would be "fallen behind at all"; for asynchronous it
>>> would be "fallen more than ### behind".
>>
>> How is that different from the previous thing? Just that we'd treat
>> "lagging" as "down" beyond some threshold? That doesn't seem like a
>> mandatory feature.
>
> It's a mandatory feature if you want to load-balance reads. We have
> to know which nodes not to send reads to because they are out of
> sync.

There is another approach to this that we should consider how (if?) we
are going to cover: database affinity. I have seen cases where there
are multiple databases which are targets of asynchronous replication,
with a web application load balancing among them. The application kept
track of which copy each connection was using, so that when they were
not exactly in sync the user never saw "time moving backward". Two
different users might see versions of the data from different points in
time, but that generally doesn't matter, especially if the difference
is just a few minutes. If one copy got too far behind for some reason,
they would load-shift to the other servers (time still moves forward,
only with a "jump" forward at the shift). This would allow the tardy
database to be dedicated to catching up again.

Bottom line: this very smooth behavior required two features -- the
ability for the application to control database affinity, and the
ability to shift that affinity gracefully (with no down time).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
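The affinity behaviour Kevin describes can be sketched in a few lines
(a toy router with invented names and a made-up lag threshold; a real
one would measure lag from replication positions): each session stays
pinned to one replica so its view of time never moves backward, and
affinity shifts, always forward, only when the pinned copy falls too
far behind.

```python
MAX_LAG = 60           # seconds; hypothetical shift threshold

replicas = {"a": 5, "b": 20}   # replica name -> current lag in seconds
affinity = {}                  # session -> pinned replica

def route(session):
    pinned = affinity.get(session)
    if pinned is None or replicas[pinned] > MAX_LAG:
        # (Re)pin to the freshest replica; the session's view of time
        # jumps forward at the shift, but never moves backward.
        pinned = min(replicas, key=replicas.get)
        affinity[session] = pinned
    return pinned

assert route("s1") == "a"   # pinned to the freshest copy
replicas["a"] = 50
assert route("s1") == "a"   # lag grew, but within threshold: stay put
replicas["a"] = 300         # pinned copy falls too far behind
assert route("s1") == "b"   # graceful forward shift to another server
```

Dropping the pinned replica from `replicas` while it catches up would
model the "dedicated to catching up again" part.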
On Thu, Sep 3, 2015 at 03:40:40PM +0200, Tomas Vondra wrote:
> Not really, the idea is that you don't need to create the replica
> immediately. The system recognizes that primary shard location is
> unavailable and redirects the tasks to the "replicas." So the time
> to recreate the failed node is not that critical.
>
> It needs to be done in a smart way to prevent some typical issues
> like suddenly doubling the load on replicas due to failure of the
> primary location. By using different group of nodes for each "data
> segment" you can eliminate this, because the group of nodes to
> handle the additional load will be larger.
>
> The other issue then of course is that the groups of nodes must not
> be entirely random, otherwise the cluster would suffer data loss in
> case of outage of arbitrary group of K nodes (where K is the number
> of replicas for each piece of data).
>
> It's also non-trivial to do this when you have to consider racks,
> data centers etc.
>
> With regular slaves you can't do any of this - no matter what you
> do, you have to load balance the additional load only on the slaves.

Yes, and imagine doing this with FDWs, updating the catalog table
location of the FDW as part of the failover process --- interesting.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, I assumed that. Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.
> However, of course, it is synchronous in being durable, not
> synchronous in terms of applying the WAL. This is true of binary and
> logical replication.

But, Thomas Munro is fixing it!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-09-03 12:10:08 -0400, Robert Haas wrote:
> On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Yes, I assumed that. Logical replication uses WAL, so if you are
>> synchronous with WAL, logical replication is synchronous too.
>> However, of course, it is synchronous in being durable, not
>> synchronous in terms of applying the WAL. This is true of binary and
>> logical replication.

Actually, that's not really true -- it's just a question of which LSNs
you return. For UDR/BDR the relevant LSN is the LSN of the last durably
committed transaction, and thus they wait for apply, not anything else.

> But, Thomas Munro is fixing it!

+ many to that effort.

Greetings,

Andres Freund
On 09/03/2015 03:57 AM, Bruce Momjian wrote:
>> Yes, the logical replication has similar syncrep properties as the
>> binary one (feedback works same way).

Oh? What does UDR/BDR currently support for sync?

> Yes, I assumed that. Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.
> However, of course, it is synchronous in being durable, not
> synchronous in terms of applying the WAL. This is true of binary and
> logical replication.

Well, there's no such thing as simultaneity in scalable architectures.
But users are already used to that ... anybody who load-balances to
read slaves knows about lag. The only way* to ensure near-simultaneity
is to have some kind of single-node, single-process GTM for the
cluster, and then your actual scalability goes bye-bye.

The bigger issue we'll need to address with this is the fight between
lag and load balancing, which would become a much worse issue with
read-load-balanced shards that are transparent to the user. Users would
see the effects of lag without having actually chosen to use this or
that replica. This is the other reason to look at logical replication;
presumably with logrep we can be more discriminating about which
activities cause lag (for one thing, vacuum won't).

Also:

On 09/03/2015 07:00 AM, Kevin Grittner wrote:
> There is another approach to this that we should consider how (if?)
> we are going to cover: database affinity. I have seen cases where
> there are multiple databases which are targets of asynchronous
> replication, with a web application load balancing among them. The
> application kept track of which copy each connection was using, so
> that when they were not exactly in sync the user never saw "time
> moving backward". Two different users might see versions of the data
> from different points in time, but that generally doesn't matter,
> especially if the difference is just a few minutes. If one copy got
> too far behind for some reason, they would load-shift to the other
> servers (time still moves forward, only there is a "jump" forward at
> the shift). This would allow the tardy database to be dedicated to
> catching up again.
>
> Bottom line is that this very smooth behavior required two features
> -- the ability for the application to control database affinity, and
> the ability to shift that affinity gracefully (with no down time).

Yes. Frankly, it would be *easier* to code things so that the same
session always gets its requests load-balanced to the same copies;
making that a feature, too, is nice.

(* there are actually other ways to come close to simultaneity, but
they are much more complicated)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
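The lag-threshold read routing discussed here (exclude a replica from
read load balancing once it has "fallen more than ### behind", with a
threshold of zero for the synchronous model) can be sketched as follows
(illustrative only; the integer "LSNs" stand in for WAL byte positions,
which in a real deployment would come from the server's replication
status views):

```python
def readable_replicas(primary_lsn, replica_lsns, max_lag_bytes):
    """Replicas whose replay position is within max_lag_bytes of the
    primary's current WAL position; only these receive read queries."""
    return [name for name, lsn in replica_lsns.items()
            if primary_lsn - lsn <= max_lag_bytes]

replicas = {"r1": 1000, "r2": 990, "r3": 400}

# Synchronous model: "fallen behind at all" disqualifies a node.
print(readable_replicas(1000, replicas, 0))     # ['r1']

# Asynchronous model: "more than ### behind" disqualifies it.
print(readable_replicas(1000, replicas, 100))   # ['r1', 'r2']
```

The hard part, as the thread notes, is not this arithmetic but getting
the connection nodes timely, trustworthy position reports from every
replica.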
Hey Robert,
> Now the question is, where should the code that does all of this
> live? postgres_fdw? Some new, sharding-specific FDW? In core? I
> don't know for sure, but what I do know is that we could make a lot
> of progress over where we are today by just improving postgres_fdw,
> and I don't think those improvements are even all that difficult. If
> we decide we need to implement something new, it's going to be a huge
> project that will take years to complete, with uncertain results.
> I'd rather have a postgres_fdw-based implementation that is imperfect
> and can't handle some kinds of queries in 9.6 than a promise that by
> 9.9 we'll have something really great that handles MPP perfectly.
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for pg_shard, we thought that Map/Reduce would require a comprehensive revamp of the APIs.
For Citus, a second part of the question concerns us as FDW writers. We
implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers
don't benefit from even the simple join pushdown that doesn't require
Map/Reduce.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these have support for joins: https://wiki.postgresql.org/wiki/Foreign_data_wrappers
Best,
Ozgun
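For readers unfamiliar with the term, the "distributed shuffle" Ozgun
refers to can be sketched as a toy repartitioning step (illustration
only, not an FDW API proposal; real systems stream rows between workers
over the network): to join or aggregate two tables sharded on different
keys, each worker must rehash its rows by the join key and send them to
the worker that owns that hash bucket.

```python
def shuffle(worker_rows, key, n_workers):
    """worker_rows: one list of rows per worker. Returns rows regrouped
    so bucket i holds every row whose key hashes to i."""
    buckets = [[] for _ in range(n_workers)]
    for rows in worker_rows:
        for row in rows:
            buckets[hash(row[key]) % n_workers].append(row)
    return buckets

# Two workers holding parts of a table sharded on some other column:
w0 = [{"cust": 7, "amt": 10}, {"cust": 8, "amt": 20}]
w1 = [{"cust": 7, "amt": 30}]
parts = shuffle([w0, w1], key="cust", n_workers=2)

# After the shuffle, all rows for cust 7 land in one bucket, so a
# purely local join or aggregate per bucket is correct:
target = hash(7) % 2
assert len(parts[target]) == 2
assert sum(r["amt"] for r in parts[target]) == 40
```

The difficulty for an FDW-based design is that this exchange happens
*between* workers mid-plan, which the per-table FDW callback model was
not built to express.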
On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
> Hey Robert,
>
>> Now the question is, where should the code that does all of this
>> live? postgres_fdw? Some new, sharding-specific FDW? In core? I
>> don't know for sure, but what I do know is that we could make a lot
>> of progress over where we are today by just improving postgres_fdw,
>> and I don't think those improvements are even all that difficult.
>> If we decide we need to implement something new, it's going to be a
>> huge project that will take years to complete, with uncertain
>> results. I'd rather have a postgres_fdw-based implementation that is
>> imperfect and can't handle some kinds of queries in 9.6 than a
>> promise that by 9.9 we'll have something really great that handles
>> MPP perfectly.
>
> Distributed shuffles (Map/Reduce) are hard. When we looked at using
> FDWs for pg_shard, we thought that Map/Reduce would require a
> comprehensive revamp of the APIs.
>
> For Citus, a second part of the question is as FDW writers. We
> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers
> don't benefit from even the simple join pushdown that doesn't require
> Map/Reduce.

I didn't get this. The join pushdown infrastructure (chiefly the set of
hooks provided in the join planning paths) is part of 9.5. Isn't that
sufficient to implement join pushdown for the above FDWs? Or are FDW
writers facing problems while implementing those hooks? In either case
that should be reported on -hackers.

> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of
> these have support for joins:
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>
> Best,
> Ozgun
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Monday, September 7, 2015, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
>> Hey Robert,
>>
>>> Now the question is, where should the code that does all of this
>>> live? postgres_fdw? Some new, sharding-specific FDW? In core? I
>>> don't know for sure, but what I do know is that we could make a lot
>>> of progress over where we are today by just improving postgres_fdw,
>>> and I don't think those improvements are even all that difficult.
>>> If we decide we need to implement something new, it's going to be a
>>> huge project that will take years to complete, with uncertain
>>> results. I'd rather have a postgres_fdw-based implementation that
>>> is imperfect and can't handle some kinds of queries in 9.6 than a
>>> promise that by 9.9 we'll have something really great that handles
>>> MPP perfectly.
>>
>> Distributed shuffles (Map/Reduce) are hard. When we looked at using
>> FDWs for pg_shard, we thought that Map/Reduce would require a
>> comprehensive revamp of the APIs.
>>
>> For Citus, a second part of the question is as FDW writers. We
>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers
>> don't benefit from even the simple join pushdown that doesn't
>> require Map/Reduce.
>
> I didn't get this. Join pushdown infrastructure (chiefly the set of
> hooks provided in the join planning paths) is part of 9.5. Isn't that
> sufficient to implement join push-down for the above FDWs? Or are FDW
> writers facing problems while implementing those hooks? In either
> case that should be reported on -hackers.

I don't think any FDW writer (other than postgres_fdw's) has tried to
implement join pushdown in their respective FDWs using the new API.

>> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of
>> these have support for joins:
>> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>>
>> Best,
>> Ozgun
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Ahsan Hadi
Snr Director Product Development
EnterpriseDB Corporation
The Enterprise Postgres Company
Phone: +92-51-8358874
Mobile: +92-333-5162114
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
On Thu, Sep 3, 2015 at 7:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 09/02/2015 11:41 AM, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous. When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes. If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line. If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits. Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one. What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing
> and removing from replication any node which repeatedly fails to sync,
> which would be a requirement for this model.
>
> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries. For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".
I have been thinking about that problem in the context of synchronous_commit = apply, and while trying to review the multiple synchronous patch. How are you supposed to actually make use of remote-apply semantics without a way to find a replica that is consistent? And what does consistent mean? I'm going to say consistency means: it sees at least all effects of all preceding COMMITs that returned successfully. It's trivial in a no-timeout, single standby topology (if commit returned, the sole sync replica has applied your transaction and replied), but beyond that it obviously requires some more infrastructure and concepts. Here is my suggestion:
Imagine if we could configure setups like this:
1. I have 4 servers called london1, london2, paris1, paris2 (see nearby multiple sync server thread/patch).
2. I have synchronous_commit = apply (see nearby thread/patch)
3. Currently london1 is master, with the other 3 in the synchronous replication set, and I want a minimum replication set of size 2 so I can lose 1 of those and continue
So far so good, now for the double-vapourware part:
4. The replicas know whether they are currently part of the master's synchronous replication set or not because it tells them
5. While waiting for replicas, the master only waits up to X milliseconds, and if no reply is received from a given server it drops that server from the sync rep set, like a RAID controller dropping an unresponsive element from a RAID array, but still returns control to the user if 2 nodes (our configured minimum) did reply
6. If the sync rep set reaches our minimum size 2 because of a node being dropped, then you can no longer drop one, so commit hangs (work is blocked until enough servers connect and catch up again)
7. If a replica sees that it hasn't received WAL records or pings from the master with timestamps from the past Y milliseconds, or receives a message explicitly telling it it's been dropped from the sync rep set, it will start rejecting my queries on the basis that it's potentially out of date
8. If the master sees suitable apply reply messages stream in from a node that was dropped but has now caught up (possibly having reconnected), it will explicitly tell it that it's back in the sync rep set and start waiting for it again
9. X is sufficiently larger than Y so that, combined with the 'you've been dropped from/rejoined the sync rep set' messages and well sync'd system clocks, it should not be possible for a replica to expose snapshots that don't include all effects from transactions whose COMMIT command returned on the master. (If you don't have the concept of a dynamic replication set which replicas are dropped from and rejoin, then the master can't drop a node and carry on, like a RAID controller would, unless it's happy to wait for any old N nodes to reply. Waiting for any N nodes to reply may be OK for log-flush-only sync rep, but if you want to use apply sync rep and have guarantees about visibility, and you simply wait for any N nodes to reply, then the nodes themselves don't know if they are up to date with master or not (they don't know if they were one of the ones that master waited for some transaction they haven't even heard about yet), so they may show users old data. Also, if you don't have the master telling the replicas that it considers them to be in or out of the replication set, they don't know exactly when the master decides to consider them in again after they rejoin.)
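Steps 4-9 above can be simulated in a few lines (a deliberately toy
model of this admittedly vapourware design, reusing the example node
names; real code would work with timeouts and acknowledgement messages
rather than in-memory sets):

```python
MIN_SET = 2  # minimum sync rep set size from step 3

def commit(sync_set, acked):
    """sync_set: replicas the master currently waits on. acked: those
    that replied within the X-millisecond window. Drops non-responders
    while the set stays above the minimum, like a RAID controller
    dropping an unresponsive element; raises when a commit must hang
    because no further node may be dropped. Returns the new set."""
    for node in sync_set - acked:
        if len(sync_set) - 1 < MIN_SET:
            raise RuntimeError("commit must wait: sync set at minimum")
        sync_set = sync_set - {node}   # step 5: drop and tell the node
    return sync_set

s = {"london2", "paris1", "paris2"}
s = commit(s, acked={"london2", "paris1"})   # paris2 times out: dropped
assert s == {"london2", "paris1"}

try:
    commit(s, acked={"london2"})  # step 6: can't drop below minimum
    assert False
except RuntimeError:
    pass                          # commit hangs until a node catches up

# Step 8: paris2 catches up and is told it is back in the set.
s = s | {"paris2"}
assert commit(s, acked=s) == s
```

The part the code cannot show is step 7/9's clock discipline: a dropped
replica must start rejecting read queries *before* the master stops
waiting for it, which is what the X-greater-than-Y timing rule buys.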
Now I can connect to any server and specify my requirement to see all committed-on-the-master transactions (meaning: the COMMIT command returned success to the client), and either get an error telling me that it can't guarantee that at that moment (because it has been told it's not in the sync rep set or hasn't heard from the master recently enough), or happily proceed to query the database. I can send my writes to the master node, and do all my reads on any node I like, and be sure they include whatever I just committed (or someone else just committed and told me about). I can also use FDW to query the replicas from the master and know that they can see everything already committed (but of course not uncommitted changes; I'm guessing you get that on the GTM based systems).
The main problem I can see so far with this scheme is that you can see things on the replicas *before* the COMMIT returns. Is that a problem? I suspect all solutions to that problem involve centralised snapshot control (through a GTM or through the master).
The timeout and ping based drop/join idea is inspired by a non-Postgres system that was presented at CHAR(14) last year that is due to be released as open source one day (and whose author egged me on to try implementing that synchronous_commit = apply patch), and I guess loosely RAID. Maybe there is better terminology or a name in database literature for this approach, I'm not sure, and maybe it has unacceptable holes. It's a lot less radical than the GTM/MPP systems, since it just adds a few bells and whistles to the existing single master replication model, and obviously there are plenty more problems to solve to make really useful clustering technology, like master reelection, query routing/node location, load balancing and so forth. But the master and the replicas have the information they need to do that.
Thomas Munro
http://www.enterprisedb.com
On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
> Distributed shuffles (Map/Reduce) are hard. When we looked at using
> FDWs for pg_shard, we thought that Map/Reduce would require a
> comprehensive revamp of the APIs.

Well, so you've said. But what kind of API do you want to see? Taking
control at some very high-level hook like ExecutorRun() is not really a
maintainable solution - it's fine if you've only got one guy doing it,
perhaps, but if you have several FDWs talking to different kinds of
remote systems, they can't all seize overall control.

> For Citus, a second part of the question is as FDW writers. We
> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers
> don't benefit from even the simple join pushdown that doesn't require
> Map/Reduce.
>
> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of
> these have support for joins:
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers

What do you mean by "support for joins"? Do you mean that only 18 of
the remote data sources can do joins? If so, why does that matter? I'd
be quite happy if a join pushdown or "distributed shuffle" API had as
many as 18 users - I'd be quite happy if it had one (postgres_fdw). The
fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-09-08 19:52, Robert Haas wrote:
> On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
>> For Citus, a second part of the question is as FDW writers. We
>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers
>> don't benefit from even the simple join pushdown that doesn't
>> require Map/Reduce.
>>
>> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of
>> these have support for joins:
>> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>
> What do you mean by "support for joins"? Do you mean that only 18 of
> the remote data sources can do joins? If so, why does that matter?
> I'd be quite happy if a join pushdown or "distributed shuffle" API
> had as many as 18 users - I'd be quite happy if it had one
> (postgres_fdw). The fact that not all FDWs can support every
> operation because of limitations on the remote side isn't a reason
> not to support those operations when the remote side is capable.

Agreed. While I personally don't think FDWs are the long-term answer
for sharding, I do believe that the infrastructure being developed for
FDWs (join/aggregate pushdown) is needed anyway, and there are many
common issues that need solving in this area for FDWs, sharding, and
parallel query.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 7, 2015 at 1:33 PM, Ahsan Hadi <ahsan.hadi@enterprisedb.com> wrote:
> On Monday, September 7, 2015, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
>>> Hey Robert,
>>>
>>>> Now the question is, where should the code that does all of this live?
>>>> postgres_fdw? Some new, sharding-specific FDW? In core? I don't
>>>> know for sure, but what I do know is that we could make a lot of
>>>> progress over where we are today by just improving postgres_fdw, and I
>>>> don't think those improvements are even all that difficult. If we
>>>> decide we need to implement something new, it's going to be a huge
>>>> project that will take years to complete, with uncertain results. I'd
>>>> rather have a postgres_fdw-based implementation that is imperfect and
>>>> can't handle some kinds of queries in 9.6 than a promise that by 9.9
>>>> we'll have something really great that handles MPP perfectly.
>>>
>>> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
>>> for pg_shard, we thought that Map/Reduce would require a comprehensive
>>> revamp of the APIs.
>>>
>>> For Citus, a second part of the question is as FDW writers. We
>>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
>>> benefit from even the simple join pushdown that doesn't require
>>> Map/Reduce.
>>
>> I didn't get this. Join pushdown infrastructure (chiefly set of hooks
>> provided in join planning paths) is part of 9.5. Isn't that sufficient to
>> implement join push-down for above FDWs? Or FDW writers are facing
>> problems while implementing those hooks. In either case that should be
>> reported on hackers.
>
> I don't think any FDW writer (other the postgres_fdw) has tried to
> implement join push down in the respective FDW's using the new API.
Well, 'jdbc_fdw2' seems to implement deparsing at some level:

https://github.com/heimir-sverrisson/jdbc2_fdw/blob/master/deparse.c

...but this was likely a copy/paste job from postgres_fdw. It should
not escape notice that the deparsing strategy has dependencies on the
optimizer and the parser. This is not good; asking the FDW
implementations to implement SQL optimizations is not a sustainable
solution. They should be indicating, for example, "I support ANSI-92
SQL", and the Postgres planner should be deparsing the foreign table
definition and rewriting it, not the other way around.

merlin