Thread: Horizontal scalability/sharding

Horizontal scalability/sharding

From
Bruce Momjian
Date:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:

2012-2013:  As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.

Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes.  They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future.  This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.

2014:  I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres.  (I don't remember where the original idea
came from.)  It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high right scale right now, and the fork
maintenance overhead is a huge burden.

I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding.  (We already see some of those
FDW features in 9.5.)

October, 2014:  EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team.  The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.

June, 2015:  I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.

July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.

August, 2015:  While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team.  They have been invited to the September
1 meeting, as have the XC and XL people.

October, 2015:  EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna.   Everyone is invited, but it is hoped
most of the September 1 folks can attend.

February, 2016:  Oleg is planning a similar meeting at their February
Moscow conference.

Anyway, I wanted to explain the work that has been happening around
sharding.  As things move forward, I am increasingly convinced that write
scaling will be needed soon, that the XC approach is the only reasonable
way to do it, and that FDWs are the cleanest way to get it into community
Postgres.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Simon Riggs
Date:
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.

Glad to see it. Many people have been pushing such things for years, so it is good to finally see some debate about this on Hackers.
 
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:

2012-2013:  As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.

What other architectures were discussed? Where was that discussion?
 
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes.  They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future.  This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.

+1
 
2014:  I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres.  (I don't remember where the original idea
came from.)  It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high right scale right now, and the fork
maintenance overhead is a huge burden.

I personally support the view that we should put scalability features into Postgres core, rather than run separate forks.
 
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres

How or why did you realize that? There has never been any such discussion, AFAIK. Surely it can be possible to move required subsystems across?
 
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding.  (We already see some of those
FDW features in 9.5.)

That is a huge presumption. Not discussed or technically analyzed in any way with the community.
 
October, 2014:  EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team.  The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.

June, 2015:  I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.

Cough, cough. You must surely be joking that "most people liked the FDW approach"? How did we measure the acceptance of this approach? 

What actually is the FDW approach? Since its not been written down anywhere, or even explained verbally, how can anyone actually agree to it?
 
July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
 
August, 2015:  While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team.  They have been invited to the September
1 meeting, as have the XC and XL people.

2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards. Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also? I'll pass on to others also.

Koichi Suzuki is arranging a meeting in Hong Long for XC/XL discussions. Presumably EDB is invited also? If Koichi is a leading organizer of this, why are there two meetings?

October, 2015:  EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna.   Everyone is invited, but it is hoped
most of the September 1 folks can attend.
 
February, 2016:  Oleg is planning a similar meeting at their February
Moscow conference.
 
Anyway, I wanted to explain the work that has been happening around
sharding.

Thanks 
  
As things move forward, I am increasingly convinced that write
scaling will be needed soon,

+1
 
that the XC approach is the only reasonable way to do it,
 
and that FDWs are the cleanest way to get it into community
Postgres.

Those two things aren't at all obvious to me.

Please don't presume my opposition. If the technical information were made public, I might understand and agree with "the FDW approach", perhaps others also. 2ndQuadrant is certainly happy to become involved in any team aiming to add features to Postgres core, as long as that makes sense. There may be areas we can all agree upon even if the full architecture remains in doubt.

Before the community commits to a long term venture together we should see the plan. Like all IT projects, expensive failure is possible and the lack of a design is a huge flashing red warning light for me at present. If that requires a meeting of all Developers, why are the meetings for this specifically not happening at the agreed Developer meetings?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Horizontal scalability/sharding

From
Oleg Bartunov
Date:

On Sun, Aug 30, 2015 at 5:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.

Glad to see it. Many people have been pushing such things for years, so it is good to finally see some debate about this on Hackers.
 
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:

2012-2013:  As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.

What other architectures were discussed? Where was that discussion?
 
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes.  They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future.  This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.

+1
 
2014:  I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres.  (I don't remember where the original idea
came from.)  It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high right scale right now, and the fork
maintenance overhead is a huge burden.

I personally support the view that we should put scalability features into Postgres core, rather than run separate forks.
 
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres

How or why did you realize that? There has never been any such discussion, AFAIK. Surely it can be possible to move required subsystems across?
 
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding.  (We already see some of those
FDW features in 9.5.)

That is a huge presumption. Not discussed or technically analyzed in any way with the community.
 
October, 2014:  EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team.  The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.

June, 2015:  I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.

Cough, cough. You must surely be joking that "most people liked the FDW approach"? How did we measure the acceptance of this approach? 

What actually is the FDW approach? Since its not been written down anywhere, or even explained verbally, how can anyone actually agree to it?
 
July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
 

A little correction about Postgres Professional. We are concentrated on idea to have one distributed transaction manager, originally DTM, now we have better name XTM, which is neutral to actual cluster realization. For example, we are testing it with XL, ported to 9.4, but we were planning to extend tests to pg_shard, postgres_fdw. My idea was to have at least XTM committed to 9.6, so all parties could work on their implementation much easier.

 
August, 2015:  While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team.  They have been invited to the September
1 meeting, as have the XC and XL people.

2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards. Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also? I'll pass on to others also.

Koichi Suzuki is arranging a meeting in Hong Long for XC/XL discussions. Presumably EDB is invited also? If Koichi is a leading organizer of this, why are there two meetings?

October, 2015:  EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna.   Everyone is invited, but it is hoped
most of the September 1 folks can attend.
 
February, 2016:  Oleg is planning a similar meeting at their February
Moscow conference.
 
Anyway, I wanted to explain the work that has been happening around
sharding.

Thanks 
  
As things move forward, I am increasingly convinced that write
scaling will be needed soon,

+1
 
that the XC approach is the only reasonable way to do it,
 
and that FDWs are the cleanest way to get it into community
Postgres.

Those two things aren't at all obvious to me.

Please don't presume my opposition. If the technical information were made public, I might understand and agree with "the FDW approach", perhaps others also. 2ndQuadrant is certainly happy to become involved in any team aiming to add features to Postgres core, as long as that makes sense. There may be areas we can all agree upon even if the full architecture remains in doubt.

Before the community commits to a long term venture together we should see the plan. Like all IT projects, expensive failure is possible and the lack of a design is a huge flashing red warning light for me at present. If that requires a meeting of all Developers, why are the meetings for this specifically not happening at the agreed Developer meetings?


At PGCon we agreed to have such meeting in Vienna at least. But I think we should be prepared and try to clean all our issues before. It looks like we already out of time,but probably we could meet in Hong Kong ? 

Honestly, I still don't know which approach is better, we already played with XL (ported to 9.4)  and identified some very strong issues with inconsistency, which scared us, especially taking into account how easy we found them. XC people have fixed them, but I'm not sure if they were fundamental and if we could construct more sophisticated tests and find more issues in XC/XL. We also a bit disappointed by Huawei position about CSN patch, we hoped to use for  our XTM.  FDW approach has been actively criticized by pg_shard people and that's also made me a bit suspicious.  It looks like  we are doomed to continue several development forks, so we decided to work on very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6.  Now I see we were right, unfortunately.  

Again, could we organize meeting somewhere in September ?  US is not good for us, but other places should be ok. I want to have an agreement  at least on XTM. We still are testing various approaches, though. We could present results of our experiments and are open to discussion. It's not easy project, but it's something we could do for 9.6.

I'm very glad Bruce started this discussion in -hackers, since it's silly to me to participate in both threads :)  Let's meet in September !

 
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
> On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     I have recently increased my public statements about the idea of adding
>     horizontal scaling/sharding to Postgres.
>
> Glad to see it. Many people have been pushing such things for years, so it is
> good to finally see some debate about this on Hackers.

Agreed.  Right now, in our community, we are only seeing users who are
happy with what Postgres offers but think they might need massive
horizontal scalability in the future.  I think there is a larger group
that cares about massive horizontal scalability, but those people are
using other software right now, so we don't see them yet.

Without a roadmap for built-in massive horizontal scalability, I think
Postgres adoption will eventually suffer.

>     I wanted to share with hackers
>     a timeline of how we got here, and where I think we are going in the
>     short term:
> 
>     2012-2013:  As part of writing my scaling talk
>     (http://momjian.us/main/presentations/overview.html#scaling), studying
>     Oracle RAC, and talking to users, it became clear that an XC-like
>     architecture (sharding) was the only architecture that was going to allow
>     for write scaling.
> 
> 
> What other architectures were discussed? Where was that discussion?

That was mostly my conclusion.  I explained it to small groups at
conferences and Postgres user groups.  No one said I was wrong, but that
is about the level of debate I had.

>     2014:  I started to shop around the idea that we could use FDWs,
>     parallelism, and a transaction/snapshot manager to get XC features
>     as built-in to Postgres.  (I don't remember where the original idea
>     came from.)  It was clear that having separate forks of the source code
>     in XC and XL was never going to achieve critical mass --- there just
>     aren't enough people who need high right scale right now, and the fork
>     maintenance overhead is a huge burden.
> 
> 
> I personally support the view that we should put scalability features into
> Postgres core, rather than run separate forks.

Good, I do think it is time, but as I stated above, there is limited
interest in our current community, so the tolerance for additional
community code to accomplish this is also limited.  This is the big
thing that had me excited about using FDWs --- FDW improvements can get
us closer to sharding without requiring community acceptance of
sharding-only features.

>     I realized that we would never get community acceptance to dump the XC
>     (or XL) code needed for sharding into community Postgres
> 
> 
> How or why did you realize that? There has never been any such discussion,
> AFAIK. Surely it can be possible to move required subsystems across?

Well, I have had many such discussions with XC/XL folks, and that was my
opinion.  I have seen almost no public discussion about this because the
idea had almost no chance of success.  If it was possible, someone would
have already suggested it on this list.

>     , but with FDWs,
>     we could add the features as _part_ of improving FDWs, which would benefit
>     FDWs _and_ would be useful for sharding.  (We already see some of those
>     FDW features in 9.5.)
> 
> 
> That is a huge presumption. Not discussed or technically analyzed in any way
> with the community.

True.  It seemed pretty obvious to me.

>     October, 2014:  EDB and NTT started working together in the community
>     to start improving FDWs as a basis for an FDW-based sharding solution.
>     Many of the 9.5 FDW improvements that also benefit sharding were developed
>     by a combined EDB/NTT team.  The features improved FDWs independent of
>     sharding, so they didn't need community buy-in on sharding to get them
>     accepted.
> 
>     June, 2015:  I attended the PGCon sharding unconference session and
>     there was a huge discussion about where we should go with sharding.
>     I think the big take-away was that most people liked the FDW approach,
>     but had business/customer reasons for wanting to work on XC or XL because
>     those would be production-ready faster.
> 
> 
> Cough, cough. You must surely be joking that "most people liked the FDW
> approach"? How did we measure the acceptance of this approach? 

Well, I didn't have my audience-meter with me at the time.  ;-)

The discussion was mostly in the hallway after the unconference session,
"Future of PostgreSQL shared-nothing cluster" by Konstantin Knizhnik,
Alexander Korotkov, and Oleg Bartunov.  Again, when I explained the
ability to use FDWs to get sharding into Postgres with minimal
additional code, no one said the idea was crazy, which I took as a big
thumbs-up!  When I asked why to continue with XC/XL, I was told those
were more mature and more customer-ready, which is true.  I will not
quote people from the from the hallway discussion for privacy reasons.

> What actually is the FDW approach? Since its not been written down anywhere, or
> even explained verbally, how can anyone actually agree to it?

Well, my sharding talk just has the outlines of an approach.  I think
there are five broad segments:

*  FDW push-down of joins, sorts, aggregates
*  ability to send FDW requests in parallel
*  transaction/snapshot manager to allow ACID transctions on shards
*  simpler user partitioning API
*  infrastructure to manage shards, including replicated tables used for joins

>     July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
>     started to think about joining the FDW approach, rather than working on
>     XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
>     meeting is scheduled for September 1.
>
>     August, 2015:  While speaking at SFPUG, Citus Data approached me about
>     joining the FDW sharding team.  They have been invited to the September
>     1 meeting, as have the XC and XL people.
> 
> 
> 2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards.

Yes, I saw the blog post about that:
http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/

> Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also?
> I'll pass on to others also.

OK, I will send you a separate email and you can then supply their email
addresses.

> Koichi Suzuki is arranging a meeting in Hong Long for XC/XL discussions.
> Presumably EDB is invited also? If Koichi is a leading organizer of this, why
> are there two meetings?

I certainly have heard nothing about it, except third-hand people
telling me a meeting is happening.  I assumed those meetings where
XC/XL-specific.

>     that the XC approach is the only reasonable way to do it,
>     and that FDWs are the cleanest way to get it into community
>     Postgres.
> 
> Those two things aren't at all obvious to me.
>
> Please don't presume my opposition. If the technical information were made
> public, I might understand and agree with "the FDW approach", perhaps others
> also.

Well, the beauty of my approach is that we didn't need any technical
direction or buy-in on sharding from the community to improve FDWs.  I
think now is the right time to try to get that buy-in, or adjust our
approach.

There isn't really much more to my _analysis_ than I presented.  There
is certainly a lot more work to do to even decide this is the right
approach.  Some of the groups already involved have more experience in
trying this, e.g. Citus Data.

> 2ndQuadrant is certainly happy to become involved in any team aiming to
> add features to Postgres core, as long as that makes sense. There may be areas
> we can all agree upon even if the full architecture remains in doubt.

Right.

> Before the community commits to a long term venture together we should see the
> plan. Like all IT projects, expensive failure is possible and the lack of a
> design is a huge flashing red warning light for me at present. If that requires
> a meeting of all Developers, why are the meetings for this specifically not
> happening at the agreed Developer meetings?

Well, what meetings should it be at?  I don't think there was clear
enough direction for the June 2015 PGCon meeting.  Is there an
unconference in Vienna?  One thing I saw at the last PGCon is that this
is a big topic, so I think having a dedicated room and 3-hour slot for
it is nice.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Sun, Aug 30, 2015 at 10:36:23PM +0300, Oleg Bartunov wrote:
> Honestly, I still don't know which approach is better, we already played with
> XL (ported to 9.4)  and identified some very strong issues with inconsistency,
> which scared us, especially taking into account how easy we found them. XC
> people have fixed them, but I'm not sure if they were fundamental and if we
> could construct more sophisticated tests and find more issues in XC/XL. We also
> a bit disappointed by Huawei position about CSN patch, we hoped to use for  our
> XTM.  FDW approach has been actively criticized by pg_shard people and that's
> also made me a bit suspicious. 

Yep, that has me concerned too.  The pg_shard people will be on the
September 1 call and are working on a Google document to explain their
concerns about FDWs for sharding.

> It looks like  we are doomed to continue
> several development forks, so we decided to work on very important common
> project, XTM, which we hoped could be accepted by all parties and eventually
> committed to 9.6.  Now I see we were right, unfortunately.  

Yes, the ability to add independent parts that can eventually be used
for sharding is a strong indication that doing this incrementally is a
good approach.

> Again, could we organize meeting somewhere in September ?  US is not good for
> us, but other places should be ok. I want to have an agreement  at least on
> XTM. We still are testing various approaches, though. We could present results
> of our experiments and are open to discussion. It's not easy project, but it's
> something we could do for 9.6.

Good.  XTM is a must-have for several use-cases, including sharding.

> I'm very glad Bruce started this discussion in -hackers, since it's silly to me
> to participate in both threads :)  Let's meet in September !

In summary, I think we need to start working on built-in sharding, and
FDWs are the only way I can see to do it with minimal code changes,
which I think might be a community requirement.  It might not work, but
right now, it is the only possible approach I can see.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Michael Paquier
Date:


On Mon, Aug 31, 2015 at 7:29 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
>     I realized that we would never get community acceptance to dump the XC
>     (or XL) code needed for sharding into community Postgres
>
>
> How or why did you realize that? There has never been any such discussion,
> AFAIK. Surely it can be possible to move required subsystems across?

Well, I have had many such discussions with XC/XL folks, and that was my
opinion.  I have seen almost no public discussion about this because the
idea had almost no chance of success.  If it was possible, someone would
have already suggested it on this list.

Or perhaps people invested in this area had other obligations or lacked motivation and/or time to work to push up for things in core. That's not possible to know, and what is done is done.
 
>     July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
>     started to think about joining the FDW approach, rather than working on
>     XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
>     meeting is scheduled for September 1.
>
>     August, 2015:  While speaking at SFPUG, Citus Data approached me about
>     joining the FDW sharding team.  They have been invited to the September
>     1 meeting, as have the XC and XL people.
>
>
> 2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards.

Yes, I saw the blog post about that:

        http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/

> Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also?
> I'll pass on to others also.

OK, I will send you a separate email and you can then supply their email
addresses.

FWIW, I would be interested in that as well. I worked in this area of things for a couple of years as well FWIW.
 
> Koichi Suzuki is arranging a meeting in Hong Long for XC/XL discussions.
> Presumably EDB is invited also? If Koichi is a leading organizer of this, why
> are there two meetings?

I certainly have heard nothing about it, except third-hand people
telling me a meeting is happening.  I assumed those meetings where
XC/XL-specific.

Yep, that's my understanding as well and AFAIK as I know things have been carried this way until now, aka XC/XL and Postgres core are aimed to live as separate communities.
--
Michael

Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
>     Well, I have had many such discussions with XC/XL folks, and that was my
>     opinion.  I have seen almost no public discussion about this because the
>     idea had almost no chance of success.  If it was possible, someone would
>     have already suggested it on this list.
> 
> 
> Or perhaps people invested in this area had other obligations or lacked
> motivation and/or time to work to push up for things in core. That's not
> possible to know, and what is done is done.

Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.  

Of course, people have concerns, and FDWs might need to be improved, but
it is something worth researching.  We might find out FDWs can't be used
at all, and that we have to either add much more code to Postgres to do
sharding, do something like pg_shard, or not implement built-in sharding
at all, but at least it is time to research this.

>     OK, I will send you a separate email and you can then supply their email
>     addresses.
> 
> 
> FWIW, I would be interested in that as well. I worked in this area of things
> for a couple of years as well FWIW.

OK, I will send you an email.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> >     Well, I have had many such discussions with XC/XL folks, and that was my
> >     opinion.  I have seen almost no public discussion about this because the
> >     idea had almost no chance of success.  If it was possible, someone would
> >     have already suggested it on this list.
> > 
> > 
> > Or perhaps people invested in this area had other obligations or lacked
> > motivation and/or time to work to push up for things in core. That's not
> > possible to know, and what is done is done.
> 
> Well, I have talked to everyone privately about this, and concluded that
> while horizontal scalability/sharding is useful, it is unlikely that the
> code volume of something like XC or XL would be accepted into the
> community, and frankly, now that we have FDWs, it is hard to imagine why
> we would _not_ go in the FDW direction.  

Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Michael Paquier
Date:
On Mon, Aug 31, 2015 at 11:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> >     Well, I have had many such discussions with XC/XL folks, and that was my
> >     opinion.  I have seen almost no public discussion about this because the
> >     idea had almost no chance of success.  If it was possible, someone would
> >     have already suggested it on this list.
> >
> >
> > Or perhaps people invested in this area had other obligations or lacked
> > motivation and/or time to work to push up for things in core. That's not
> > possible to know, and what is done is done.
>
> Well, I have talked to everyone privately about this, and concluded that
> while horizontal scalability/sharding is useful, it is unlikely that the
> code volume of something like XC or XL would be accepted into the
> community, and frankly, now that we have FDWs, it is hard to imagine why
> we would _not_ go in the FDW direction.

If I recall correctly in terms of numbers, that's indeed 40k of code,
the main areas of XC code being the GTM, the planner changes for
expression and join push down, and the connection pooler for parallel
query execution.

ISTM that FDW is a portion of the puzzle, there are other pieces that
could be used toward an in-core integration, like the parallel stuff
Amit Kapila is working on to allow remote query execution in parallel
of local scans. Also, XC/XL were performing well on OLTP thanks to the
connection pooler: this should indeed be part of the FDW portion
managing the foreign scans. This may sound like a minor issue compared
to the others, but already established connections help a lot when
scaling out with foreign servers.

> Of course, people have concerns, and FDWs might need to be improved, but
> it is something worth researching.  We might find out FDWs can't be used
> at all, and that we have to either add much more code to Postgres to do
> sharding, do something like pg_shard, or not implement built-in sharding
> at all, but at least it is time to research this.

I am really looking forward to hearing the arguments of the authors of
pg_shard on the matter.

> >     OK, I will send you a separate email and you can then supply their email
> >     addresses.
> >
> >
> > FWIW, I would be interested in that as well. I worked in this area of things
> > for a couple of years as well FWIW.
>
> OK, I will send you an email.

Thanks.
-- 
Michael



Re: Horizontal scalability/sharding

From
Michael Paquier
Date:
On Mon, Aug 31, 2015 at 11:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
>> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
>> >     Well, I have had many such discussions with XC/XL folks, and that was my
>> >     opinion.  I have seen almost no public discussion about this because the
>> >     idea had almost no chance of success.  If it was possible, someone would
>> >     have already suggested it on this list.
>> >
>> >
>> > Or perhaps people invested in this area had other obligations or lacked
>> > motivation and/or time to work to push up for things in core. That's not
>> > possible to know, and what is done is done.
>>
>> Well, I have talked to everyone privately about this, and concluded that
>> while horizontal scalability/sharding is useful, it is unlikely that the
>> code volume of something like XC or XL would be accepted into the
>> community, and frankly, now that we have FDWs, it is hard to imagine why
>> we would _not_ go in the FDW direction.
>
> Actually, there was hope that XC or XL would get popular enough that it
> would justify adding their code into community Postgres, but that never
> happened.

Forks are aimed to die without proper maintenance resources. Still,
for XC/XL, what does not help is the complication of the architecture
and SPOF management, particularly thinking with the GTM that was
something completely new and not well understood (there is a GTM
standby but this model is weak IMO and does not scale similarly to
what you get with standbys, and impacts the overall performance of the
cluster).
-- 
Michael



Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
On Sun, Aug 30, 2015 at 7:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> I have recently increased my public statements about the idea of adding
> horizontal scaling/sharding to Postgres. I wanted to share with hackers
> a timeline of how we got here, and where I think we are going in the
> short term:
>
> 2012-2013:  As part of writing my scaling talk
> (http://momjian.us/main/presentations/overview.html#scaling), studying
> Oracle RAC, and talking to users, it became clear that an XC-like
> architecture (sharding) was the only architecture that was going to allow
> for write scaling.
>

I think sharding like architecture is quite useful for certain kind of workloads
where users can manage to arrange queries and data layout in an optimized
way which I hope users might agree to change if required.  One thing to
consider here is what kind of scaling are we expecting in such a system and
is it sufficient considering we will keep focussed on this architecture for
horizontal scalability?
Generally speaking, the scaling in such systems is limited by the number of
profitable partitions user can create based on data and then cross-partition
transactions sucks the performance/scalability in such systems.  I
understand that there is definitely a benefit in proceeding with sharding like
architecture as there are already some PostgreSQL based forks which uses
such architecture, so if we follow same way, we can save some effort rather
than inventing or following some other architecture, however there is no harm
is discussing pros and cons of some other architectures like Oracle RAC,
Google F1 or others.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Ashutosh Bapat
Date:

At PGCon we agreed to have such meeting in Vienna at least. But I think we should be prepared and try to clean all our issues before. It looks like we already out of time,but probably we could meet in Hong Kong ? 

Honestly, I still don't know which approach is better, we already played with XL (ported to 9.4)  and identified some very strong issues with inconsistency, which scared us, especially taking into account how easy we found them. XC people have fixed them, but I'm not sure if they were fundamental and if we could construct more sophisticated tests and find more issues in XC/XL. We also a bit disappointed by Huawei position about CSN patch, we hoped to use for  our XTM.  FDW approach has been actively criticized by pg_shard people and that's also made me a bit suspicious.  It looks like  we are doomed to continue several development forks, so we decided to work on very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6.  Now I see we were right, unfortunately.

Distributed transaction manager should support at least three things
1. Atomic commit
2. Atomic visibility
3. Consistent snapshots (e.g. required for repeatable reads and higher isolation levels).

I have submitted patch for implementing first for FDWs. The patch adds infrastructure to be used by all FDWs including postgres_fdw. It also adds postgres_fdw code to use this infrastructure. The same can be used to achieve atomic commit in postgres_fdw based sharding. Please see if XTM can benefit from it. If there are things that are required by XTM, please post the requirements on that thread and I will work on those. You can find the latest patch at http://www.postgresql.org/message-id/CAFjFpRfANWL53+x2HdM9TCNe5pup=oPkQSSJ-KGfr-d2efj+CQ@mail.gmail.com
 

Again, could we organize meeting somewhere in September ?  US is not good for us, but other places should be ok. I want to have an agreement  at least on XTM. We still are testing various approaches, though. We could present results of our experiments and are open to discussion. It's not easy project, but it's something we could do for 9.6.

I'm very glad Bruce started this discussion in -hackers, since it's silly to me to participate in both threads :)  Let's meet in September !

 
--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Horizontal scalability/sharding

From
Oleg Bartunov
Date:


On Mon, Aug 31, 2015 at 5:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
> >     Well, I have had many such discussions with XC/XL folks, and that was my
> >     opinion.  I have seen almost no public discussion about this because the
> >     idea had almost no chance of success.  If it was possible, someone would
> >     have already suggested it on this list.
> >
> >
> > Or perhaps people invested in this area had other obligations or lacked
> > motivation and/or time to work to push up for things in core. That's not
> > possible to know, and what is done is done.
>
> Well, I have talked to everyone privately about this, and concluded that
> while horizontal scalability/sharding is useful, it is unlikely that the
> code volume of something like XC or XL would be accepted into the
> community, and frankly, now that we have FDWs, it is hard to imagine why
> we would _not_ go in the FDW direction.

Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.

AFAIK, XC/XL has already some customers and that is an additional pressure on their development team, which is now called X2. I don't exactly know how internal Huawei's MPPDB is connected to XC/XL. 

We need community test suite for cluster and our company is working on this. It's non-trivial work, but community will never accepts any cluster solution without thorough testing of functionality and performance. Our XC/XL experience was not good.

 

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Horizontal scalability/sharding

From
Mason S
Date:

 We also a bit disappointed by Huawei position about CSN patch, we hoped to use for  our XTM. 

Disappointed in what way? Moving to some sort of CSN approach seems to open things up for different future ideas. In the short term, it would mean replacing potentially large snapshots and longer visibility checks. In the long term, perhaps CSN could help simplify the design of multi-master replication schemes. 
 
FDW approach has been actively criticized by pg_shard people and that's also made me a bit suspicious.  It looks like  we are doomed to continue several development forks, so we decided to work on very important common project, XTM, which we hoped could be accepted by all parties and eventually committed to 9.6.  Now I see we were right, unfortunately.  


I think the original XC project probably would have taken the FDW approach as a basis if it had existed, with focus on push-down optimizations. 

I assume that future work around PG sharding probably would be more likely to be accepted with the FDW approach. One could perhaps work on pushing down joins, aggregates and order by, then look at any optimizations gained if code is moved outside of FDW.  It would make sense if some kind of generic optimization for foreign tables for SQL-based sources could be leveraged across all databases, rather than having to re-implement for each FDW.

There are different approaches and related features that may need to be improved.

Do we want multiple copies of shards, like the pg_shard approach? Or keep things simpler and leave it up to the DBA to add standbys? 

Do we want to leverage table inheritance? If so, we may want to spend time improving performance for when the number of shards becomes large with what currently exists. If using table inheritance, we could add the ability to specify what node (er, foreign server) the subtable lives on. We could create top level sharding expressions that allow these to be implicitly created.

Should we allow arbitrary expressions for shards, not just range, list and hash?

Maybe the most community-acceptable approach would look something like

- Use FDWs, and continue to optimize push-down operations, also for non-PostgreSQL databases.

- Use table inheritance for defining the shards. Ideally allow for specifying that some shards may be replicated to other foreign servers (and itself) (for pushing down joins with lookup/static tables; at this point it should be decent for star schema based data warehouses). 

- XTM/GTM hooks. Preferably we move to CSN for snapshots in core PostgreSQL though.

Longer term, efficient internode joins would require a lot more work.

The devil is in the details. There are things that have to be addressed, for example, if using global XIDs via GTM, not every transaction is on every node, so we need to make sure that new clog pages get added properly.  There is also the potential to require a lot more code to be added, like for cursor handling and stored functions. Perhaps some limitations when using shards to foreign servers are acceptable if it is desired to minimize code changes.  XC and XL code help.

Regards,

Mason

Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Mon, Aug 31, 2015 at 02:48:31PM -0400, Mason S wrote:
> I assume that future work around PG sharding probably would be more likely to
> be accepted with the FDW approach. One could perhaps work on pushing down
> joins, aggregates and order by, then look at any optimizations gained if code
> is moved outside of FDW.  It would make sense if some kind of generic
> optimization for foreign tables for SQL-based sources could be leveraged across
> all databases, rather than having to re-implement for each FDW.
> 
> There are different approaches and related features that may need to be
> improved.
> 
> Do we want multiple copies of shards, like the pg_shard approach? Or keep
> things simpler and leave it up to the DBA to add standbys? 

I agree with all of the above.

> Do we want to leverage table inheritance? If so, we may want to spend time
> improving performance for when the number of shards becomes large with what
> currently exists. If using table inheritance, we could add the ability to
> specify what node (er, foreign server) the subtable lives on. We could create
> top level sharding expressions that allow these to be implicitly created.
> 
> Should we allow arbitrary expressions for shards, not just range, list and
> hash?
> 
> Maybe the most community-acceptable approach would look something like

I think everyone agrees that our current partitioning setup is just too
verbose and error-prone for users, and needs a simpler interface, and
one that can be better optimized internally.  I assume FDW-based
sharding will benefit from that work as well.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Alexander Korotkov
Date:
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:

 We also a bit disappointed by Huawei position about CSN patch, we hoped to use for  our XTM. 

Disappointed in what way? Moving to some sort of CSN approach seems to open things up for different future ideas. In the short term, it would mean replacing potentially large snapshots and longer visibility checks. In the long term, perhaps CSN could help simplify the design of multi-master replication schemes. 

We are disappointed because at PGCon talk Huawei announced publishing of their CSN patch and further work in this direction together with community. However, it's even not published yet despite all the promises. Nobody from Huawei answers CSN thread in the hackers.
So, I think we got nothing from Huawei except teasing and should rely only on ourselves. That is disappointing.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Horizontal scalability/sharding

From
Qingqing Zhou
Date:
On Mon, Aug 31, 2015 at 2:12 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
>
> AFAIK, XC/XL has already some customers and that is an additional pressure
> on their development team, which is now called X2. I don't exactly know how
> internal Huawei's MPPDB is connected to XC/XL.
>

Huawei's MPPDB is based on PG-XC and tailored it more targeting OLAP scenarios.

The basic idea is that OLAP needs a shared nothing scale out
architecture for read and write. It needs ok-TP-performance, a
restricted set of functionality, and thus avoids some problems like
GTM being a central scaling bottleneck.

I advocate to merge PostgreSQL core with scale-out features, if we are
ready to face some long time functional discrepancies between the two
deployments.

Regards,
Qingqing



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
All, Bruce:

First, let me put out there that I think the horizontal scaling project
which has buy-in from the community and we're working on is infinitely
better than the one we're not working on or is an underresourced fork.
So we're in agreement on that.  However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people.  The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.

Particularly, I'm concerned that we already have two projects in process
aimed at horizontal scalability, and it seems like we could bring either
(or both) projects to production quality MUCH faster than we could make
an FDW-based solution work.  These are:

* pg_shard
* BDR

It seems worthwhile, just as a thought experiment, if we can get where
we want using those, faster, or by combining those with new FDW features.

It's also important to recognize that there are three major use-cases
for write-scalable clustering:

* OLTP: small-medium cluster, absolute ACID consistency, bottlnecked on small writes per second
* DW: small-large cluster, ACID optional, bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional, bottlenecked on # of connections

We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which can
be adapted to all of them.

I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design.  For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.

On some other questions from Mason:

> Do we want multiple copies of shards, like the pg_shard approach? Or
> keep things simpler and leave it up to the DBA to add standbys? 

We want multiple copies of shards created by the sharding system itself.Having a separate, and completely orthagonal,
redundancysystem to the
 
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.

> Do we want to leverage table inheritance? If so, we may want to spend
> time improving performance for when the number of shards becomes large
> with what currently exists. If using table inheritance, we could add the
> ability to specify what node (er, foreign server) the subtable lives on.
> We could create top level sharding expressions that allow these to be
> implicitly created.

IMHO, given that we're looking at replacing inheritance because of its
many documented limitations, building sharding on top of inheritance
seems unwise.  For example, many sharding systems are hash-based; how
would an inheritance system transparently use hash keys?

> Should we allow arbitrary expressions for shards, not just range, list
> and hash?

That seems like a 2.0 feature.  It also doesn't seem necessary to
support it for the moderately skilled user; that is, requiring a special
C sharding function for this seems fine to me.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 08/31/2015 01:16 PM, Josh Berkus wrote:
> All, Bruce:
>

> I'm also going to pontificate that, for a future solution, we should not
> focus on write *IO*, but rather on CPU and RAM. The reason for this
> thinking is that, with the latest improvements in hardware and 9.5
> improvements, it's increasingly rare for machines to be bottlenecked on
> writes to the transaction log (or the heap). This has some implications
> for system design.  For example, solutions which require all connections
> to go through a single master node do not scale sufficiently to be worth
> bothering with.

We see this already, under very high concurrency (lots of connections, 
many cores) we often see a significant drop in performance that is not 
related to IO in any meaningful way.

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
> First, let me put out there that I think the horizontal scaling project
> which has buy-in from the community and we're working on is infinitely
> better than the one we're not working on or is an underresourced fork.
> So we're in agreement on that.  However, I think there's a lot of room
> for discussion; I feel like the FDW approach was decided in exclusive
> meetings involving a very small number of people.  The FDW approach
> *may* be the right approach, but I'd like to see some rigorous
> questioning of that before it's final.

It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries.   As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote.  As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need.  As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw?  That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

It's true that postgres_fdw doesn't do everything we need yet.  The
new join pushdown hooks aren't used by postgres_fdw yet, and the API
itself has some bugs with EvalPlanQual handling.  Aggregate pushdown
is waiting on upper planner path-ification.   DML pushdown doesn't
exist yet, and the hooks that would enable pushdown of ORDER BY
clauses to the remote side aren't being used by postgres_fdw.  But all
of these things have been worked on.  Patches for many of them have
already been posted.  They have suffered from a certain amount of
neglect by senior hackers, and perhaps also from a shortage of time on
the part of the authors.  But an awful lot of the work that is needed
here has already been done, if only we could get it committed.
Aggregate pushdown is a notable exception, but abandoning the foreign
data wrapper approach in favor of something else won't fix that.

Postgres-XC developed a purpose-built system for talking to other
nodes instead of using the FDW interface, for the very good reason
that the FDW interface did not yet exist at the time that Postgres-XC
was created.  But several people associated with the XC project have
said, including one on this thread, that if it had existed, they
probably would have used it.  And it's hard to see why you wouldn't:
with XC's approach, the remote data source is presumed to be
PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
facility as part of a sharding solution.  The FDW interface can talk
to anything, and it can be used for stuff other than sharding, like
making one remote table appear local because you just happen to want
that for some reason.  This makes the XC approach look rather brittle
by comparison.  I don't blame the XC folks for taking the shortest
path between two points, but FDWs are better, and we ought to try to
leverage that.

> Particularly, I'm concerned that we already have two projects in process
> aimed at horizontal scalability, and it seems like we could bring either
> (or both) projects to production quality MUCH faster than we could make
> an FDW-based solution work.  These are:
>
> * pg_shard
> * BDR
>
> It seems worthwhile, just as a thought experiment, if we can get where
> we want using those, faster, or by combining those with new FDW features.

I think it's abundantly clear that we need a logical replication
solution as part of any horizontal scalability story.  People will
want to do things like have 10 machines with each piece of data on 3
of them, and there won't be any reasonable way of doing that without
logical replication.  I assume that BDR, or some technology derived
from it, will end up in core and solve that problem.  I had actually
hoped we were going to get that in 9.5, but it didn't happen that way.
Still, I think that getting first single-master, and then eventually
multi-master, logical replication in core is absolutely critical.  And
not just for sharding specifically: replicating your whole database to
several nodes and load-balancing your clients across them isn't
sharding, but it does give you read scalability and is a good fit for
people with geographically dispersed data with good geographical
locality.  I think a lot of people will want that.

I'm not quite sure yet how we can marry declarative partitioning and
better FDW-pushdown and logical replication into one seamless, easy to
deploy solution that requires very low administrator effort.  But I am
sure that each of those things, taken individually, is very useful,
and that being able to construct a solution from those building blocks
would be a big improvement over what we have today.  I can't imagine
that trying to do one monolithic project that provides all of those
things, but only if you combine them in the specific way that the
designer had in mind, is ever going to be successful.  People _will_
want access to each of those features in an unbundled fashion.  And,
trying to do them altogether leads to trying to solve too many
problems at once.  I think the history of Postgres-XC is a cautionary
tale there.

I don't really understand how pg_shard fits into this equation.  It
looks to me like it does some interesting things but, for example, it
doesn't support JOIN pushdown, and suggests that you use the
proprietary CitusDB engine if you need that.  But I think JOIN
pushdown is something we want to have in core, not something where we
want to point people to proprietary alternatives.  And it has some
restrictions on INSERT statements - they have to contain only values
which are constants or which can be folded to constants.  I'm just
guessing, but I bet that's probably due to some limitation which
pg_shard, being out of core, has difficulty overcoming, but we can do
better in core.  Basically I guess I expect much of what pg_shard does
to be subsumed as we improve FDWs, but maybe not all of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Sumedh Pathak
Date:
Hi Bruce,

Sumedh from Citus Data here.

August, 2015:  While speaking at SFPUG, Citus Data approached me about joining the FDW sharding team.  They have been invited to the September 1 meeting, as have the XC and XL people.

I'd like to add a clarification. We already tried the FDW APIs for pg_shard two years ago and failed. We figured sharing our learnings could contribute to the technical discussion and that's why we wanted to be in the call.

Ozgun summarized our technical learnings in this design document: https://goo.gl/vJWF85

In the document, we focused on one of the four learnings we had with FDW APIs. For us, we switched to the hook API based approach, and things went smoothly from there.

Best,
Sumedh

On Sat, Aug 29, 2015 at 7:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:

2012-2013:  As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.

Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes.  They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future.  This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.

2014:  I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres.  (I don't remember where the original idea
came from.)  It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high right scale right now, and the fork
maintenance overhead is a huge burden.

I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding.  (We already see some of those
FDW features in 9.5.)

October, 2014:  EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team.  The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.

June, 2015:  I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.

July, 2015:  Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June.  A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.

August, 2015:  While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team.  They have been invited to the September
1 meeting, as have the XC and XL people.

October, 2015:  EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna.   Everyone is invited, but it is hoped
most of the September 1 folks can attend.

February, 2016:  Oleg is planning a similar meeting at their February
Moscow conference.

Anyway, I wanted to explain the work that has been happening around
sharding.  As things move forward, I am increasingly convinced that write
scaling will be needed soon, that the XC approach is the only reasonable
way to do it, and that FDWs are the cleanest way to get it into community
Postgres.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Sumedh Pathak
Citus Data

Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 08/31/2015 02:47 PM, Robert Haas wrote:
> On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> First, let me put out there that I think the horizontal scaling project
>> which has buy-in from the community and we're working on is infinitely
>> better than the one we're not working on or is an underresourced fork.
>> So we're in agreement on that.  However, I think there's a lot of room
>> for discussion; I feel like the FDW approach was decided in exclusive
>> meetings involving a very small number of people.  The FDW approach
>> *may* be the right approach, but I'd like to see some rigorous
>> questioning of that before it's final.
> 
> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries.   As far as (1) is concerned, we need declarative
> partitioning, which is being worked on by Amit Langote.  As far as (2)
> is concerned, I hope and expect BDR, or technology derived therefrom,
> to eventually fill that need.  

Well, maybe.  If you look at pg_shard, you'll see that it works by
multiplexing writes to all copies.  There's a good reason to do that; it
allows you to have a tight feedback loop between the success of writes
and the availability of "good" nodes.  If you're depending on a separate
replication system to handle getting row copies from one shard to
another, then you need a different way to deal with bad nodes and with
inconsistency between copies of shards.  That's why the existing
multinode non-relational databases don't separate replication from
writes, either.

For that matter, if what you want is transactional fully ACID sharding,
I really don't see a way to do it via BDR, since BDR is purely
asynchronous replication, as far as I know.

Also, if we want BDR to do this, that's pretty far afield of what BDR is
currently capable of, so someone will need to put serious work into it
rather than just assuming functionality will show up.

> As far as (3) is concerned, why
> wouldn't we use the foreign data wrapper interface, and specifically
> postgres_fdw?  That interface was designed for the explicit purpose of
> allowing access to remote data sources, and a lot of work has been put
> into it, so it would be highly surprising if we decided to throw that
> away and develop something completely new from the ground up.

Well, query hooks are also a capability which we already have, and is
mature.  Citus has already posted about why they chose to use them instead.

As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.

Also consider that (3) includes both reads and writes.

> I think it's abundantly clear that we need a logical replication
> solution as part of any horizontal scalability story.  People will
> want to do things like have 10 machines with each piece of data on 3
> of them, and there won't be any reasonable way of doing that without
> logical replication.  I assume that BDR, or some technology derived
> from it, will end up in core and solve that problem.  I had actually
> hoped we were going to get that in 9.5, but it didn't happen that way.
> Still, I think that getting first single-master, and then eventually
> multi-master, logical replication in core is absolutely critical.  And
> not just for sharding specifically: replicating your whole database to
> several nodes and load-balancing your clients across them isn't
> sharding, but it does give you read scalability and is a good fit for
> people with geographically dispersed data with good geographical
> locality.  I think a lot of people will want that.

Well, the latter thing is something which BDR is designed for, so all
that needs to happen with that is getting the rest of the plumbing into
core.  Also documentation, packaging, productization, etc.  But the
heavy lifting has already been done.

However, integrating BDR with sharding has some major design issues
which aren't trivial and may be unresolvable, per above.

> I'm not quite sure yet how we can marry declarative partitioning and
> better FDW-pushdown and logical replication into one seamless, easy to
> deploy solution that requires very low administrator effort.  But I am
> sure that each of those things, taken individually, is very useful,
> and that being able to construct a solution from those building blocks
> would be a big improvement over what we have today.  I can't imagine
> that trying to do one monolithic project that provides all of those
> things, but only if you combine them in the specific way that the
> designer had in mind, is ever going to be successful.  People _will_
> want access to each of those features in an unbundled fashion.  And,
> trying to do them altogether leads to trying to solve too many
> problems at once.  I think the history of Postgres-XC is a cautionary
> tale there.

Yes.  It's also a cautionary tale about not skipping over major design
elements (like HA and DR) until after version 1.0, which is one of the
reasons I'm harping on certain things here.  I don't want us to repeat
those mistakes.

> I don't really understand how pg_shard fits into this equation.  It
> looks to me like it does some interesting things but, for example, it
> doesn't support JOIN pushdown, and suggests that you use the
> proprietary CitusDB engine if you need that.  But I think JOIN
> pushdown is something we want to have in core, not something where we
> want to point people to proprietary alternatives.  And it has some
> restrictions on INSERT statements - they have to contain only values
> which are constants or which can be folded to constants.  I'm just
> guessing, but I bet that's probably due to some limitation which
> pg_shard, being out of core, has difficulty overcoming, but we can do
> better in core.  Basically I guess I expect much of what pg_shard does
> to be subsumed as we improve FDWs, but maybe not all of it.

pg_shard provides an alternate implementation based on planner hooks
instead of FDWs.  Even if you pursue an FDW-based design, you should
look at (a) why the Citus team found FDWs to be unworkable and (b) what
got implemented in planner hooks.  Otherwise you're liable to repeat the
exact same "learning experience".

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Marc Munro
Date:
On Mon, 2015-08-31 at 22:21 +0000,  Robert Haas wrote:

> It seems to me that sharding consists of (1) breaking your data set up
> into shards, (2) possibly replicating some of those shards onto
> multiple machines, and then (3) being able to access the remote data
> from local queries. [...]

I believe there is another aspect to sharding that I have not yet seen
mentioned, which is one of connection routing.

One use case that I have been involved in, is to simply partition the
application into entirely, or almost entirely, separate datasets running
on separate databases with little or no need for queries to access
remote data.

This allows each database to deal only with connections from clients
that actually want its local data, greatly reducing the number of
connections on any individual database.  If this works for your
application, your ability to scale is great.

The pain point comes from trying to route queries to the correct
database.  Inevitably, everyone taking this route builds custom
connection-selection layers into their apps.

It seems to me that even with the more sophisticated types of sharding
being discussed here, the ability to conditionally route a
query/connection to a suitable starting database could be quite
beneficial.

Although this is probably a job for the pgbouncer/pgpool developers
rather than the hackers on this list, this thread seems to be a good
place to mention it.

__
Marc





Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
> > As far as (3) is concerned, why
> > wouldn't we use the foreign data wrapper interface, and specifically
> > postgres_fdw?  That interface was designed for the explicit purpose of
> > allowing access to remote data sources, and a lot of work has been put
> > into it, so it would be highly surprising if we decided to throw that
> > away and develop something completely new from the ground up.
> 
> Well, query hooks are also a capability which we already have, and is
> mature.  Citus has already posted about why they chose to use them instead.
> 
> As long as you recognize that the FDW API (not just the existing fdws)
> will need to expand to make this work, it's a viable path.

Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
you want to confirm that everyone agrees that major FDW improvements are
a requirement for this.  

My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific.  I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start.  :-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> My hope is that many FDW improvements will benefit sharding and
> non-sharding workloads, but I bet some improvements are going to be
> sharding-specific.  I would say we are still in the exploratory stage,
> but based on the number of people who care about this feature and want
> to be involved, I think we are off to a very good start.  :-)

Having lots of interested people doesn't help with some problems,
though.  The Citus document says:
And the issue with these four limitations wasn't with foreigndata wrappers. We wrote mongo_fdw and cstore_fdw, and
we'requitehappy with the contract FDWs provide. The problem was thatwe were trying to retrofit an API for something
thatit wasfundamentally not designed to do.
 

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Etsuro Fujita
Date:
On 2015/09/01 9:54, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
>>> As far as (3) is concerned, why
>>> wouldn't we use the foreign data wrapper interface, and specifically
>>> postgres_fdw?  That interface was designed for the explicit purpose of
>>> allowing access to remote data sources, and a lot of work has been put
>>> into it, so it would be highly surprising if we decided to throw that
>>> away and develop something completely new from the ground up.
>>
>> Well, query hooks are also a capability which we already have, and is
>> mature.  Citus has already posted about why they chose to use them instead.
>>
>> As long as you recognize that the FDW API (not just the existing fdws)
>> will need to expand to make this work, it's a viable path.
>
> Uh, we already have a list of things we need to add to FDWs to make them
> work, and Citus Data has provided a document of more things that are
> needed, https://goo.gl/vJWF85.

My top priority is postgres_fdw join pushdown, but I also plan to work 
on update pushdown [1] for 9.6, which couldn't make it into 9.5.  I 
think that would resolve the following issue mentioned in the document:
UPDATE and DELETE operations are performed by first fetchingrecords from the table scanning functions, and then going
overthefetched records. If the user wanted to update a single row,this involved first pulling rows and then updating
relatedrecords.

Best regards,
Etsuro Fujita

[1] https://commitfest.postgresql.org/4/162/



Re: Horizontal scalability/sharding

From
Pavan Deolasee
Date:


On Tue, Sep 1, 2015 at 3:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:


It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries.   As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote.  As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need.  As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw?  That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

It's true that postgres_fdw doesn't do everything we need yet.  The
new join pushdown hooks aren't used by postgres_fdw yet, and the API
itself has some bugs with EvalPlanQual handling.  Aggregate pushdown
is waiting on upper planner path-ification.   DML pushdown doesn't
exist yet, and the hooks that would enable pushdown of ORDER BY
clauses to the remote side aren't being used by postgres_fdw.  But all
of these things have been worked on.  Patches for many of them have
already been posted.  They have suffered from a certain amount of
neglect by senior hackers, and perhaps also from a shortage of time on
the part of the authors.  But an awful lot of the work that is needed
here has already been done, if only we could get it committed.
Aggregate pushdown is a notable exception, but abandoning the foreign
data wrapper approach in favor of something else won't fix that.

Postgres-XC developed a purpose-built system for talking to other
nodes instead of using the FDW interface, for the very good reason
that the FDW interface did not yet exist at the time that Postgres-XC
was created.  But several people associated with the XC project have
said, including one on this thread, that if it had existed, they
probably would have used it.  And it's hard to see why you wouldn't:
with XC's approach, the remote data source is presumed to be
PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
facility as part of a sharding solution.  The FDW interface can talk
to anything, and it can be used for stuff other than sharding, like
making one remote table appear local because you just happen to want
that for some reason.  This makes the XC approach look rather brittle
by comparison.  I don't blame the XC folks for taking the shortest
path between two points, but FDWs are better, and we ought to try to
leverage that.


In my discussions on this topic with various folks including Robert, I've conceded that if FDW was available when XC was first written, in all likelihood we would have used and extended that interface. But that wasn't the case and we did what we thought was the best solution at that time, given the resources and the schedule. To be honest, when XC project was started, I was quite skeptical about the whole thing given the goal was to built something which can replace Oracle RAC with may be less than 1% resources of what Oracle must have invested in building RAC. The lack of resources at the start of the project keeps showing up in the quality issues that users report from time to time. Having said that, I am quite satisfied with what we have been able to build within the constraints. 

But FDW is just one part of the story. There is this entire global consistency problem that would require something like GTM to give out XIDs and snapshots, atomicity which would require managing transactions across multiple shards, join pushdowns when all data is not available locally, something that XL is attempting to solve with datanode-datanode exchange of information, other global states such as sequences, replicating some part of the data to multiple shards for efficient operations, ability to add/remove shards with least disruption, globally consistent backups/restore. XC/XL has attempted to solve each of them to some extent. I don't claim that they are completely solved and there are no corner cases left, but we have made fairly good progress on each of them.
 
My worry is that if we start implementing them again from scratch, it will take a few years before we get them in a usable state. What XC/XL lacked is probably a Robert Haas or a Tom Lane who could look at the work and suggest major edits. If that had happened, the quality of the product could have been much better today. I don't mean to derate the developers who worked on XC/XL, but there is no harm in accepting that if someone with a much better understanding of the whole system was part of the team, that would have positively impacted the project. Is that an angle worth exploring? Does it make sense to commit some more resources to say XC or XL and try to improve the quality of the product even further? To be honest, XL is in far far better shape (haven't really tried XC in a while) and some more QA/polishing can make it production ready much sooner.

Yet another possibility is rework the design such that only coordinator needs to be a fork of PostgreSQL but the shards are all PostgreSQL instances, queried using standard client APIs. That would reduce the code that needs to go in the core to build the entire scalable system and also shorten the timeline considerably.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Horizontal scalability/sharding

From
Andres Freund
Date:
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> Uh, we already have a list of things we need to add to FDWs to make them
> work, and Citus Data has provided a document of more things that are
> needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> you want to confirm that everyone agrees that major FDW improvements are
> a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implemenentation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features and you're kind afraid your relatives
will see what you've done.



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
> I'm also going to pontificate that, for a future solution, we should not
> focus on write *IO*, but rather on CPU and RAM. The reason for this
> thinking is that, with the latest improvements in hardware and 9.5
> improvements, it's increasingly rare for machines to be bottlenecked on
> writes to the transaction log (or the heap). This has some implications
> for system design.  For example, solutions which require all connections
> to go through a single master node do not scale sufficiently to be worth
> bothering with.

Well, I highlighted write IO for sharding because sharding is the only
solution that allows write scaling.  If we want to scale CPU, we are
better off using server parallelism, and to scale CPU and RAM, a
multi-master/BDR solution seems best.  (Multi-master doesn't do write
scaling because you eventually have to write all the data to each node.)

For read-scaling, having the read queries go to streaming slave seems
best.

Only sharding allows for write scaling, but only in very limited
use-cases, where the queries use multiple shards and the
sharding/communication overhead doesn't overwhelm the benefits.

For these reasons, I think sharding has a limited use, and hence, I
don't think the community will be willing to add a lot of code just to
enable auto-sharding.  I think it has to be done in a way that adding
sharding also gives other benefits, like better FDWs and cross-node ACID
control.

In summary, I don't think adding a ton of code just to do sharding will
be acceptable.  A corollary of that, is that if FDWs are unable to
provide useful sharding, I don't see an acceptable way of adding
built-in sharding to Postgres.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Tue, Sep  1, 2015 at 10:15:27AM +0200, Andres Freund wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > Uh, we already have a list of things we need to add to FDWs to make them
> > work, and Citus Data has provided a document of more things that are
> > needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> > you want to confirm that everyone agrees that major FDW improvements are
> > a requirement for this.
> 
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
> 
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implemenentation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind afraid your relatives
> will see what you've done.

Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another.  ;-)

Seem my post to Josh Berkus just now --- I think if we don't use FDWs,
that sharding is such a limited use-case that we will not implement it
inside of Postgres.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
> My worry is that if we start implementing them again from scratch, it will take
> a few years before we get them in a usable state. What XC/XL lacked is probably
> a Robert Haas or a Tom Lane who could look at the work and suggest major edits.
> If that had happened, the quality of the product could have been much better
> today. I don't mean to derate the developers who worked on XC/XL, but there is
> no harm in accepting that if someone with a much better understanding of the
> whole system was part of the team, that would have positively impacted the
> project. Is that an angle worth exploring? Does it make sense to commit some
> more resources to say XC or XL and try to improve the quality of the product
> even further? To be honest, XL is in far far better shape (haven't really tried
> XC in a while) and some more QA/polishing can make it production ready much
> sooner.

There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.

I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.  

(I think we got a little too cute in enhancing existing Postgres
capabilities to add partitioning, and I think we need to fix that.
Hopefully we don't do the same thing with sharding.)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Mark Kirkwood
Date:
On 01/09/15 21:41, Bruce Momjian wrote:
>
> Well, reworking our partitioning system is one of the things required
> for sharding, so at least we will clean up one mess while we create
> another.  ;-)
>
> Seem my post to Josh Berkus just now --- I think if we don't use FDWs,
> that sharding is such a limited use-case that we will not implement it
> inside of Postgres.
>

I'm thinking that partitioning and sharding are two different things:

Partitioning is about reducing the amount of table data accessed and 
also perhaps easing admin activities (archiving/removing old stuff etc).

Sharding is a about parallelism and redundancy...copies of stuff in 
different places and concurrent access by virtue of it being on 
different nodes!

Now *maybe* FDW is a good way to approach this, but really would be nice 
to see a more rigorous analysis (I note that like XC and XL, Greenplum 
looked at the existing mechanisms around at the time and ended up 
writing their own). Now I'm aware that things have moved on - but I 
think there needs to be a proper discussion about design and what we 
think distributed data/sharding etc should provide *before* grabbing 
hold of FDW as *the answer*!

Regards

Mark



Re: Horizontal scalability/sharding

From
Mason S
Date:


On Mon, Aug 31, 2015 at 3:08 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:

 We also a bit disappointed by Huawei position about CSN patch, we hoped to use for  our XTM. 

Disappointed in what way? Moving to some sort of CSN approach seems to open things up for different future ideas. In the short term, it would mean replacing potentially large snapshots and longer visibility checks. In the long term, perhaps CSN could help simplify the design of multi-master replication schemes. 

We are disappointed because at PGCon talk Huawei announced publishing of their CSN patch and further work in this direction together with community. However, it's even not published yet despite all the promises. Nobody from Huawei answers CSN thread in the hackers.
So, I think we got nothing from Huawei except teasing and should rely only on ourselves. That is disappointing.


Oh, I see. I was in contact with some of the folks involved a few weeks ago.  The intention is to do a bit more work on the patch and then post it. This should be happening "soon".

Thanks,

Mason

Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > My hope is that many FDW improvements will benefit sharding and
> > non-sharding workloads, but I bet some improvements are going to be
> > sharding-specific.  I would say we are still in the exploratory stage,
> > but based on the number of people who care about this feature and want
> > to be involved, I think we are off to a very good start.  :-)
> 
> Having lots of interested people doesn't help with some problems,
> though.  The Citus document says:
> 
>     And the issue with these four limitations wasn't with foreign
>     data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
>     quite happy with the contract FDWs provide. The problem was that
>     we were trying to retrofit an API for something that it was
>     fundamentally not designed to do.

I had a chance to review the Citus Data document just now:
https://goo.gl/vJWF85

Particularly, it links to this document, which is clearer about the
issues they are trying to solve:
https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

The document opens a big question --- when queries can't be processed in
a traditional top/down fashion, Citus has the goal of sending groups of
results up the the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations.  The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending as digested
data as possible to the coordinator, and having the coordinator complete
any remaining processing.  I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement.  I can see FDWs _not_ working well for that use-case.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Mason S
Date:


On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > My hope is that many FDW improvements will benefit sharding and
> > non-sharding workloads, but I bet some improvements are going to be
> > sharding-specific.  I would say we are still in the exploratory stage,
> > but based on the number of people who care about this feature and want
> > to be involved, I think we are off to a very good start.  :-)
>
> Having lots of interested people doesn't help with some problems,
> though.  The Citus document says:
>
>       And the issue with these four limitations wasn't with foreign
>       data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
>       quite happy with the contract FDWs provide. The problem was that
>       we were trying to retrofit an API for something that it was
>       fundamentally not designed to do.

I had a chance to review the Citus Data document just now:

        https://goo.gl/vJWF85

Particularly, it links to this document, which is clearer about the
issues they are trying to solve:

        https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

The document opens a big question --- when queries can't be processed in
a traditional top/down fashion, Citus has the goal of sending groups of
results up the the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations.  The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending as digested
data as possible to the coordinator, and having the coordinator complete
any remaining processing.  I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement.  I can see FDWs _not_ working well for that use-case.


For efficient internodes joins with row shipping, FDWs may also not be easy to do. Maybe it is possible if we optionally pass in lists of other nodes and information about how they are partitioned so data knows where to get shipped. 

A challenge for planning with arbitrary copies of different shards is that sometimes you may be able to push down joins, sometimes not. Planning and execution get ugly. Maybe this can be simplified by parent-child tables following the same partitioning scheme.

Mason

Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 09/01/2015 02:48 AM, Bruce Momjian wrote:
> On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:

> There is no question that using XC/XL will get us to a usable solution
> faster, but see my recent post to Josh Berkus --- the additional code
> will be so burdensome that I doubt it would be accepted. If it was, I
> bet we would have considered it long ago.
>
> I think the only way we are going to get sharding into Postgres is to do
> it in a way that enhances existing Postgres capabilities.

So that we have XL again?

This reads like: We don't want to merge a bunch of code that a non .Org 
project wrote. Instead we want to write a bunch of code that will all 
but duplicate what that other project wrote because.... ?

Don't get me wrong, I am all about this goal and if XL truly doesn't fit 
that's cool but I think the idea that our project will somehow reinvent 
a better wheel is naive.

> (I think we got a little too cute in enhancing existing Postgres
> capabilities to add partitioning, and I think we need to fix that.
> Hopefully we don't do the same thing with sharding.)

Not cute, lazy. We implemented the barest minimum within our existing 
infrastructure and we have been kicking ourselves ever since.

Sincerely,

JD



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 09/01/2015 02:58 AM, Mark Kirkwood wrote:
> On 01/09/15 21:41, Bruce Momjian wrote:

> I'm thinking that partitioning and sharding are two different things:
>
> Partitioning is about reducing the amount of table data accessed and
> also perhaps easing admin activities (archiving/removing old stuff etc).
>
> Sharding is a about parallelism and redundancy...copies of stuff in
> different places and concurrent access by virtue of it being on
> different nodes!

In our world, they are complimentary. Consider partitioning that uses 
FDW tables with proper plan push down etc....

>
> Now *maybe* FDW is a good way to approach this, but really would be nice
> to see a more rigorous analysis (I note that like XC and XL, Greenplum
> looked at the existing mechanisms around at the time and ended up
> writing their own). Now I'm aware that things have moved on - but I
> think there needs to be a proper discussion about design and what we
> think distributed data/sharding etc should provide *before* grabbing
> hold of FDW as *the answer*!

Agreed.

Sincerely,

Joshua D. Drake



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
David Fetter
Date:
On Tue, Sep 01, 2015 at 10:15:27AM +0200, Andres Freund wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > Uh, we already have a list of things we need to add to FDWs to make them
> > work, and Citus Data has provided a document of more things that are
> > needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> > you want to confirm that everyone agrees that major FDW improvements are
> > a requirement for this.
> 
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
> 
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implemenentation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind afraid your relatives
> will see what you've done.

Whatever we decide on, we can only count on built-in multi-node being
adopted if all the needed bits needed ship with every PostgreSQL
installation.  If we require people do Install More Software™ in order
to get a feature, we're going to lose a majority of our potential
base.

If it turns out we need the PostgreSQL FDW, and I believe some
consensus is starting to gel around that, I can see not installing it
in template1 by default.  Even that's a questionable decision, as
merely having the software in place does not credibly increase the
attack surface, and does up the installation procedures by a fallible
step.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> My worry is that if we start implementing them again from scratch, it will
> take a few years before we get them in a usable state. What XC/XL lacked is
> probably a Robert Haas or a Tom Lane who could look at the work and suggest
> major edits. If that had happened, the quality of the product could have
> been much better today. I don't mean to derate the developers who worked on
> XC/XL, but there is no harm in accepting that if someone with a much better
> understanding of the whole system was part of the team, that would have
> positively impacted the project. Is that an angle worth exploring? Does it
> make sense to commit some more resources to say XC or XL and try to improve
> the quality of the product even further? To be honest, XL is in far far
> better shape (haven't really tried XC in a while) and some more QA/polishing
> can make it production ready much sooner.

From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be.  If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it.  If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them.  However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL.  The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good.  The reality is,
also, that horizontal scalability isn't an optional feature.  There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core.  That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
>> Uh, we already have a list of things we need to add to FDWs to make them
>> work, and Citus Data has provided a document of more things that are
>> needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
>> you want to confirm that everyone agrees that major FDW improvements are
>> a requirement for this.
>
> Several people saying that the FDW infrastructure isn't sufficient right
> now is pretty far from implying that all of them agree that the FDW API
> is the way to go.
>
> I'm not sure myself. If it works out it's going to save us some work and
> make it more realistic to get there sometime not too far off. But I'm
> afraid that the resulting system will feel like our current partitioning
> implemenentation. Yes, it kinda works, but it's hard to get started, it
> doesn't support too many features and you're kind afraid your relatives
> will see what you've done.

I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better.  On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize.  Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.  Those patches have
been around for some time, but progress has been slow.  Core
infrastructure exists to allow sort pushdown, but nobody's done
anything with it.  Aggregate pushdown hasn't happened yet due to the
dependency on upper planner path-ification, but it's not as if some
alternative to the FDW interface is going to dodge that problem.

It would be a bad idea to cling blindly to the FDW infrastructure if
it's fundamentally inadequate to do what we want.  On the other hand,
it would also be a bad idea to set about recreating it without a
really good reason, and - just to take one example - the fact that it
doesn't currently push down DML operations to the remote side is not a
really good reason to rewrite the whole thing.  On the contrary, it's
a reason to put some energy into the already-written patch which
implements that optimization.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 09/01/2015 09:08 AM, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee

>  From my point of view, and EnterpriseDB's point of view, anything that
> doesn't go into the core PostgreSQL distribution isn't really getting
> us where we need to be.  If there's code in XL that would be valuable
> to merge into core PostgreSQL, then let's do it.  If the code cannot
> be used but there are lessons we can learn that will make what does go
> into core PostgreSQL better, let's learn them.

I think that is pretty much anybody in the industry's point of view and 
I agree with it. Although there is definitely an argument to be had 
about whether it is a module or it is in core proper, no matter what it 
should be a .Org project.

You are also correct on the replication statement. Replication should 
have never been outside of core and I say that as the company that wrote 
a replication solution that even Fortune 50 companies used.

I think that we should consider that there are very smart, very talented 
engineers already devoting time to their respective communities (XL/XC) 
that may very much enjoy being part of our community so we don't have to 
reinvent the wheel, again.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing.  I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement.  I can see FDWs _not_ working well for that use-case.

I do think we are going to want to support that.  All the people I've
talked to about parallel and distributed query processing agree that
you need to do that sort of thing to get really good and scalable
performance.  I think that we could make a lot of headway as compared
with the status quo just by implementing more pushdown optimizations
than we have today.  Right now, SELECT COUNT(*) FROM table will suck
back the whole remote table and count the rows locally, and that's
stupid.  We can fix that case with better pushdown logic.  We can also
fix the case of N-way join nests where the joins are either on the
partitioning key or to replicated tables.  But suppose you have a join
between two tables which are sharded across the cluster but not on the
partitioning key.  There's no way to push the join down, so all the
work comes back to the coordinator, which is possibly OK if such
queries are rare, but not so hot if they are frequent.

To leverage the full CPU power of the cluster in such a case, you need
to be able to shuffle data around between the nodes.  You pick one of
the two tables being joined, and based on the partitioning key of that
table, each node examines the other table and, for each row, sends it
to the machine where it MIGHT have one or more join partners.  Then
each node can join its shard of the first table against the rows from
the second table that were sent to it.

Now the question is, where should the code that does all of this live?postgres_fdw?  Some new, sharding-specific FDW?
Incore?  I don't
 
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult.  If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results.  I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 7:01 AM, Mason S <masonlists@gmail.com> wrote:
> For efficient internodes joins with row shipping, FDWs may also not be easy
> to do. Maybe it is possible if we optionally pass in lists of other nodes
> and information about how they are partitioned so data knows where to get
> shipped.
>
> A challenge for planning with arbitrary copies of different shards is that
> sometimes you may be able to push down joins, sometimes not. Planning and
> execution get ugly. Maybe this can be simplified by parent-child tables
> following the same partitioning scheme.

This gets at a problem which Ozgun also mentioned in his Google
document, and which I also discussed with Etsuro Fujita at PGCon: good
query planning requires good metadata, and we don't really have that
today.

I think that a big part of the goal of the declarative partitioning
work that Amit Langote has recently been undertaking is to provide a
catalog representation of the partitioning structure that is easy to
work with, as opposed to just having a bunch of CHECK constraints that
you have to try to reason about.  That's one part of the solution.

You also might need to know more about the remote table than is
captured by the column and data type list.  Most particularly, you
might want to know what indexes exist on the remote side, but
currently, to figure out that out, you'd need to send queries to
retrieve that information to the remote side every time you do
planning, or maybe you could contrive a session-lifespan cache.
That's pretty annoying.

One idea for solving this problem is to allow CREATE INDEX on foreign
tables, but I don't like that much.  There's no guarantee that the
remote side is a PostgreSQL instance, and if it isn't, the relevant
details about the indexes that exist may not be convenient to
represent in our catalogs.  Heck, that can be true even if it is a
PostgreSQL instance, if the remote side relies on an AM or a function
that doesn't exist locally.  But even if both sides are PostgreSQL
instances using only btree indexes on raw columns, now you've put the
burden on the DBA to make sure that the index definitions on the local
and remote sides match, and that's a pain in the neck.

What seems better to me is to allow ANALYZE of a foreign table a place
to record an arbitrary blob of metadata about the remote side that it
can then get access to during planning.  Then, you can record details
about indexes, or statistics that don't fit into the mold of
pg_statistic, or really, anything else you're going to need to figure
out the best plan, and if the DBA changes the configuration on the
remote side, they don't need to update the local configuration to
match; a re-ANALYZE will do the trick.

I'm open to other ideas as well.  Repartitioning will also arise for
an all-local parallel join, and it would be nice if the planner smarts
could be shared between that case and the remote-table case.
Therefore, while I suspect that some of the logic here will end up
inside one or more FDWs, I'm pretty confident that a significant chunk
of it needs to go into the core optimizer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/01/2015 02:39 AM, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
>> I'm also going to pontificate that, for a future solution, we should not
>> focus on write *IO*, but rather on CPU and RAM. The reason for this
>> thinking is that, with the latest improvements in hardware and 9.5
>> improvements, it's increasingly rare for machines to be bottlenecked on
>> writes to the transaction log (or the heap). This has some implications
>> for system design.  For example, solutions which require all connections
>> to go through a single master node do not scale sufficiently to be worth
>> bothering with.
> 
> Well, I highlighted write IO for sharding because sharding is the only
> solution that allows write scaling.  If we want to scale CPU, we are
> better off using server parallelism, and to scale CPU and RAM, a
> multi-master/BDR solution seems best.  (Multi-master doesn't do write
> scaling because you eventually have to write all the data to each node.)

You're assuming that our primary bottleneck for writes is IO.  It's not
at present for most users, and it certainly won't be in the future.  You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.

Currently, CPU resources and locking are the primary bottlenecks on
writing for the vast majority of the hundreds of servers I tune every
year.  This even includes AWS, with EBS's horrible latency; even in that
environment, most users can outstrip PostgreSQL's ability to handle
requests by getting 20K PRIOPs.

Our real future bottlenecks are:

* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches


The only place where IO becomes the bottleneck is for the
batch-processing, high-throughput DW case ... and I would argue that
existing forks already handle that case.

Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes.  Any solution which does not is merely an academic curiosity.

> For these reasons, I think sharding has a limited use, and hence, I
> don't think the community will be willing to add a lot of code just to
> enable auto-sharding.  I think it has to be done in a way that adding
> sharding also gives other benefits, like better FDWs and cross-node ACID
> control.
> 
> In summary, I don't think adding a ton of code just to do sharding will
> be acceptable.  A corollary of that, is that if FDWs are unable to
> provide useful sharding, I don't see an acceptable way of adding
> built-in sharding to Postgres.

So, while I am fully in agreement with you that having side benefits to
our sharding tools, I think you're missing the big picture entirely.  In
a few years, clustered/sharded PostgreSQL will be the default
installation, or we'll be a legacy database.  Single-node and
single-master databases are rapidly becoming history.

From my perspective, we don't need an awkward, limited, bolt-on solution
for write-scaling.  We need something which will become core to how
PostgreSQL works.  I just don't see us getting there with the described
FDW approach, which is why I keep raising issues with it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
> You're assuming that our primary bottleneck for writes is IO.  It's not
> at present for most users, and it certainly won't be in the future.  You
> need to move your thinking on systems resources into the 21st century,
> instead of solving the resource problems from 15 years ago.

Your experience doesn't match mine.  I find that it's frequently
impossible to get the system to use all of the available CPU capacity,
either because you're bottlenecked on locks or because you are
bottlenecked on the  I/O subsystem, and with the locking improvements
in newer versions, the former is becoming less and less common.
Amit's recent work on scalability demonstrates this trend: he goes
looking for lock bottlenecks, and finds problems that only occur at
128+ concurrent connections running full tilt.  The patches show
limited benefit - a few percentage points - at lesser concurrency
levels.  Either there are other locking bottlenecks that limit
performance at lower client counts but which mysteriously disappear as
concurrency increases, which I would find surprising, or the limit is
somewhere else.  I haven't seen any convincing evidence that the I/O
subsystem is the bottleneck, but I'm having a hard time figuring out
what else it could be.

> Our real future bottlenecks are:
>
> * ability to handle more than a few hundred connections
> * locking limits on the scalability of writes
> * ability to manage large RAM and data caches

I do agree that all of those things are problems, FWIW.

> Any sharding solution worth bothering with will solve some or all of the
> above by extending our ability to process requests across multiple
> nodes.  Any solution which does not is merely an academic curiosity.

I think the right solution to those problems is to attack them
head-on.  Sharding solutions should cater to use cases where using all
the resources of one machine isn't sufficient no matter how
efficiently we do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 09/01/2015 10:06 AM, Josh Berkus wrote:
> On 09/01/2015 02:39 AM, Bruce Momjian wrote:
>> On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:

> Our real future bottlenecks are:
>
> * ability to handle more than a few hundred connections

This, 1000 times this. No a connection pooler doesn't help, we are using 
1000 connections WITH a connection pooler. The amount of data centric, 
always connected (even if stateless) apps that are out there is 
astounding. Our performance curve is better (especially since Robert's 
fantastic work on 9.2) but it isn't where a lot of the demanding users 
need to be.

We are also seeing a vast uptick in always connected apps that aren't 
web centric (or the client at least isn't). In other words we are seeing 
situations where 1000 connections is 1000 users, connected.

JD



-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/01/2015 10:17 AM, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Any sharding solution worth bothering with will solve some or all of the
>> above by extending our ability to process requests across multiple
>> nodes.  Any solution which does not is merely an academic curiosity.
> 
> I think the right solution to those problems is to attack them
> head-on.  Sharding solutions should cater to use cases where using all
> the resources of one machine isn't sufficient no matter how
> efficiently we do it.

As long as "all the resources" != "just IO", I'm completely on board
with that.  The reason I raised this is that the initial FDW-based
proposals pretty much scale IO and nothing else.

pg_shard also currently only scales IO, but they're working on that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:
Hi,

On 09/01/2015 07:17 PM, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> You're assuming that our primary bottleneck for writes is IO. It's
>> not at present for most users, and it certainly won't be in the
>> future. You need to move your thinking on systems resources into
>> the 21st century, instead of solving the resource problems from 15
>> years ago.
>
> Your experience doesn't match mine.  I find that it's frequently
> impossible to get the system to use all of the available CPU
> capacity, either because you're bottlenecked on locks or because you
> are bottlenecked on the  I/O subsystem, and with the locking
> improvements in newer versions, the former is becoming less and less
> common. Amit's recent work on scalability demonstrates this trend: he
> goes looking for lock bottlenecks, and finds problems that only occur
> at 128+ concurrent connections running full tilt.  The patches show
> limited benefit - a few percentage points - at lesser concurrency
> levels.  Either there are other locking bottlenecks that limit
> performance at lower client counts but which mysteriously disappear
> as concurrency increases, which I would find surprising, or the limit
> is somewhere else.  I haven't seen any convincing evidence that the
> I/O subsystem is the bottleneck, but I'm having a hard time figuring
> out what else it could be.

Memory bandwidth, for example. It's quite difficult to spot, because the 
intuition is that memory is fast, but thanks to improvements in storage 
(and stagnation in RAM bandwidth), this is becoming a significant issue.

Process-management overhead is another thing we tend to ignore, but once 
you get to many processes all willing to work at the same time, you need 
to account for that.

Of course, this applies differently to different sharding use cases. For 
example analytics workloads have serious issues with memory bandwidth, 
but not so much with process management overhead (because the number of 
connections is usually about number of cores). Use cases with many 
clients (in web-scale use cases) tends to run into both (all the 
processes also have to share all the caches, killing them).

I don't know if sharding can help solving (or at least improve) these 
issues. And if sharding in general can, I don't know if it still holds 
for FDW-based solution.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Memory bandwidth, for example. It's quite difficult to spot, because the
> intuition is that memory is fast, but thanks to improvements in storage (and
> stagnation in RAM bandwidth), this is becoming a significant issue.

I'd appreciate any tips on how to spot problems of this type.  But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time.  If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%.  Rather, it would show up as CPU time
hitting 100% while tps remains low.

> Process-management overhead is another thing we tend to ignore, but once you
> get to many processes all willing to work at the same time, you need to
> account for that.

Any tips on spotting problems in that area?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Andres Freund
Date:
On 2015-09-01 14:11:21 -0400, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > Memory bandwidth, for example. It's quite difficult to spot, because the
> > intuition is that memory is fast, but thanks to improvements in storage (and
> > stagnation in RAM bandwidth), this is becoming a significant issue.
> 
> I'd appreciate any tips on how to spot problems of this type.  But
> it's my impression that perf, top, vmstat, and other Linux performance
> tools will count time spent waiting for memory as CPU time, not idle
> time.  If that's correct, that wouldn't explain workloads where CPU
> utilization doesn't reach 100%.  Rather, it would show up as CPU time
> hitting 100% while tps remains low.

Yea.

-e bus-cycles is a good start to measure where bus traffic isrelevant. Depending on the individual cpu other events can
behelpful.
 

> > Process-management overhead is another thing we tend to ignore, but once you
> > get to many processes all willing to work at the same time, you need to
> > account for that.
> 
> Any tips on spotting problems in that area?

Not perfect, but -e context-switches (general context switches) and -e
syscalls:sys_enter_semop (for postgres enforced context switches) is
rather useful when combined with --call-graph dwarf ('fp' sometimes
doesn't see through libc which is most of the time not compiled with
-fno-omit-frame-pointer).

Greetings,

Andres Freund



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:
Hi,

On 08/31/2015 10:16 PM, Josh Berkus wrote:
> It's also important to recognize that there are three major use-cases
> for write-scalable clustering:
>
> * OLTP: small-medium cluster, absolute ACID consistency,
>    bottlnecked on small writes per second
> * DW: small-large cluster, ACID optional,
>    bottlenecked on bulk reads/writes
> * Web: medium to very large cluster, ACID optional,
>    bottlenecked on # of connections
>
> We cannot possibly solve all of the above at once, but to the extent
> that we recognize all 3 use cases, we can build core features which
> can be adapted to all of them.

It would be good to have a discussion about use-cases first - each of us 
is mostly concerned about the use cases they're dealing with, with 
bottlenecks specific to their environment. These three basic use-cases 
seem like a good start, but some of the details certainly don't match my 
experience ...

For example I can't see how ACID can be optional for the DWH use-case, 
but maybe there's a good explanation - I can imagine sacrificing various 
ACID properties at the node level, but I can't really imagine 
sacrificing any of the ACID properties for the cluster as a whole. So 
this would deserve some explanation.

I also don't share the view that write scalability is the only (or even 
main) issue, that we should aim to solve. For the business-intelligence 
use cases I've been working on recently, handling complex read-only 
ad-hoc queries is often much more important. And in those cases the 
bottleneck is often CPU and/or RAM.

>
> I'm also going to pontificate that, for a future solution, we should
> not focus on write *IO*, but rather on CPU and RAM. The reason for
> this thinking is that, with the latest improvements in hardware and
> 9.5 improvements, it's increasingly rare for machines to be
> bottlenecked on writes to the transaction log (or the heap). This has
> some implications for system design. For example, solutions which
> require all connections to go through a single master node do not
> scale sufficiently to be worth bothering with.

+1

> On some other questions from Mason:
>
>> Do we want multiple copies of shards, like the pg_shard approach? Or
>> keep things simpler and leave it up to the DBA to add standbys?
>
> We want multiple copies of shards created by the sharding system itself.
>   Having a separate, and completely orthagonal, redundancy system to the
> sharding system is overly burdensome on the DBA and makes low-data-loss
> HA impossible.

IMHO it'd be quite unfortunate if the design would make it impossible to 
combine those two features (e.g. creating standbys for shards and 
failing over to them).

It's true that solving HA at the sharding level (by keeping multiple 
copies of a each shard) may be simpler than combining sharding and 
standbys, but I don't see why it makes low-data-loss HA impossible.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:
Hi,

On 09/01/2015 08:22 PM, Andres Freund wrote:
> On 2015-09-01 14:11:21 -0400, Robert Haas wrote:
>> On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Memory bandwidth, for example. It's quite difficult to spot, because the
>>> intuition is that memory is fast, but thanks to improvements in storage (and
>>> stagnation in RAM bandwidth), this is becoming a significant issue.
>>
>> I'd appreciate any tips on how to spot problems of this type.  But
>> it's my impression that perf, top, vmstat, and other Linux performance
>> tools will count time spent waiting for memory as CPU time, not idle
>> time.  If that's correct, that wouldn't explain workloads where CPU
>> utilization doesn't reach 100%.  Rather, it would show up as CPU time
>> hitting 100% while tps remains low.
>
> Yea.
>
> -e bus-cycles is a good start to measure where bus traffic is
>   relevant. Depending on the individual cpu other events can be helpful.

long-story: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

It's from 2007 and only explains oprofile (chapter 7), which is mostly 
abandoned in favor of perf nowadays. Perf can produce similar stats, so 
the discussion is still valid. But it also shows cachegrind (valgrind 
module).

perf examples: http://www.brendangregg.com/perf.html

Most of the examples with "CPU" in the comment are relevant. Usually 
"perf stat" and "perf stat -d" are good starting points - once you get a 
lot of LLC misses or too many instructions per cycle, it's a sign of 
memory bandwidth problems.

Sadly, this is partially caused by our volcano-style executor and 
sharding alone can do nothing about that.

>
>>> Process-management overhead is another thing we tend to ignore, but once you
>>> get to many processes all willing to work at the same time, you need to
>>> account for that.
>>
>> Any tips on spotting problems in that area?
>
> Not perfect, but -e context-switches (general context switches) and -e
> syscalls:sys_enter_semop (for postgres enforced context switches) is
> rather useful when combined with --call-graph dwarf ('fp' sometimes
> doesn't see through libc which is most of the time not compiled with
> -fno-omit-frame-pointer).

Right, this is about the best I'm aware of.

The problem often is not in the number of context switches, but in the 
fact that all the processes share the same (very limited) L caches on 
the CPU. Each process dirties the caches for the other processes, 
lowering the hit ratios. Which can be spotted using the commands above.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>> We want multiple copies of shards created by the sharding system itself.
>>   Having a separate, and completely orthagonal, redundancy system to the
>> sharding system is overly burdensome on the DBA and makes low-data-loss
>> HA impossible.
> 
> IMHO it'd be quite unfortunate if the design would make it impossible to
> combine those two features (e.g. creating standbys for shards and
> failing over to them).
> 
> It's true that solving HA at the sharding level (by keeping multiple
> copies of a each shard) may be simpler than combining sharding and
> standbys, but I don't see why it makes low-data-loss HA impossible.

Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.

In the case of async rep, every time we fail over a node, the entire
cluser would need to roll back to the last common known-good replay
point, hence high data loss.

In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.

Futher, if using replication the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.

If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.

There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit.  But I really think that
solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Tue, Sep 1, 2015 at 3:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of the
> sharding system if you're aiming for write scalability.

I mostly agree, but keep in mind that you don't have to replicate
everything.  A lot of people might find it useful to replicate the
data that is common to all nodes even if they don't replicate the
sharded data.

Admittedly, there are some problems with snapshots here: if you don't
do anything special about snapshots, then what you have here will be
"eventually consistent" behavior.  But that might be suitable for some
environments, such as very loosely coupled system where not all nodes
are connected all the time.  And, for those environments where you do
need consistent snapshots, we can imagine ways to get that behavior,
like having the GTM consider the transaction uncommitted until it's
been logically replicated to every node.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:
Hi,

On 09/01/2015 09:19 PM, Josh Berkus wrote:
> On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>>> We want multiple copies of shards created by the sharding system
>>> itself. Having a separate, and completely orthagonal, redundancy
>>> system to the sharding system is overly burdensome on the DBA and
>>> makes low-data-loss HA impossible.
>>
>> IMHO it'd be quite unfortunate if the design would make it
>> impossible to combine those two features (e.g. creating standbys
>> for shards and failing over to them).
>>
>> It's true that solving HA at the sharding level (by keeping
>> multiple copies of a each shard) may be simpler than combining
>> sharding and standbys, but I don't see why it makes low-data-loss
>> HA impossible.
>
> Other way around, that is, having replication standbys as the only
> method of redundancy requires either high data loss or high latency
> for all writes.

I haven't said that. I said that we should allow that topology, not that 
it should be the only method of redundancy.

>
> In the case of async rep, every time we fail over a node, the entire
> cluser would need to roll back to the last common known-good replay
> point, hence high data loss.
>
> In the case of sync rep, we are required to wait for at least double
>  network lag time in order to do a single write ... making
> write-scalability quite difficult.

Which assumes that latency (or rather the increase due to syncrep) is a 
problem for the use case. Which may be the case for many use cases, but 
certainly is not a problem for many BI/DWH use cases performing mostly 
large batch loads. In those cases the network bandwidth may be quite 
important resource.

For example assume that there are just two shards in two separate data 
centers, connected by a link with limited bandwidth. Now, let's assume 
you always keep a local replica for failover. So you have A1+A2 in DC1, 
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also 
have to write data to B2 and wait for it. So either you send the data to 
each node separately (consuming 2x the bandwidth), or send it to B1 and 
let it propagate to B2 e.g. through sync rep.

So while you may be right in single-DC deployments, with multi-DC 
deployments the situation is quite different - not only that the network 
bandwidth is not unlimited, but because latencies within DC may be a 
fraction of latencies between the locations (to the extent that the 
increase due to syncrep may be just noise). So the local replication may 
be actually way faster.

I can imagine forwarding the data between B1 and B2 even with a purely 
sharding solution, but at that point you effectively re-implemented syncrep.

IMHO the design has to address the multi-DC setups somehow. I think that 
many of the customers who are so concerned about scaling to many shards 
are also concerned about availability in case of DC outages, no?

We should also consider support for custom topologies (not just a full 
mesh, or whatever we choose as the default/initial topology), which is 
somehow related.

>
> Futher, if using replication the sharding system would have no way
> to (a) find out immediately if a copy was bad and (b) fail over
> quickly to a copy of the shard if the first requested copy was not
> responding. With async replication, we also can't use multiple copies
> of the same shard as a way to balance read workloads.

I don't follow. With sync rep we do know whether the copy is OK or not, 
because the node either confirms writes or not. The failover certainly 
is more complicated and is not immediate (to the extent of keeping a 
copy at the sharding level), but it's a question of trade-offs.

It's true we don't have auto-failover solution at the moment, but as I 
said - I can easily imagine most people using just sharding, while some 
deployments use syncrep with manual failover.

>
> If we write to multiple copies as a part of the sharding feature,
> then that can be parallelized, so that we are waiting only as long as
> the slowest write (or in failure cases, as long as the shard
> timeout). Further, we can check for shard-copy health and update
> shard availability data with each user request, so that the ability
> to see stale/bad data is minimized.

Again, this assumes infinite network bandwidth.

>
> There are obvious problems with multiplexing writes, which you can
> figure out if you knock pg_shard around a bit. But I really think
> that solving those problems is the only way to go.
>
> Mind you, I see a strong place for binary replication and BDR for
> multi-region redundancy; you really don't want that to be part of
> the sharding system if you're aiming for write scalability.

I haven't mentioned BDR at all, and given the async nature I don't have 
a clear idea of how it fits into the sharding world at this point.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Tue, Sep  1, 2015 at 08:18:38AM -0700, Joshua Drake wrote:
> On 09/01/2015 02:48 AM, Bruce Momjian wrote:
> >On Tue, Sep  1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
> 
> >There is no question that using XC/XL will get us to a usable solution
> >faster, but see my recent post to Josh Berkus --- the additional code
> >will be so burdensome that I doubt it would be accepted. If it was, I
> >bet we would have considered it long ago.
> >
> >I think the only way we are going to get sharding into Postgres is to do
> >it in a way that enhances existing Postgres capabilities.
> 
> So that we have XL again?

Kind of.  If XC/XL used FDWs I think we would try to use their code
first.  The issue is that FDWs didn't exist at the time.  I would say
our first approach might be doing XC/XL again with FDWs.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
"David E. Wheeler"
Date:
On Sep 1, 2015, at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Admittedly, there are some problems with snapshots here: if you don't
> do anything special about snapshots, then what you have here will be
> "eventually consistent" behavior.  But that might be suitable for some
> environments, such as very loosely coupled system where not all nodes
> are connected all the time.

Given that we’re discussing multi-node architectures here, you should expect that not all nodes will be connected at
anytime. Nodes fail, but the cluster should not. 

> And, for those environments where you do
> need consistent snapshots, we can imagine ways to get that behavior,
> like having the GTM consider the transaction uncommitted until it's
> been logically replicated to every node.

Again, you need a way to deal with nodes going down. I can envision building a cluster with twelve nodes replicated to
eachof three geographically-distributed data centers. Each replication/sync model needs to be able to handle nodes
goingup and down, data centers or racks going up or down, and nodes being added and removed. 

But even with smaller clusters, there’s no way around the fact that no system can guarantee that all nodes will be
availableat all times. 

Best,

David


Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
> Hi,
> 
> On 09/01/2015 09:19 PM, Josh Berkus wrote:
>> Other way around, that is, having replication standbys as the only
>> method of redundancy requires either high data loss or high latency
>> for all writes.
> 
> I haven't said that. I said that we should allow that topology, not that
> it should be the only method of redundancy.

Ah, OK, I didn't understand you.  Of course I'm in favor of supporting
both methods of redundancy if we can.

>> In the case of sync rep, we are required to wait for at least double
>>  network lag time in order to do a single write ... making
>> write-scalability quite difficult.
> 
> Which assumes that latency (or rather the increase due to syncrep) is a
> problem for the use case. Which may be the case for many use cases, but
> certainly is not a problem for many BI/DWH use cases performing mostly
> large batch loads. In those cases the network bandwidth may be quite
> important resource.

I'll argue that BI/DW is the least interesting use case for mainstream
PostgreSQL because there are production-quality forks which do this
(mostly propietary, but we can work on that).  We really need a solution
which works for OLTP.

> For example assume that there are just two shards in two separate data
> centers, connected by a link with limited bandwidth. Now, let's assume
> you always keep a local replica for failover. So you have A1+A2 in DC1,
> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
> have to write data to B2 and wait for it. So either you send the data to
> each node separately (consuming 2x the bandwidth), or send it to B1 and
> let it propagate to B2 e.g. through sync rep.
> 
> So while you may be right in single-DC deployments, with multi-DC
> deployments the situation is quite different - not only that the network
> bandwidth is not unlimited, but because latencies within DC may be a
> fraction of latencies between the locations (to the extent that the
> increase due to syncrep may be just noise). So the local replication may
> be actually way faster.

I'm not seeing how the above is better using syncrep than using shard
copying?

> I can imagine forwarding the data between B1 and B2 even with a purely
> sharding solution, but at that point you effectively re-implemented
> syncrep.

Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node.  In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.

> IMHO the design has to address the multi-DC setups somehow. I think that
> many of the customers who are so concerned about scaling to many shards
> are also concerned about availability in case of DC outages, no?

Certainly.  But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.

> I don't follow. With sync rep we do know whether the copy is OK or not,
> because the node either confirms writes or not. The failover certainly
> is more complicated and is not immediate (to the extent of keeping a
> copy at the sharding level), but it's a question of trade-offs.
> 
> It's true we don't have auto-failover solution at the moment, but as I
> said - I can easily imagine most people using just sharding, while some
> deployments use syncrep with manual failover.

As long as direct shard copying is available, I'm happy.  I have no
complaints about additional mechanisms.

I'm bringing this up because the FDW proposal made at pgCon did not
include *any* mechanism for HA/redundancy, just some handwaving about
replication and/or BDR. This was one of the critical design failures of
PostgresXC.  A multinode system without automated node failover and
replacement is a low-availability system.

>> If we write to multiple copies as a part of the sharding feature,
>> then that can be parallelized, so that we are waiting only as long as
>> the slowest write (or in failure cases, as long as the shard
>> timeout). Further, we can check for shard-copy health and update
>> shard availability data with each user request, so that the ability
>> to see stale/bad data is minimized.
> 
> Again, this assumes infinite network bandwidth.

In what way is the total network bandwitdh used in the system different
for shard copying than for sync replication?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Tue, Sep  1, 2015 at 12:40:40PM -0400, Robert Haas wrote:
> On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > I assumed these queries were going to be solved by sending as digested
> > data as possible to the coordinator, and having the coordinator complete
> > any remaining processing.  I think we are going to need to decide if
> > such "sending data back to shards" is something we are ever going to
> > implement.  I can see FDWs _not_ working well for that use-case.
> 
> I do think we are going to want to support that.  All the people I've
> talked to about parallel and distributed query processing agree that
> you need to do that sort of thing to get really good and scalable
> performance.  I think that we could make a lot of headway as compared
> with the status quo just by implementing more pushdown optimizations
> than we have today.  Right now, SELECT COUNT(*) FROM table will suck
> back the whole remote table and count the rows locally, and that's
> stupid.  We can fix that case with better pushdown logic.  We can also
> fix the case of N-way join nests where the joins are either on the
> partitioning key or to replicated tables.  But suppose you have a join
> between two tables which are sharded across the cluster but not on the
> partitioning key.  There's no way to push the join down, so all the
> work comes back to the coordinator, which is possibly OK if such
> queries are rare, but not so hot if they are frequent.

Let me clearer about what the Citus Data paper shows.  I said originally
that the data was sent to the coordinator, sorted, then resent to the
shards, but the document:
https://goo.gl/vJWF85https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

has the shards create the groups and the groups are sent to the other
shards.  For example, to do COUNT(DISTINCT) if you have three shards,
then each shard breaks its data into 3 buckets (1B in size), then the
first bucket from each of the three shards goes to the first shard, and
the second bucket goes to the second shared, etc.

Basically, they are doing map-reduce, and the shards are creating
additional batches that get shipped to other shards.  I can see FDWs not
working well in that case as you are really creating a new data layout
just for the query.  This explains why the XC/XL people are saying they
would use FDWs if they existed at the time they started development,
while the Citus Data people are saying they couldn't use FDWs as they
currently exist.  They probably both needed FDW improvements, but I
think the Citus Data features would need a lot more.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Peter Geoghegan
Date:
On Tue, Sep 1, 2015 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> You're assuming that our primary bottleneck for writes is IO.  It's not
>> at present for most users, and it certainly won't be in the future.  You
>> need to move your thinking on systems resources into the 21st century,
>> instead of solving the resource problems from 15 years ago.
>
> Your experience doesn't match mine.  I find that it's frequently
> impossible to get the system to use all of the available CPU capacity,
> either because you're bottlenecked on locks or because you are
> bottlenecked on the  I/O subsystem, and with the locking improvements
> in newer versions, the former is becoming less and less common.

I think you're both right. I think that we need to fix the buffer
manager, to make its caching algorithm smarter. Since we're mostly
using the filesystem cache, this is particularly important for
PostgreSQL. We need to remember usage information for evicted blocks
for some period of time afterwards. This problem is largely a problem
with Postgres in particular, I suspect.

At the same time, I agree with Josh's assessment that long-term, we
are going to have the biggest problem with memory latency and memory
bandwidth, which are usually considered facets of CPU performance, and
with internal lock contention. Addressing the memory access bottleneck
dovetails with parallelism, in that it must be considered alongside
parallelism. Josh's "21st century" remark seems quite justified to me.
For a further example of this, check out my latest progress with
external sorting, which I plan to post later in the week. I/O isn't
the big problem there at all, and I now think we can make external
sorts close to internal sorts in performance across the board.

I imagine that Josh's experience is based on workloads that mostly fit
in shared_buffers, so I can see why you'd disagree if that was
something you've seen less of. I'll quote Hellerstein and Stonebraker
in 2007 [1]:

"Copying data in memory can be a serious bottleneck. Copies contribute
latency, consume CPU cycles, and can flood the CPU data. This fact is
often a surprise to people who have not operated or implemented a
database system, and assume that main-memory operations are “free”
compared to disk I/O. But in practice, throughput in a well-tuned
transaction processing DBMS is typically not I/O-bound. This is
achieved in high-end installations by purchasing sufficient disks and
RAM so that repeated page requests are absorbed by the buffer pool,
and disk I/Os are shared across the disk arms at a rate that can feed
the data appetite of all the processors in the system."

>> Our real future bottlenecks are:
>>
>> * ability to handle more than a few hundred connections
>> * locking limits on the scalability of writes
>> * ability to manage large RAM and data caches
>
> I do agree that all of those things are problems, FWIW.

These seem like our long term problems, to me.

[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf, page 213
--
Peter Geoghegan



Re: Horizontal scalability/sharding

From
Petr Jelinek
Date:
On 2015-09-02 00:09, Josh Berkus wrote:
> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>> For example assume that there are just two shards in two separate data
>> centers, connected by a link with limited bandwidth. Now, let's assume
>> you always keep a local replica for failover. So you have A1+A2 in DC1,
>> B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
>> have to write data to B2 and wait for it. So either you send the data to
>> each node separately (consuming 2x the bandwidth), or send it to B1 and
>> let it propagate to B2 e.g. through sync rep.
>>
>> So while you may be right in single-DC deployments, with multi-DC
>> deployments the situation is quite different - not only that the network
>> bandwidth is not unlimited, but because latencies within DC may be a
>> fraction of latencies between the locations (to the extent that the
>> increase due to syncrep may be just noise). So the local replication may
>> be actually way faster.
>
> I'm not seeing how the above is better using syncrep than using shard
> copying?

Shard copying usually assumes that the origin node does the copy - the
data has to go twice through the slow connection. With replication you
can replicate locally over fast connection.

>
>> I can imagine forwarding the data between B1 and B2 even with a purely
>> sharding solution, but at that point you effectively re-implemented
>> syncrep.
>
> Not really, the mechanism is different and the behavior is different.
> One critical deficiency in using binary syncrep is that you can't do
> round-robin redundancy at all; every redundant node has to be an exact
> mirror of another node.  In a good HA distributed system, you want
> multiple shards per node, and you want each shard to be replicated to a
> different node, so that in the event of node failure you're not dumping
> the full load on one other server.
>

This assumes that we use binary replication, but we can reasonably use
logical replication which can quite easily do filtering of what's
replicated where.

>> IMHO the design has to address the multi-DC setups somehow. I think that
>> many of the customers who are so concerned about scaling to many shards
>> are also concerned about availability in case of DC outages, no?
>
> Certainly.  But users located in a single DC shouldn't pay the same
> overhead as users who are geographically spread.
>

Agreed, so we should support both ways, but I don't think it's necessary
to support both ways in version 0.1. It's just important to not paint
ourselves into a corner with design decisions that would make one of the
ways impossible.


>>> If we write to multiple copies as a part of the sharding feature,
>>> then that can be parallelized, so that we are waiting only as long as
>>> the slowest write (or in failure cases, as long as the shard
>>> timeout). Further, we can check for shard-copy health and update
>>> shard availability data with each user request, so that the ability
>>> to see stale/bad data is minimized.
>>
>> Again, this assumes infinite network bandwidth.
>
> In what way is the total network bandwitdh used in the system different
> for shard copying than for sync replication?
>

Again, when shards are distributed over multiple DCs (or actually even
multiple racks) the bandwidth and latency of local copy will be much
better then the one of the remote copy so the local replication can have
much lower impact on the cluster performance than remote shard copy will.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
> > >> Uh, we already have a list of things we need to add to FDWs to make them
> > >> work, and Citus Data has provided a document of more things that are
> > >> needed, https://goo.gl/vJWF85.  I am not sure how much bigger a red flag
> > >> you want to confirm that everyone agrees that major FDW improvements are
> > >> a requirement for this.
> > >
> > > Several people saying that the FDW infrastructure isn't sufficient right
> > > now is pretty far from implying that all of them agree that the FDW API
> > > is the way to go.
> > >
> > > I'm not sure myself. If it works out it's going to save us some work and
> > > make it more realistic to get there sometime not too far off. But I'm
> > > afraid that the resulting system will feel like our current partitioning
> > > implemenentation. Yes, it kinda works, but it's hard to get started, it
> > > doesn't support too many features and you're kind afraid your relatives
> > > will see what you've done.
> >
> > I'm not averse to making the "connect to the remote nodes" part of
> > this solution use something other than the FDW infrastructure at some
> > point in time if somebody's prepared to build something better.  On
> > the other hand, I think it's extremely clear that the FDW
> > infrastructure has a large amount of potential upon which we have
> > thoroughly failed to capitalize.  Patches have already been written
> > for UPDATE/DELETE pushdown and for join pushdown.

Will pushing down writes (Update/Delete) sufficient to maintain sane locking
behaviour and deadlock detection that can occur during writes on multiple
shards?  For example it could easily be the case where a single Update
statement could effect multiple shards and cause deadlock due to waits
across the nodes.  Now unless we have some distributed lock manager or
some other way to know the information of locks that happens across
shards, it could be difficult to detect deadlocks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
On Tue, Sep 1, 2015 at 4:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
>
> The document opens a big question --- when queries can't be processed in
> a traditional top/down fashion, Citus has the goal of sending groups of
> results up the the coordinator, reordering them, then sending them back
> to the shards for further processing, basically using the shards as
> compute engines because the shards are no longer using local data to do
> their computations.  The two examples they give are COUNT(DISTINCT) and
> a join across two sharded tables ("CANADA").
>
> I assumed these queries were going to be solved by sending as digested
> data as possible to the coordinator, and having the coordinator complete
> any remaining processing.  I think we are going to need to decide if
> such "sending data back to shards" is something we are ever going to
> implement.  I can see FDWs _not_ working well for that use-case.
>

Here one related point to think is how do we envision to handle statement
requests, do we want to have centeralized coordinator which will process
all requests or the requests could be received by any node?
I think both kind of systems have their own pros and cons like if we want
to have centralized coordinator kind of system, then it might be limited
by the number of simultaneous requests it can handle and if go other way
like allow requests to be processed by each individual nodes, then we
have to think about replicating all meta-data on all nodes.

I think Collecting statistics about different objects is another thing which
can differ depending on the strategy we choose to allow requests.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Amit Langote
Date:
On 2015-09-02 PM 01:28, Amit Kapila wrote:
>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> I'm not averse to making the "connect to the remote nodes" part of
>>> this solution use something other than the FDW infrastructure at some
>>> point in time if somebody's prepared to build something better.  On
>>> the other hand, I think it's extremely clear that the FDW
>>> infrastructure has a large amount of potential upon which we have
>>> thoroughly failed to capitalize.  Patches have already been written
>>> for UPDATE/DELETE pushdown and for join pushdown.
> 
> Will pushing down writes (Update/Delete) sufficient to maintain sane locking
> behaviour and deadlock detection that can occur during writes on multiple
> shards?  For example it could easily be the case where a single Update
> statement could effect multiple shards and cause deadlock due to waits
> across the nodes.  Now unless we have some distributed lock manager or
> some other way to know the information of locks that happens across
> shards, it could be difficult to detect deadlocks.
> 

I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...

Thanks,
Amit




Re: Horizontal scalability/sharding

From
Etsuro Fujita
Date:
On 2015/09/02 14:28, Amit Langote wrote:
> On 2015-09-02 PM 01:28, Amit Kapila wrote:
>>> On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I'm not averse to making the "connect to the remote nodes" part of
>>>> this solution use something other than the FDW infrastructure at some
>>>> point in time if somebody's prepared to build something better.  On
>>>> the other hand, I think it's extremely clear that the FDW
>>>> infrastructure has a large amount of potential upon which we have
>>>> thoroughly failed to capitalize.  Patches have already been written
>>>> for UPDATE/DELETE pushdown and for join pushdown.

>> Will pushing down writes (Update/Delete) sufficient to maintain sane locking
>> behaviour and deadlock detection that can occur during writes on multiple
>> shards?  For example it could easily be the case where a single Update
>> statement could effect multiple shards and cause deadlock due to waits
>> across the nodes.  Now unless we have some distributed lock manager or
>> some other way to know the information of locks that happens across
>> shards, it could be difficult to detect deadlocks.

> I wonder if Ashutosh's atomic foreign transactions patch would address any
> issues inherent in such cases...

The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane 
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down 
UPDATE/DELETE does, because inheritance_planner guarantees that all 
backends lock inheritance children in the same order to avoid needless 
deadlocks.

Best regards,
Etsuro Fujita




Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:
On 2015/09/02 14:28, Amit Langote wrote:
On 2015-09-02 PM 01:28, Amit Kapila wrote:
On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better.  On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize.  Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.

Will pushing down writes (Update/Delete) sufficient to maintain sane locking
behaviour and deadlock detection that can occur during writes on multiple
shards?  For example it could easily be the case where a single Update
statement could effect multiple shards and cause deadlock due to waits
across the nodes.  Now unless we have some distributed lock manager or
some other way to know the information of locks that happens across
shards, it could be difficult to detect deadlocks.

I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...

The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down UPDATE/DELETE does, because inheritance_planner guarantees that all backends lock inheritance children in the same order to avoid needless deadlocks.


Will it be able to do it for row level locks, row level locking occurs
during updation of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions. Consider
a case as below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2 
2. Updates row R1 of T1 on shard S1



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Amit Langote
Date:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
>>
>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>> backends lock inheritance children in the same order to avoid needless
>> deadlocks.
>>
>>
> Will it be able to do it for row level locks, row level locking occurs
> during updation of a row, so will it be possible to ensure the order of
> locks on rows?
> 
> Will it handle deadlocks across different table partitions. Consider
> a case as below:
> 
> T1
> 1. Updates row R1 of T1 on shard S1
> 2. Updates row R2 of T2 on shard S2
> 
> T2
> 1. Updates row R2 of T2 on shard S2
> 2. Updates row R1 of T1 on shard S1
> 

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

Thanks,
Amit




Re: Horizontal scalability/sharding

From
Etsuro Fujita
Date:
On 2015/09/02 15:40, Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
>>> The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
>>> behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
>>> UPDATE/DELETE does, because inheritance_planner guarantees that all
>>> backends lock inheritance children in the same order to avoid needless
>>> deadlocks.

>> Will it be able to do it for row level locks, row level locking occurs
>> during updation of a row, so will it be possible to ensure the order of
>> locks on rows?

>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1

> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

Yeah, I thinks so, too.

Sorry, maybe my explanation above was not enough, but in the inherted 
UPDATEs/DELETEs, the table modification is also ensured to be done in 
the same order.  So, as Amit Langote said, both transactions will do the 
updates in the same order.

Best regards,
Etsuro Fujita




Re: Horizontal scalability/sharding

From
Albe Laurenz
Date:
Amit Langote wrote:
> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>> Will it handle deadlocks across different table partitions. Consider
>> a case as below:
>>
>> T1
>> 1. Updates row R1 of T1 on shard S1
>> 2. Updates row R2 of T2 on shard S2
>>
>> T2
>> 1. Updates row R2 of T2 on shard S2
>> 2. Updates row R1 of T1 on shard S1

> As long as shards are processed in the same order in different
> transactions, ISTM, this issue should not arise? I can imagine it becoming
> a concern if parallel shard processing enters the scene. Am I missing
> something?

That would only hold for a single query, right?

If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.

So T1 and T2 could deadlock.

Yours,
Laurenz Albe

Re: Horizontal scalability/sharding

From
Amit Langote
Date:
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
> Amit Langote wrote:
>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>> Will it handle deadlocks across different table partitions. Consider
>>> a case as below:
>>>
>>> T1
>>> 1. Updates row R1 of T1 on shard S1
>>> 2. Updates row R2 of T2 on shard S2
>>>
>>> T2
>>> 1. Updates row R2 of T2 on shard S2
>>> 2. Updates row R1 of T1 on shard S1
> 
>> As long as shards are processed in the same order in different
>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>> a concern if parallel shard processing enters the scene. Am I missing
>> something?
> 
> That would only hold for a single query, right?
> 
> If 1. and 2. in the above example come from different queries within one
> transaction, you cannot guarantee that shards are processed in the same order.
> 
> So T1 and T2 could deadlock.
> 

Sorry, I failed to see why that would be the case. Could you elaborate?

Thanks,
Amit




Re: Horizontal scalability/sharding

From
Etsuro Fujita
Date:
On 2015/09/02 16:40, Amit Langote wrote:
> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>> Amit Langote wrote:
>>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>>> Will it handle deadlocks across different table partitions. Consider
>>>> a case as below:
>>>>
>>>> T1
>>>> 1. Updates row R1 of T1 on shard S1
>>>> 2. Updates row R2 of T2 on shard S2
>>>>
>>>> T2
>>>> 1. Updates row R2 of T2 on shard S2
>>>> 2. Updates row R1 of T1 on shard S1
>>
>>> As long as shards are processed in the same order in different
>>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>>> a concern if parallel shard processing enters the scene. Am I missing
>>> something?
>>
>> That would only hold for a single query, right?
>>
>> If 1. and 2. in the above example come from different queries within one
>> transaction, you cannot guarantee that shards are processed in the same order.
>>
>> So T1 and T2 could deadlock.

> Sorry, I failed to see why that would be the case. Could you elaborate?

I think Laurenz would assume that the updates 1. and 2. in the above 
transactions are performed *in a non-inherited manner*.  If that's 
right, T1 and T2 could deadlock, but I think we assume here to run 
transactions over shards *in an inherited manner*.

Best regards,
Etsuro Fujita




Re: Horizontal scalability/sharding

From
Albe Laurenz
Date:
Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>> Amit Langote wrote:
>>>> On 2015-09-02 PM 03:25, Amit Kapila wrote:
>>>>> Will it handle deadlocks across different table partitions. Consider
>>>>> a case as below:
>>>>>
>>>>> T1
>>>>> 1. Updates row R1 of T1 on shard S1
>>>>> 2. Updates row R2 of T2 on shard S2
>>>>>
>>>>> T2
>>>>> 1. Updates row R2 of T2 on shard S2
>>>>> 2. Updates row R1 of T1 on shard S1
>>>
>>>> As long as shards are processed in the same order in different
>>>> transactions, ISTM, this issue should not arise? I can imagine it becoming
>>>> a concern if parallel shard processing enters the scene. Am I missing
>>>> something?
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's
> right, T1 and T2 could deadlock, but I think we assume here to run
> transactions over shards *in an inherited manner*.

Yes, but does every update affect all shards?

If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?

Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?

But maybe I'm missing something fundamental.

Yours,
Laurenz Albe

Re: Horizontal scalability/sharding

From
Amit Langote
Date:
On 2015-09-02 PM 05:07, Etsuro Fujita wrote:
> On 2015/09/02 16:40, Amit Langote wrote:
>> On 2015-09-02 PM 04:07, Albe Laurenz wrote:
>>>
>>> That would only hold for a single query, right?
>>>
>>> If 1. and 2. in the above example come from different queries within one
>>> transaction, you cannot guarantee that shards are processed in the same
>>> order.
>>>
>>> So T1 and T2 could deadlock.
> 
>> Sorry, I failed to see why that would be the case. Could you elaborate?
> 
> I think Laurenz would assume that the updates 1. and 2. in the above
> transactions are performed *in a non-inherited manner*.  If that's right,
> T1 and T2 could deadlock, but I think we assume here to run transactions
> over shards *in an inherited manner*.
> 

I think Albe may have a point here...

Even inherited updates case appears to cause a deadlock if they are in
different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);

INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;
<ok>

-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;
<ok>

-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;
<waits>

-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;
<deadlock is detected>


Thanks,
Amit




Re: Horizontal scalability/sharding

From
Ashutosh Bapat
Date:


On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
>> We want multiple copies of shards created by the sharding system itself.
>>   Having a separate, and completely orthagonal, redundancy system to the
>> sharding system is overly burdensome on the DBA and makes low-data-loss
>> HA impossible.
>
> IMHO it'd be quite unfortunate if the design would make it impossible to
> combine those two features (e.g. creating standbys for shards and
> failing over to them).
>
> It's true that solving HA at the sharding level (by keeping multiple
> copies of a each shard) may be simpler than combining sharding and
> standbys, but I don't see why it makes low-data-loss HA impossible.

Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.

In the case of async rep, every time we fail over a node, the entire
cluser would need to roll back to the last common known-good replay
point, hence high data loss.

In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.

Futher, if using replication the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.

If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.

XC (and I guess XL, pgPool II as well) did this by firing same DML statement to all the copies after resolving any volatile references  (e.g. now()) in DML, so that all the copies get the same values. That method however needed some row identifier which can identify same row on all the replicas. Primary key is used as row identifier usually, but not all use cases which require shards to be replicated have primary key in their sharded tables.
 

There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit.  But I really think that
solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Horizontal scalability/sharding

From
Amit Langote
Date:
On 2015-09-02 PM 06:41, Amit Langote wrote:
> 
> I think Albe may have a point here...
> 
> Even inherited updates case appears to cause a deadlock if they are in
> different queries. Demonstrated below:
> 
> -- setup
> CREATE TABLE t(a int);
> CREATE TABLE t1() INHERITS(t);
> CREATE TABLE t2() INHERITS(t);
> 
> INSERT INTO t1 VALUES (1);
> INSERT INTO t2 VALUES (2);
> 
> -- in session 1
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 1;
> <ok>
> 
> -- in session 2
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 2;
> <ok>
> 
> -- back in session 1
> UPDATE t SET a = a + 1 WHERE a = 2;
> <waits>
> 
> -- back in session 2
> UPDATE t SET a = a + 1 WHERE a = 1;
> <deadlock is detected>
> 

Which, I now realize, is not the worry Amit Kapila's expresses.

The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. DLM (distributed lock manager/deadlock detector)
seems indeed necessary as Amit K. suspects.

Thanks,
Amit




Re: Horizontal scalability/sharding

From
Pavan Deolasee
Date:


On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-09-02 PM 06:41, Amit Langote wrote:
>
> I think Albe may have a point here...
>
> Even inherited updates case appears to cause a deadlock if they are in
> different queries. Demonstrated below:
>
> -- setup
> CREATE TABLE t(a int);
> CREATE TABLE t1() INHERITS(t);
> CREATE TABLE t2() INHERITS(t);
>
> INSERT INTO t1 VALUES (1);
> INSERT INTO t2 VALUES (2);
>
> -- in session 1
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 1;
> <ok>
>
> -- in session 2
> BEGIN;
> UPDATE t SET a = a + 1 WHERE a = 2;
> <ok>
>
> -- back in session 1
> UPDATE t SET a = a + 1 WHERE a = 2;
> <waits>
>
> -- back in session 2
> UPDATE t SET a = a + 1 WHERE a = 1;
> <deadlock is detected>
>

Which, I now realize, is not the worry Amit Kapila's expresses.

The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. DLM (distributed lock manager/deadlock detector)
seems indeed necessary as Amit K. suspects.


Right. XC/XL did not address this issue and they rely on statement timeouts to break distributed deadlocks.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> On 2015-09-02 PM 06:41, Amit Langote wrote:
>>
>> Which, I now realize, is not the worry Amit Kapila's expresses.
>>
>> The deadlock was *indeed detected* in this case, with all the locks in the
>> same PG instance. In a sharded environment with multiple PG instances,
>> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>> seems indeed necessary as Amit K. suspects.
>>
>
> Right. XC/XL did not address this issue and they rely on statement timeouts to break distributed deadlocks.
>

I think that will be difficult for application to decide and then it
needs to decide the same for all statements which is tricky because
different statements could take different time.  I think it is better to
have solution for this problem and deadlock should be detected.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Etsuro Fujita
Date:
On 2015/09/02 20:42, Amit Kapila wrote:
> On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com
> <mailto:pavan.deolasee@gmail.com>> wrote:
>  > On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp <mailto:Langote_Amit_f8@lab.ntt.co.jp>>
> wrote:
>  >> On 2015-09-02 PM 06:41, Amit Langote wrote:
>  >> Which, I now realize, is not the worry Amit Kapila's expresses.

>  >> The deadlock was *indeed detected* in this case, with all the locks
> in the
>  >> same PG instance. In a sharded environment with multiple PG instances,
>  >> that becomes tricky. DLM (distributed lock manager/deadlock detector)
>  >> seems indeed necessary as Amit K. suspects.

Ah, you are right.

>  > Right. XC/XL did not address this issue and they rely on statement
> timeouts to break distributed deadlocks.

> I think that will be difficult for application to decide and then it
> needs to decide the same for all statements which is tricky because
> different statements could take different time.  I think it is better to
> have solution for this problem and deadlock should be detected.

+1

Best regards,
Etsuro Fujita




Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Tue, Sep  1, 2015 at 06:11:45PM -0400, Bruce Momjian wrote:
> Let me clearer about what the Citus Data paper shows.  I said originally
> that the data was sent to the coordinator, sorted, then resent to the
> shards, but the document:
> 
>     https://goo.gl/vJWF85
>     https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
> 
> has the shards create the groups and the groups are sent to the other
> shards.  For example, to do COUNT(DISTINCT) if you have three shards,
> then each shard breaks its data into 3 buckets (1B in size), then the
> first bucket from each of the three shards goes to the first shard, and
> the second bucket goes to the second shared, etc.
> 
> Basically, they are doing map-reduce, and the shards are creating
> additional batches that get shipped to other shards.  I can see FDWs not
> working well in that case as you are really creating a new data layout
> just for the query.  This explains why the XC/XL people are saying they
> would use FDWs if they existed at the time they started development,
> while the Citus Data people are saying they couldn't use FDWs as they
> currently exist.  They probably both needed FDW improvements, but I
> think the Citus Data features would need a lot more.

To expand on this, using FDWs, it means each shard would create a
temporary table on the other shards and send some if its data to those
shards.  Once a shard gets all its data from the other shards, it will
process the data and send the result to the collector.

That certainly seems like something FDWs would not do well.  Frankly, I
am unclear how Citus Data was able to do this with only backend hooks.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Oleg Bartunov
Date:


On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> My worry is that if we start implementing them again from scratch, it will
> take a few years before we get them in a usable state. What XC/XL lacked is
> probably a Robert Haas or a Tom Lane who could look at the work and suggest
> major edits. If that had happened, the quality of the product could have
> been much better today. I don't mean to derate the developers who worked on
> XC/XL, but there is no harm in accepting that if someone with a much better
> understanding of the whole system was part of the team, that would have
> positively impacted the project. Is that an angle worth exploring? Does it
> make sense to commit some more resources to say XC or XL and try to improve
> the quality of the product even further? To be honest, XL is in far far
> better shape (haven't really tried XC in a while) and some more QA/polishing
> can make it production ready much sooner.

From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be.  If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it.  If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them.  However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL.  The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good.  The reality is,
also, that horizontal scalability isn't an optional feature.  There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core.  That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.

Agree with you, Robert.

One lesson from XL we got is that we need testing framework for cluster, so any cluster project should at least pass functional and performance testing. XL was very easy to break and I'm wondering how many corner cases still exists. We tried several other approaches and while reading the papers was a fun, in practice we found many devil details, which made the paper be just a paper. 

 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> On 2015-09-02 00:09, Josh Berkus wrote:
>> On 09/01/2015 02:29 PM, Tomas Vondra wrote:
>>> So while you may be right in single-DC deployments, with multi-DC
>>> deployments the situation is quite different - not only that the network
>>> bandwidth is not unlimited, but because latencies within DC may be a
>>> fraction of latencies between the locations (to the extent that the
>>> increase due to syncrep may be just noise). So the local replication may
>>> be actually way faster.
>>
>> I'm not seeing how the above is better using syncrep than using shard
>> copying?
> 
> Shard copying usually assumes that the origin node does the copy - the
> data has to go twice through the slow connection. With replication you
> can replicate locally over fast connection.

Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.

Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication.  We don't have that at
present (sync rep is whole-node, and BDR is asynchronous).  There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.

>> Not really, the mechanism is different and the behavior is different.
>> One critical deficiency in using binary syncrep is that you can't do
>> round-robin redundancy at all; every redundant node has to be an exact
>> mirror of another node.  In a good HA distributed system, you want
>> multiple shards per node, and you want each shard to be replicated to a
>> different node, so that in the event of node failure you're not dumping
>> the full load on one other server.
>>
> 
> This assumes that we use binary replication, but we can reasonably use
> logical replication which can quite easily do filtering of what's
> replicated where.

Is there a way to do logical synchronous replication?  I didn't think
there was.

>>> IMHO the design has to address the multi-DC setups somehow. I think that
>>> many of the customers who are so concerned about scaling to many shards
>>> are also concerned about availability in case of DC outages, no?
>>
>> Certainly.  But users located in a single DC shouldn't pay the same
>> overhead as users who are geographically spread.
>>
> 
> Agreed, so we should support both ways, but I don't think it's necessary
> to support both ways in version 0.1. It's just important to not paint
> ourselves into a corner with design decisions that would make one of the
> ways impossible.

Exactly!

Let me explain why I'm so vocal on this point.  PostgresXC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened).  I don't want to see a repeat of that mistake.

Even if it's only on paper, any new sharding design needs to address
these questions:

1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?  2. a. how do we allow an out-of-sync node to
"catchup"?
 
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?

There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems.  Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Merlin Moncure
Date:
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It would be a bad idea to cling blindly to the FDW infrastructure if
> it's fundamentally inadequate to do what we want.  On the other hand,
> it would also be a bad idea to set about recreating it without a
> really good reason, and - just to take one example - the fact that it
> doesn't currently push down DML operations to the remote side is not a
> really good reason to rewrite the whole thing.  On the contrary, it's
> a reason to put some energy into the already-written patch which
> implements that optimization.

The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API.  There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after the fact processing.  This exactly what citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that aggregation can be properly weighted to the result.   To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL compatible adjustments to the query.

This strikes me as a bit of a conflict of interest with FDW which
seems to want to hide the fact that it's foreign; the FDW
implementation makes it's own optimization decisions which might make
sense for single table queries but breaks down in the face of joins.

merlin



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It would be a bad idea to cling blindly to the FDW infrastructure if
>> it's fundamentally inadequate to do what we want.  On the other hand,
>> it would also be a bad idea to set about recreating it without a
>> really good reason, and - just to take one example - the fact that it
>> doesn't currently push down DML operations to the remote side is not a
>> really good reason to rewrite the whole thing.  On the contrary, it's
>> a reason to put some energy into the already-written patch which
>> implements that optimization.
>
> The problem with FDW for these purposes as I see it is that too much
> intelligence is relegated to the implementer of the API.  There needs
> to be a mechanism so that the planner can rewrite the remote query and
> then do some after the fact processing.  This exactly what citus does;
> if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
> so that aggregation can be properly weighted to the result.   To do
> this (distributed OLAP-type processing) right, the planner needs to
> *know* that this table is in fact distributed and also know that it
> can make SQL compatible adjustments to the query.
>
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes it's own optimization decisions which might make
> sense for single table queries but breaks down in the face of joins.

Well, I don't think that ALL of the logic should go into the FDW.  In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Even if it's only on paper, any new sharding design needs to address
> these questions:
>
> 1. How do we ensure no/minimal data is lost if we lose a node?
> 2. How do we replace a lost node (without taking the cluster down)?
>    2. a. how do we allow an out-of-sync node to "catch up"?
> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
> 4. How do we add nodes to expand the cluster?
>
> There doesn't need to be code for all of the above from version 0.1, but
> there needs to be a plan to tackle those problems.  Otherwise, we'll
> just end up with another dead-end, not-useful-in-production technology.

This is a good point, and I think I agree with it.  Let me make a few
observations:

1. None of this stuff matters very much when the data is strictly
read-only.  You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't.  You
replace a lost node by taking anew copy.  Nodes never need to catch up
because there are no changes happening.  To make bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime).  So most of these concerns are about how to handle writes.

2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.  The issue arises when you have multiple copies of the data,
and the data is being changed.  Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.

3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants.  Then it
generates an SQL statement to be sent to the data node to make the
appropriate change.  If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node.  I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good.  Fundamentally, this model doesn't
allow for many good options in that case.

4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous.  When you modify
one copy of the data, that change will then be replicated to all other
nodes.  If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line.  If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits.  Also, we don't have any way for
synchronous replication to wait for multiple nodes.  But in theory
those seem like limitations that can be lifted.  Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT.  That too
seems like a problem that can be solved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/02/2015 11:41 AM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Even if it's only on paper, any new sharding design needs to address
>> these questions:
>>
>> 1. How do we ensure no/minimal data is lost if we lose a node?
>> 2. How do we replace a lost node (without taking the cluster down)?
>>    2. a. how do we allow an out-of-sync node to "catch up"?
>> 3. How do we maintain metadata about good/bad nodes (and shard locations)?
>> 4. How do we add nodes to expand the cluster?
>>
>> There doesn't need to be code for all of the above from version 0.1, but
>> there needs to be a plan to tackle those problems.  Otherwise, we'll
>> just end up with another dead-end, not-useful-in-production technology.
> 
> This is a good point, and I think I agree with it.  Let me make a few
> observations:
> 
> 1. None of this stuff matters very much when the data is strictly
> read-only. 

Yep.

> 2. None of this stuff matters when you only have one copy of the data.
> Your system is low-availability, but you just don't care for whatever
> reason. 

Uh-huh.

> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.

For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader).  So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".

> But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.I'm not a big fan of a GTM at all, frankly; it
makesclusters much
 
harder to set up, and becomes a SPOF.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.
>
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.

Yep.  It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions.  Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

How is that different from the previous thing?  Just that we'd treat
"lagging" as "down" beyond some threshold?  That doesn't seem like a
mandatory feature.

>> But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.

I don't think so.  If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there.  That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read.  It's
complicated, but it doesn't seem intractable.

>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
> harder to set up, and becomes a SPOF.

I partially agree.  I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component.  People who don't want it shouldn't have to
pay the price in performance and administrative complexity.  But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:

On 09/02/2015 08:27 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com>
> wrote:
>>
>> This strikes me as a bit of a conflict of interest with FDW which
>> seems to want to hide the fact that it's foreign; the FDW
>> implementation makes it's own optimization decisions which might
>> make sense for single table queries but breaks down in the face of
>> joins.

+1 to these concerns

> Well, I don't think that ALL of the logic should go into the FDW.

Then maybe we shouldn't call this "FDW-based sharding" (or "FDW 
approach" or whatever was used in this thread so far) because that kinda 
implies that the proposal is to build on FDW.

In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.

The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.

Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.

And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.

I'm not opposed to reusing the FDW infrastructure, of course.
> In particular, in this example, parallel aggregate needs the same> query rewrite, so the logic for that should live
incore so that> both parallel and distributed queries get the benefit.
 

I'm not sure the parallel query is a great example here - maybe I'm 
wrong but I think it's a fairly isolated piece of code, and we have 
pretty clear idea of the two use cases.

I'm sure it's non-trivial to design it well for both cases, but I think 
the questions for FWD/sharding will be much more about abstract concepts 
than particular technical solutions.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/02/2015 12:30 PM, Robert Haas wrote:
> On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> 4. Therefore, I think that we should instead use logical replication,
>>> which might be either synchronous or asynchronous.  When you modify
>>> one copy of the data, that change will then be replicated to all other
>>> nodes.  If you are OK with eventual consistency, this replication can
>>> be asynchronous, and nodes that are off-line will catch up when they
>>> are on-line.  If you are not OK with that, then you must replicate
>>> synchronously to every node before transaction commit; or at least you
>>> must replicate synchronously to every node that is currently on-line.
>>> This presents some challenges: logical decoding currently can't
>>> replicate transactions that are still in process - replication starts
>>> when the transaction commits.  Also, we don't have any way for
>>> synchronous replication to wait for multiple nodes.
>>
>> Well, there is a WIP patch for that, which IMHO would be much improved
>> by having a concrete use-case like this one.  What nobody is working on
>> -- and we've vetoed in the past -- is a way of automatically failing and
>> removing from replication any node which repeatedly fails to sync, which
>> would be a requirement for this model.
> 
> Yep.  It's clear to me we need that in general, not just for sharding.
> To me, the key is to make sure there's a way for the cluster-ware to
> know about the state transitions.  Currently, when the synchronous
> standby changes, PostgreSQL doesn't tell anyone.  That's a problem.

There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log.  If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.

> 
>> You'd also need a way to let the connection nodes know when a replica
>> has fallen behind so that they can be taken out of
>> load-balancing/sharding for read queries.  For the synchronous model,
>> that would be "fallen behind at all"; for asynchronous it would be
>> "fallen more than ### behind".
> 
> How is that different from the previous thing?  Just that we'd treat
> "lagging" as "down" beyond some threshold?  That doesn't seem like a
> mandatory feature.

It's a mandatory feature if you want to load-balance reads.  We have to
know which nodes not to send reads to because they are out of sync.

>> Yeah?  I'd assume that a GTM would be antithetical to two-stage copying.
> 
> I don't think so.  If transaction A writes data on X which is
> replicated to Y and then commits, a new snapshot which shows A as
> committed can't be used on Y until A's changes have been replicated
> there.  That could be enforced by having the commit of A wait for
> replication, or by having an attempt by a later transaction to use the
> snapshot on Y wait until replication completes, or some even more
> sophisticated strategy that considers whether the replication backlog
> touches the same data that the new transaction will read.  It's
> complicated, but it doesn't seem intractable.

I need to see this on a chalkboard to understand it.

>>  I'm not a big fan of a GTM at all, frankly; it makes clusters much
>> harder to set up, and becomes a SPOF.
> 
> I partially agree.  I think it's very important that the GTM is an
> optional feature of whatever we end up with, rather than an
> indispensable component.  People who don't want it shouldn't have to
> pay the price in performance and administrative complexity.  But at
> the same time, I think a lot of people will want it, because without
> it, the fact that sharding is in use is much less transparent to the
> application.

If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure.  That would be nice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.  But in theory
> those seem like limitations that can be lifted.  Also, the GTM needs
> to be aware that this stuff is happening, or it will DTWT.  That too
> seems like a problem that can be solved.

Can you explain why logical replication is better than binary
replication for this use-case?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Wed, Sep  2, 2015 at 12:03:36PM -0700, Josh Berkus wrote:
> Well, there is a WIP patch for that, which IMHO would be much improved
> by having a concrete use-case like this one.  What nobody is working on
> -- and we've vetoed in the past -- is a way of automatically failing and
> removing from replication any node which repeatedly fails to sync, which
> would be a requirement for this model.
> 
> You'd also need a way to let the connection nodes know when a replica
> has fallen behind so that they can be taken out of
> load-balancing/sharding for read queries.  For the synchronous model,
> that would be "fallen behind at all"; for asynchronous it would be
> "fallen more than ### behind".

I think this gets back to the idea of running an administrative alert
command when we switch to using a different server for
synchronous_standby_names.  We can't just keep requiring external
tooling to identify things that the database knows easily and can send
an alert.  Removing failed nodes is also something we should do and
notify users about.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.  But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Can you explain why logical replication is better than binary
> replication for this use-case?

Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.

Typically, sharding solutions store multiple copies of each piece of
data.  So let's say you have 4 machines.  You divide the data into 12
chunks.  Each machine is the write-master for 2 of those chunks, but
has secondary copies of 3 others.  So maybe things start out like
this:

machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9

If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3.  The write load therefore remains
evenly divided.  If you can only copy entire machines, you can't
achieve that in this situation.

I'm not saying that the above is exactly what we're going to end up
with, or even necessarily close.  But a big part of the point of
sharding is that not all the machines have the same data - otherwise
you are not write scaling.  But it will frequently be the case, for
various reasons, that they have *overlapping* sets of data.  Logical
replication can handle that; physical replication can't.

In Postgres-XC, all tables are either sharded (part of the table is
present on each node) or distributed (all of the table is present on
every node).  Clearly, there's no way to use physical replication in
that scenario except if you are OK with having two copies of every
node.  But that's not a very good solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Michael Paquier
Date:
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote:
> 3. IIUC, Postgres-XC handles this problem by reducing at least
> volatile functions, maybe all functions, to constants.  Then it
> generates an SQL statement to be sent to the data node to make the
> appropriate change.  If there's more than one copy of the data, we
> send a separate copy of the SQL statement to every node.  I'm not sure
> exactly what happens if some of those nodes are not available, but I
> don't think it's anything good.  Fundamentally, this model doesn't
> allow for many good options in that case.

I don't recall that. Immutable functions are switched to constants in
the query sent to datanodes. Volatile and stable functions are
evaluated locally after fetching the results from the remote node. Not
that efficient for warehouse loads. My 2c.

> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

That's something that the quorum synchronous patch would address.
Still, having the possibility to be synchronous across multiple nodes
does not seem like to be something at the top of the list.

> Also, the GTM needs to be aware that this stuff is happening, or it will DTWT.  That too seems like a problem that
canbe solved.
 

If I understood correctly, yes it is with its centralized transaction
facility each node is aware of the transaction status via the global
snapshot.
-- 
Michael



Re: Horizontal scalability/sharding

From
"Joshua D. Drake"
Date:
On 09/02/2015 03:56 PM, Bruce Momjian wrote:
> On Wed, Sep  2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
>> 4. Therefore, I think that we should instead use logical replication,
>> which might be either synchronous or asynchronous.  When you modify
>> one copy of the data, that change will then be replicated to all other
>> nodes.  If you are OK with eventual consistency, this replication can
>> be asynchronous, and nodes that are off-line will catch up when they
>> are on-line.  If you are not OK with that, then you must replicate
>> synchronously to every node before transaction commit; or at least you
>> must replicate synchronously to every node that is currently on-line.
>> This presents some challenges: logical decoding currently can't
>> replicate transactions that are still in process - replication starts
>> when the transaction commits.  Also, we don't have any way for
>> synchronous replication to wait for multiple nodes.  But in theory
>> those seem like limitations that can be lifted.  Also, the GTM needs
>> to be aware that this stuff is happening, or it will DTWT.  That too
>> seems like a problem that can be solved.
>
> Can you explain why logical replication is better than binary
> replication for this use-case?
>

Selectivity?

JD


-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> >Can you explain why logical replication is better than binary
> >replication for this use-case?
> >
> 
> Selectivity?

I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Wed, Sep  2, 2015 at 09:03:25PM -0400, Robert Haas wrote:
> > Can you explain why logical replication is better than binary
> > replication for this use-case?
> 
> Uh, well, for the same reasons it is better in many other cases.
> Particularly, you probably don't want to replicate all the data on
> machine A to machine B, just some of it.
> 
> Typically, sharding solutions store multiple copies of each piece of
> data.  So let's say you have 4 machines.  You divide the data into 12
> chunks.  Each machine is the write-master for 2 of those chunks, but
> has secondary copies of 3 others.  So maybe things start out like
> this:
> 
> machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
> machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
> machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
> machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
> 
> If machine #1 is run over by a rabid triceratops, you can make machine
> #2 the master for chunk 1, machine #3 the master for chunk 2, and
> machine #4 the master for chunk 3.  The write load therefore remains
> evenly divided.  If you can only copy entire machines, you can't
> achieve that in this situation.

I see the advantage of this now.  My original idea is that each shard
would have its own standby for disaster recovery, but your approach
above, which I know is typical, allows the shards to back up each other.
You could say shard 2 is the backup for shard 1, but then if shard one
goes bad, the entire workload of shard 1 goes to shard 2.  With the
above approach, the load of shard 1 is shared by all the shards.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Amit Kapila
Date:
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
> > >Can you explain why logical replication is better than binary
> > >replication for this use-case?
> > >
> >
> > Selectivity?
>
> I was assuming you would just create identical slaves to handle failure,
> rather than moving selected data around.
>

Yes, I also think so, otherwise when the shard goes down and it's replica
has to take the place of shard, it will take more time to make replica
available as it won't have all the data as original shard had. 


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Horizontal scalability/sharding

From
Pavan Deolasee
Date:


On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov <obartunov@gmail.com> wrote:



One lesson from XL we got is that we need testing framework for cluster, so any cluster project should at least pass functional and performance testing.

+1. In early XC days, we focused a lot on adding newer features and supporting as many PG features as possible. That took its toll on the testing and QA. It was a mistake though my feeling was we tried to correct that to some extend with XL. We did a 9.5 merge, which of course was a big deal, but other than more time is being spent on improving stability and performance

XL was very easy to break and I'm wondering how many corner cases still exists. 

Your team reported 2 or 3 major issues which I think we were able to fix quite quickly. But if there are more such issues which your team has recorded somewhere, I would request you to send them to the XL mailing list. I would definitely want to look at them and address them.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Horizontal scalability/sharding

From
Tatsuo Ishii
Date:
>> One lesson from XL we got is that we need testing framework for cluster,
>> so any cluster project should at least pass functional and performance
>> testing.
>>
> 
> +1. In early XC days, we focused a lot on adding newer features and
> supporting as many PG features as possible. That took its toll on the
> testing and QA. It was a mistake though my feeling was we tried to correct
> that to some extend with XL. We did a 9.5 merge, which of course was a big
> deal, but other than more time is being spent on improving stability and
> performance

Agreed. Any cluster project needs a cluster testing framework.

pgpool-II project runs "build farm" which runs cluster regression
tests every day. The tests includes several versions of pgpool-II and
PostgreSQL combinations using docker. Still it needs more tests but
even with limited test cases, it is pretty usefull to detect bugs.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Horizontal scalability/sharding

From
Petr Jelinek
Date:
On 2015-09-02 19:57, Josh Berkus wrote:
> On 09/01/2015 04:14 PM, Petr Jelinek wrote:
>> On 2015-09-02 00:09, Josh Berkus wrote:
>>> Not really, the mechanism is different and the behavior is different.
>>> One critical deficiency in using binary syncrep is that you can't do
>>> round-robin redundancy at all; every redundant node has to be an exact
>>> mirror of another node.  In a good HA distributed system, you want
>>> multiple shards per node, and you want each shard to be replicated to a
>>> different node, so that in the event of node failure you're not dumping
>>> the full load on one other server.
>>>
>>
>> This assumes that we use binary replication, but we can reasonably use
>> logical replication which can quite easily do filtering of what's
>> replicated where.
>
> Is there a way to do logical synchronous replication?  I didn't think
> there was.
>

Yes, the logical replication has similar syncrep properties as the
binary one (feedback works same way).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Thu, Sep  3, 2015 at 10:33:12AM +0200, Petr Jelinek wrote:
> On 2015-09-02 19:57, Josh Berkus wrote:
> > On 09/01/2015 04:14 PM, Petr Jelinek wrote:
> >> On 2015-09-02 00:09, Josh Berkus wrote:
> >>> Not really, the mechanism is different and the behavior is different.
> >>> One critical deficiency in using binary syncrep is that you can't do
> >>> round-robin redundancy at all; every redundant node has to be an exact
> >>> mirror of another node.  In a good HA distributed system, you want
> >>> multiple shards per node, and you want each shard to be replicated to a
> >>> different node, so that in the event of node failure you're not dumping
> >>> the full load on one other server.
> >>>
> >>
> >> This assumes that we use binary replication, but we can reasonably use
> >> logical replication which can quite easily do filtering of what's
> >> replicated where.
> >
> > Is there a way to do logical synchronous replication?  I didn't think
> > there was.
> >
> 
> Yes, the logical replication has similar syncrep properties as the
> binary one (feedback works same way).

Yes, I assumed that.  Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too.  However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL.  This is true of binary and logical replication.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Tomas Vondra
Date:
Hi,

On 09/03/2015 05:02 AM, Amit Kapila wrote:
> On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us
> <mailto:bruce@momjian.us>> wrote:
>  >
>  > On Wed, Sep  2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
>  > > >Can you explain why logical replication is better than binary
>  > > >replication for this use-case?
>  > > >
>  > >
>  > > Selectivity?
>  >
>> I was assuming you would just create identical slaves to handle
>> failure, rather than moving selected data around.
>  >
>
> Yes, I also think so, otherwise when the shard goes down and it's
> replica has to take the place of shard, it will take more time to
> make replica available as it won't have all the data as original
> shard had.

Not really, the idea is that you don't need to create the replica 
immediately. The system recognizes that primary shard location is 
unavailable and redirects the tasks to the "replicas." So the time to 
recreate the failed node is not that critical.

It needs to be done in a smart way to prevent some typical issues like 
suddenly doubling the load on replicas due to failure of the primary 
location. By using different group of nodes for each "data segment" you 
can eliminate this, because the group of nodes to handle the additional 
load will be larger.

The other issue then of course is that the groups of nodes must not be 
entirely random, otherwise the cluster would suffer data loss in case of 
outage of arbitrary group of K nodes (where K is the number of replicas 
for each piece of data).

It's also non-trivial to do this when you have to consider racks, data 
centers etc.

With regular slaves you can't do any of this - no matter what you do, 
you have to load balance the additional load only on the slaves.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Horizontal scalability/sharding

From
Kevin Grittner
Date:
Josh Berkus <josh@agliodbs.com> wrote:

>>> You'd also need a way to let the connection nodes know when a replica
>>> has fallen behind so that they can be taken out of
>>> load-balancing/sharding for read queries.  For the synchronous model,
>>> that would be "fallen behind at all"; for asynchronous it would be
>>> "fallen more than ### behind".
>>
>> How is that different from the previous thing?  Just that we'd treat
>> "lagging" as "down" beyond some threshold?  That doesn't seem like a
>> mandatory feature.
>
> It's a mandatory feature if you want to load-balance reads.  We have to
> know which nodes not to send reads to because they are out of sync.

There is another approach to this that we should consider how (if?)
we are going to cover: database affinity.  I have seen cases where
there are multiple databases which are targets of asynchronous
replication, with a web application load balancing among them.  The
application kept track of which copy each connection was using, so
that if when they were not exactly in sync the user never saw "time
moving backward".  Two different users might see versions of the
data from different points in time, but that generally doesn't
matter, especially if the difference is just a few minutes.  If one
copy got too far behind for some reason, they would load-shift to
the other servers (time still moves forward, only there is a "jump"
forward at the shift).  This would allow the tardy database to be
dedicated to catching up again.

Bottom line is that this very smooth behavior required two features
-- the ability for the application to control database affinity,
and the ability to shift that affinity gracefully (with no down
time).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Bruce Momjian
Date:
On Thu, Sep  3, 2015 at 03:40:40PM +0200, Tomas Vondra wrote:
> Not really, the idea is that you don't need to create the replica
> immediately. The system recognizes that primary shard location is
> unavailable and redirects the tasks to the "replicas." So the time
> to recreate the failed node is not that critical.
> 
> It needs to be done in a smart way to prevent some typical issues
> like suddenly doubling the load on replicas due to failure of the
> primary location. By using different group of nodes for each "data
> segment" you can eliminate this, because the group of nodes to
> handle the additional load will be larger.
> 
> The other issue then of course is that the groups of nodes must not
> be entirely random, otherwise the cluster would suffer data loss in
> case of outage of arbitrary group of K nodes (where K is the number
> of replicas for each piece of data).
> 
> It's also non-trivial to do this when you have to consider racks,
> data centers etc.
> 
> With regular slaves you can't do any of this - no matter what you
> do, you have to load balance the additional load only on the slaves.

Yes, and imagine doing this with FDW's, updating the catalog table
location of the FDW as part of the failover process --- interesting.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

But, Thomas Munro is fixing it!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Andres Freund
Date:
On 2015-09-03 12:10:08 -0400, Robert Haas wrote:
> On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > Yes, I assumed that.  Logical replication uses WAL, so if you are
> > synchronous with WAL, logical replication is synchronous too.  However,
> > of course, it is synchronous in being durable, not synchronous in terms
> > of applying the WAL.  This is true of binary and logical replication.

Actually that's not really true - it's just a question which LSNs you
return. For UDR/BDR the relevant LSN is the LSN of the last durably
committed transaction. And thus they wait for apply, not anything else.

> But, Thomas Munro is fixing it!

+ many to that effort.

Greetings,

Andres Freund



Re: Horizontal scalability/sharding

From
Josh Berkus
Date:
On 09/03/2015 03:57 AM, Bruce Momjian wrote:
>> > 
>> > Yes, the logical replication has similar syncrep properties as the
>> > binary one (feedback works same way).

Oh?  What does UDR/BDR currently support for sync?

> Yes, I assumed that.  Logical replication uses WAL, so if you are
> synchronous with WAL, logical replication is synchronous too.  However,
> of course, it is synchronous in being durable, not synchronous in terms
> of applying the WAL.  This is true of binary and logical replication.

Well, there's no such thing as simultaneity in scalable architectures.
But users are already used to that ... anybody who load-balances to read
slaves knows about lag.  The only way* to ensure near-simultenaity is to
have some kind of single-node, single-process GTM for the cluster, and
then your actual scalability goes bye-bye.

The bigger issue we'll need to address with this is the fight between
lag and load-balancing, which would become a much worse issue with
read-load-balanced shards which are transparent to the user.  They'd see
the effects of lag, without having actually chosen to use this or that
replica.  This is the other reason to look at logical replication;
presumably with logrep, we can be more discriminating about what
activities cause lag (for one thing, vacuum won't).

Also:
On 09/03/2015 07:00 AM, Kevin Grittner wrote:
> There is another approach to this that we should consider how (if?)
> we are going to cover: database affinity.  I have seen cases where
> there are multiple databases which are targets of asynchronous
> replication, with a web application load balancing among them.  The
> application kept track of which copy each connection was using, so
> that if when they were not exactly in sync the user never saw "time
> moving backward".  Two different users might see versions of the
> data from different points in time, but that generally doesn't
> matter, especially if the difference is just a few minutes.  If one
> copy got too far behind for some reason, they would load-shift to
> the other servers (time still moves forward, only there is a "jump"
> forward at the shift).  This would allow the tardy database to be
> dedicated to catching up again.
>
> Bottom line is that this very smooth behavior required two features
> -- the ability for the application to control database affinity,
> and the ability to shift that affinity gracefully (with no down
> time).

Yes.  Frankly, it would be *easier* to code things so that the same
session always gets its requests load balanced to the same copies;
making that a feature, too, is nice.


(* there are actually other ways to come close to simultaneity, but they
are much more complicated)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Horizontal scalability/sharding

From
Ozgun Erdogan
Date:
Hey Robert,

Now the question is, where should the code that does all of this live?
 postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult.  If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results.  I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.

Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for pg_shard, we thought that Map/Reduce would require a comprehensive revamp of the APIs.

For Citus, a second part of the question is as FDW writers. We implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from even the simple join pushdown that doesn't require Map/Reduce.

The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these have support for joins: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Best,
Ozgun

Re: Horizontal scalability/sharding

From
Ashutosh Bapat
Date:


On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Hey Robert,

Now the question is, where should the code that does all of this live?
 postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult.  If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results.  I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.

Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for pg_shard, we thought that Map/Reduce would require a comprehensive revamp of the APIs.

For Citus, a second part of the question is as FDW writers. We implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from even the simple join pushdown that doesn't require Map/Reduce.

I didn't get this. Join pushdown infrastructure (chiefly set of hooks provided in join planning paths) is part of 9.5. Isn't that sufficient to implement join push-down for above FDWs? Or FDW writers are facing problems while implementing those hooks. In either case that should be reported on hackers.
 

The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these have support for joins: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Best,
Ozgun



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Horizontal scalability/sharding

From
Ahsan Hadi
Date:


On Monday, September 7, 2015, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:


On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Hey Robert,

Now the question is, where should the code that does all of this live?
 postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult.  If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results.  I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.

Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for pg_shard, we thought that Map/Reduce would require a comprehensive revamp of the APIs.

For Citus, a second part of the question is as FDW writers. We implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from even the simple join pushdown that doesn't require Map/Reduce.

I didn't get this. Join pushdown infrastructure (chiefly set of hooks provided in join planning paths) is part of 9.5. Isn't that sufficient to implement join push-down for above FDWs? Or FDW writers are facing problems while implementing those hooks. In either case that should be reported on hackers.
 

I don't think any FDW writer (other the postgres_fdw) has tried to implement join push down in the respective FDW's using the new API. 
 

The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these have support for joins: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Best,
Ozgun



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


--
Ahsan Hadi
Snr Director Product Development
EnterpriseDB Corporation
The Enterprise Postgres Company

Phone: +92-51-8358874   
Mobile: +92-333-5162114

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.

Re: Horizontal scalability/sharding

From
Thomas Munro
Date:
On Thu, Sep 3, 2015 at 7:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 11:41 AM, Robert Haas wrote:
> 4. Therefore, I think that we should instead use logical replication,
> which might be either synchronous or asynchronous.  When you modify
> one copy of the data, that change will then be replicated to all other
> nodes.  If you are OK with eventual consistency, this replication can
> be asynchronous, and nodes that are off-line will catch up when they
> are on-line.  If you are not OK with that, then you must replicate
> synchronously to every node before transaction commit; or at least you
> must replicate synchronously to every node that is currently on-line.
> This presents some challenges: logical decoding currently can't
> replicate transactions that are still in process - replication starts
> when the transaction commits.  Also, we don't have any way for
> synchronous replication to wait for multiple nodes.

Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one.  What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.

You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries.  For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".

I have been thinking about that problem in the context of synchronous_commit = apply, and while trying to review the multiple synchronous patch.  How are you supposed to actually make use of remote-apply semantics without a way to find a replica that is consistent?  And what does consistent mean?  I'm going to say consistency means: it sees at least all effects of all preceding COMMITs that returned successfully.  It's trivial in a no-timeout, single standby topology (if commit returned, the sole sync replica has applied your transaction and replied), but beyond that it obviously requires some more infrastructure and concepts.  Here is my suggestion:

Imagine if we could configure setups like this:

1.  I have 4 servers called london1, london2, paris1, paris2 (see nearby multiple sync server thread/patch).
2.  I have synchronous_commit = apply (see nearby thread/patch)
3.  Currently london1 is master, with the other 3 in the synchronous replication set, and I want a minimum replication set of size 2 so I can lose 1 of those and continue

So far so good, now for the double-vapourware part:

4.  The replicas know whether they are currently part of the master's synchronous replication set or not because it tells them
5.  While waiting for replicas, the master only waits up to X milliseconds, and if no reply is received from a given server it drops that server from the sync rep set, like a RAID controller dropping an unresponsive element from a RAID array, but still returns control to the user if 2 nodes (our configured minimum) did reply
6.  If the sync rep set reaches our minimum size 2 because of a node being dropped, then you can no longer drop one, so commit hangs (work is blocked until enough servers connect and catch up again)
7.  If a replica sees that it hasn't received WAL records or pings from the master with timestamps from the past Y milliseconds, or receives a message explicitly telling it it's been dropped from the sync rep set, it will start rejecting my queries on the basis that it's potentially out of date
8.  If the master sees suitable apply reply messages stream in from a node that was dropped but has now caught up (possibly having reconnected), it will explicitly tell it that it's back in the sync rep set and start waiting for it again
9.  X is sufficiently larger than Y so that, combined with the 'you've been dropped from/rejoined the sync rep set' messages and well sync'd system clocks, it should not be possible for a replica to expose snapshots that don't include all effects from transactions whose COMMIT command returned on the master.  (If you don't have the concept of a dynamic replication set which replicas are dropped from and rejoin, then the master can't drop a node and carry on, like a RAID controller would, unless it's happy to wait for any old N nodes to reply.  Waiting for any N nodes to reply may be OK for log-flush-only sync rep, but if you want to use apply sync rep and have guarantees about visibility, and you simply wait for any N nodes to reply, then the nodes themselves don't know if they are up to date with master or not (they don't know if they were one of the ones that master waited for some transaction they haven't even heard about yet), so they may show users old data.   Also, if you don't have the master telling the replicas that it considers them to be in or out of the replication set, they don't know exactly when the master decides to consider them in again after they rejoin.)

Now I can connect to any server and specify my requirement to see all committed-on-the-master transactions (meaning: the COMMIT command returned success to the client), and either get an error telling me that it can't guarantee that at that moment (because it has been told it's not in the sync rep set or hasn't heard from the master recently enough), or happily proceed to query the database.  I can send my writes to the master node, and do all my reads on any node I like, and be sure they include whatever I just committed (or someone else just committed and told me about).  I can also use FDW to query the replicas from the master and know that they can see everything already committed (but of course not uncommitted changes; I'm guessing you get that on the GTM based systems).

The main problem I can see so far with this scheme is that you can see things on the replicas *before* the COMMIT returns.  Is that a problem?  I suspect all solutions to that problem involve centralised snapshot control (through a GTM or through the master).

The timeout and ping based drop/join idea is inspired by a non-Postgres system that was presented at CHAR(14) last year that is due to be released as open source one day (and whose author egged me on to try implementing that synchronous_commit = apply patch), and I guess loosely RAID.  Maybe there is better terminology or a name in database literature for this approach, I'm not sure, and maybe it has unacceptable holes.  It's a lot less radical than the GTM/MPP systems, since it just adds a few bells and whistles to the existing single master replication model, and obviously there are plenty more problems to solve to make really useful clustering technology, like master reelection, query routing/node location, load balancing and so forth.  But the master and the replicas have the information they need to do that.

--

Re: Horizontal scalability/sharding

From
Robert Haas
Date:
On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for
> pg_shard, we thought that Map/Reduce would require a comprehensive revamp of
> the APIs.

Well, so you've said.  But what kind of API do you want to see?
Taking control at some very high-level hook like ExecutorRun() is not
really a maintainable solution - it's fine if you've only got one guy
doing it, perhaps, but if you have several FDWs talking to different
kinds of remote systems, they can't all seize overall control.

>> For Citus, a second part of the question is as FDW writers. We implemented
> cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
> even the simple join pushdown that doesn't require Map/Reduce.
>
> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
> have support for joins:
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers

What do you mean by "support for joins"?  Do you mean that only 18 of
the remote data sources can do joins?  If so, why does that matter?
I'd be quite happy if a join pushdown or "distributed shuffle" API had
as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
The fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Horizontal scalability/sharding

From
Petr Jelinek
Date:
On 2015-09-08 19:52, Robert Haas wrote:
> On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
>>> For Citus, a second part of the question is as FDW writers. We implemented
>> cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
>> even the simple join pushdown that doesn't require Map/Reduce.
>>
>> The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
>> have support for joins:
>> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>
> What do you mean by "support for joins"?  Do you mean that only 18 of
> the remote data sources can do joins?  If so, why does that matter?
> I'd be quite happy if a join pushdown or "distributed shuffle" API had
> as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
> The fact that not all FDWs can support every operation because of
> limitations on the remote side isn't a reason not to support those
> operations when the remote side is capable.
>

Agreed. While I personally don't think FDWs are long term answer for 
sharding, I do believe that the infrastructure that is being developed 
for FDWs (join/aggregate pushdown) is needed anyway and there are many 
common issues that need solving in this area for FDWs, sharding and 
parallel query.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Horizontal scalability/sharding

From
Merlin Moncure
Date:
On Mon, Sep 7, 2015 at 1:33 PM, Ahsan Hadi <ahsan.hadi@enterprisedb.com> wrote:
> I
>
> On Monday, September 7, 2015, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>>
>>
>> On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
>>>
>>> Hey Robert,
>>>
>>>> Now the question is, where should the code that does all of this live?
>>>>  postgres_fdw?  Some new, sharding-specific FDW?  In core?  I don't
>>>> know for sure, but what I do know is that we could make a lot of
>>>> progress over where we are today by just improving postgres_fdw, and I
>>>> don't think those improvements are even all that difficult.  If we
>>>> decide we need to implement something new, it's going to be a huge
>>>> project that will take years to complete, with uncertain results.  I'd
>>>> rather have a postgres_fdw-based implementation that is imperfect and
>>>> can't handle some kinds of queries in 9.6 than a promise that by 9.9
>>>> we'll have something really great that handles MPP perfectly.
>>>
>>>
>>> Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
>>> for pg_shard, we thought that Map/Reduce would require a comprehensive
>>> revamp of the APIs.
>>>
>>> For Citus, a second part of the question is as FDW writers. We
>>> implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
>>> benefit from even the simple join pushdown that doesn't require Map/Reduce.
>>
>>
>> I didn't get this. Join pushdown infrastructure (chiefly set of hooks
>> provided in join planning paths) is part of 9.5. Isn't that sufficient to
>> implement join push-down for above FDWs? Or FDW writers are facing problems
>> while implementing those hooks. In either case that should be reported on
>> hackers.
>>
>
>
> I don't think any FDW writer (other the postgres_fdw) has tried to implement
> join push down in the respective FDW's using the new API.

Well, 'jdbc_fdw2' seems to implement deparsing at some level:
https://github.com/heimir-sverrisson/jdbc2_fdw/blob/master/deparse.c

...but this was likely a copy/paste job from the postgres_fdw.  It
should not escape note that the deparsing strategy has dependencies on
the optimizer and the parser.  This is not good; asking the FDW
implementations to implement SQL optimizations is not a sustainable
solution.  They should be indicating, for example, "I support ANSI-92
SQL", and the postgres planner should be deparsing the foreign table
definition and rewriting it, not the other way around.

merlin