Thread: I'd like to discuss scaleout at PGCon

I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

27 May 2018, 08:20:01

Hello,

I'm going to attend PGCon in Ottawa for the first time.  I am happy if
I can meet you.

Because I'm visually impaired, I only have vision to sense light.  If
you see a Japanese man with a height of 171 cm with a white cane, it's
probably me.  I'd be happy if you talk to me.  But as I'm still far
from good at listening and speaking English, I'm sorry if I take an
unfriendly attitude or if I can not keep on talking for a long time.


I'd like to have a session on scaleout design at the unconference.
I've created a wiki page for that (this is still just a memo; I'd like
to populate this page with you as the discussion in the community
progresses).  I'd appreciate it if someone could stand with me and
facilitate the discussion at the unconference.

https://wiki.postgresql.org/wiki/Scaleout_Design

The background is ... our company is faced with an immediate need to
develop the read-write scaleout feature on PostgreSQL.  We tried
Postgres-XL with much hope, but we found it difficult to achieve our
performance goal.  I will tell you the details at the conference.  But
personally, Postgres-XL seems to be very nice software, and I feel
that good parts of it should be integrated into core.

I know that many great hackers from 2ndQuadrant, EnterpriseDB, NTT,
Postgres Professional, CitusData, and so on are addressing this
difficult scaleout feature.  I don't think yet we are competent to
lead this development.

On the other hand, we have a proprietary RDBMS called Symfoware (I'm
sure you don't know it), which is not based on PostgreSQL, that
provides the scaleout feature.  Its architecture is a mix of shared
nothing and shared everything.  It implements deadlock detection and
resolution without a central node or periodic monitoring, parallel 2PC
across nodes, parallel crash recovery, client connection routing and
failover without any overhead of intermediary middleware during SQL
execution, etc.  So we may be able to help in some way.  I'd be happy
if we could help the community to proceed with development of
scaleout.

If you have a session for scaleout outside the unconference, could you
call me and let me join it?


By the way, the popularity score of PostgreSQL finally exceeded 400
points in the DB-Engines ranking!  The popularity difference with the
top products has shrunk greatly.  Let's make PostgreSQL more popular.

https://db-engines.com/en/ranking

    [as of May 27, 2018]
    Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
    PostgreSQL=400.90  MongoDB=342.11
    (Oracle / PostgreSQL ratio is 3.2)

    [as of Feb 2016, according to a memo at hand]
    Oracle=1476.14  MySQL=1321.13  SQL Server=??
    MongoDB=??  PostgreSQL=288.66
    (Oracle / PostgreSQL ratio is 5.1)


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Robert Haas

Date:

31 May 2018, 05:26:23

On Sun, May 27, 2018 at 1:20 AM, MauMau <maumau307@gmail.com> wrote:
> I'm going to attend PGCon in Ottawa for the first time.  I am happy if
> I can meet you.

It was nice to meet you in person.

> I'd like to have a session on scaleout design at the unconference.
> I've created a wiki page for that (this is still just a memo; I'd like
> to populate this page with you as the discussion in the community
> progresses).  I'd appreciate it if someone could stand with me and
> facilitate the discussion at the unconference.
>
> https://wiki.postgresql.org/wiki/Scaleout_Design

We didn't have time in the unconference session to discuss these
topics in detail, for you have raised many issues here each of which
deserves discussion individually and in detail.  I wrote a blog post
somewhat related to this topic recently which you can find at
http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html
-- it focuses on a somewhat different set of issues than the ones you
raise but that's good, because we need to hear all perspectives.  In
terms of high-level architecture, I think you are right to wonder
about the possibility of a cloud-native mode based on separating
storage and compute.  Amazon Aurora offers that feature, but we might
want to have it in PostgreSQL.

Another, somewhat different thing that we might want is a fully
distributed database, with a distributed buffer cache, distributed
lock manager, distributed invalidation queue, etc.  That would be like
what Oracle RAC does, but it's a tremendous amount of work, and a lot
of that work has no value by itself.  You don't get the payoff until
it's all working.  There are a few pieces that might be independently
useful, though, like a distributed deadlock detector.  The goal of
this kind of effort is, I think, to support zillions of connections --
scale further than you can with just one node.  This would be a lot
better if combined with the cloud-native storage, but of course that's
even more work.

The FDW approach, of which I have been a supporter for some years now,
is really aiming at a different target, which is to allow efficient
analytics queries across a multi-node cluster.  I think we're getting
pretty close to being able to do that -- IMHO, the last fundamental
building block that we need is asynchronous execution, which Andres is
working on.  After that, it's a matter of adding other features that
people want (like cross-node MVCC) and improving the plans for queries
that still don't perform well (like joins that involve redistributing
one of the tables involved).

We might not want to confine ourselves strictly to the FDW interface
-- for example, I've thought about the idea of building introducing a
new relkind for a "distributed table".  A distributed table may be
present on the local node, in which case it can be scanned like a
local table, or it may be not present, in which case it can be scanned
like a foreign table by connecting to a node on which it is present.
The set of nodes on which a table is present is metadata that is
shared throughout the cluster.  Multi-master logical replication
propagates changes between all nodes on which the table is present.
With a concept like this, there is a lot of opportunity to optimize
queries by, for example, deciding to perform a join on a node on which
both input tables are present, to minimize data movement.  But even if
we create something like this, I see it as fundamentally an extension
of the FDW approach that would hopefully manage to reuse a lot of
what's already been built there.  I don't think we need to (or should)
throw away the work that's been done on FDW pushdown and start over --
we should find a way to build on top of it and add ease-of-use and
management features.

In fact, even if we said that we want a fully distributed database,
we'd probably still need some kind of distributed table concept.
Unless every node has a full copy of everything in the database, you
still need to reason about which data is present on which nodes and
optimize queries accordingly.

> By the way, the popularity score of PostgreSQL finally exceeded 400
> points in the DB-Engines ranking!  The popularity difference with the
> top products has shrunk greatly.  Let's make PostgreSQL more popular.
>
> https://db-engines.com/en/ranking
>
>     [as of May 27, 2018]
>     Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
>     PostgreSQL=400.90  MongoDB=342.11
>     (Oracle / PostgreSQL ratio is 3.2)
>
>     [as of Feb 2016, according to a memo at hand]
>     Oracle=1476.14  MySQL=1321.13  SQL Server=??
>     MongoDB=??  PostgreSQL=288.66
>     (Oracle / PostgreSQL ratio is 5.1)

From the chart view, in February 2016, SQL Server was at 1150.227, and
MongoDB was at 305.599.  Generally it looks like the "big three" --
Oracle, MySQL, and SQL Server -- are all slowly declining in the
rankings while PostgreSQL and MongoDB are slowly rising.

It's good to see PostgreSQL going up, but we have a long way to go to
really be in the big leagues.  I think in the last couple of years we
have done a much better job of adding interesting features than we did
for a few years before that.  Between PostgreSQL 9.6 and 11, we will
have added (and in many cases significantly improved) partitioning,
logical replication, parallel query, and stored procedures.  Along the
way we've improved a lot of other things, not least executor
performance.  I don't know how directly those things affect the
DB-Engines ranking, but they are such major things that I have to
believe they make PostgreSQL a real possibility for many users who
wouldn't previously have found it a viable option, and I assume that's
at least part of what is driving the numbers, though there are
probably other factors too, such as Oracle's acquisition of MySQL.  I
think it's pretty clear that we need to both continue to improve some
of these major new features we've added and at the same time keep
introducing even more new things if we want to continue to gain market
share and mind share.  I hope that features like scale-out and also
zheap are going to help us continue to whittle away at the gap, and I
look forward to seeing what else anyone may have in mind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: I'd like to discuss scaleout at PGCon

From

MauMau

Date:

31 May 2018, 15:12:32

2018-05-31 11:26 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
> It was nice to meet you in person.

Me too.  And it was very kind of you to help me to display the wiki
page well and guide the session.  When I first heard your voice at the
Developer Meeting, I thought Bruce Momjian was speaking, because your
voice sounded similar to him...


> We didn't have time in the unconference session to discuss these
> topics in detail, for you have raised many issues here each of which
> deserves discussion individually and in detail.  I wrote a blog post
> somewhat related to this topic recently which you can find at
> http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html

Yes, I read this article before PGCon.  Your articles are always
helpful to catch the current overall situation of the community.


> In
> terms of high-level architecture, I think you are right to wonder
> about the possibility of a cloud-native mode based on separating
> storage and compute.  Amazon Aurora offers that feature, but we might
> want to have it in PostgreSQL.

> Another, somewhat different thing that we might want is a fully
> distributed database, with a distributed buffer cache, distributed
> lock manager, distributed invalidation queue, etc.  That would be like
> what Oracle RAC does, but it's a tremendous amount of work, and a lot
> of that work has no value by itself.  You don't get the payoff until
> it's all working.  There are a few pieces that might be independently
> useful, though, like a distributed deadlock detector.  The goal of
> this kind of effort is, I think, to support zillions of connections --
> scale further than you can with just one node.  This would be a lot
> better if combined with the cloud-native storage, but of course that's
> even more work.

Yes, I can understand the difficulty.  So, I simply wanted to ask
opinions at the unconference on which (hard) direction the community
wants to go and what database we want PostgreSQL to be like
ultimately.  Without that fundamental consensus, development work
might be wasted, facing objections after submitting patches.  As you
mentioned in your blog post and in a past email, I don't think anyone
yet has a clear image of what the scaleout of PostgreSQL should be.
How should we proceed?  Which approach should we take to minimize
rework?

a) Define functional specification with the basic overall architecture
(this doesn't mean to write a heavy detailed design document or
manual; I think a general README or wiki would be sufficient.)  At
this time, I expect we can evaluate how to use scaleout feature and
whether it's reasonably easy to use.  Then we can proceed to design
and code each part with relief -- 2PC, global consistency, failure
detection and failover, distributed lock management and deadlock
handling, etc.

b) Various developers design and code each part, bring together those
patches, and then all try to figure out how to combine them.

I'm in favor of a) at least at the basic architecture level.
Otherwise, such an unhappiness could happen:
"Hey, I made a patch to implement a distributed cache management like
Oracle Cache Fusion."
"No, we don't want features based on shared everything architecture."

I anticipated a decision process at the unconference like this:
"Do we want to build on shared everything architecture?"
"No, because it limits scalability, requires expensive shared storage,
and it won't run on many clouds."
"Then do we want to employ a new architecture like AWS Aurora?"
"That may be interesting.  But AWS could do it because they have an
infinitely scalable storage layer which is built and used for a long
time.  This architecture may not be our option.  But let's keep our
eye on leveraging services of major cloud vendors just like Vertica
does recently.  Cloud services are now like traditional
vendor-specific hardware.  Maybe PostgreSQL should utilize them just
like we use CPU-specific instructions now and GPU/persistent memory in
the near future."
"Then, it seems that we should go on the shared nothing architecture.
Is it OK?"
"Yes."



> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster.

Oh, I didn't know you support FDW approach mainly for analytics.  I
guessed the first target was OLTP read-write scalability.


> We might not want to confine ourselves strictly to the FDW interface
> -- for example, I've thought about the idea of building introducing a
> new relkind for a "distributed table".  A distributed table may be
> present on the local node, in which case it can be scanned like a
> local table, or it may be not present, in which case it can be scanned
> like a foreign table by connecting to a node on which it is present.
> The set of nodes on which a table is present is metadata that is
> shared throughout the cluster.  Multi-master logical replication
> propagates changes between all nodes on which the table is present.
> With a concept like this, there is a lot of opportunity to optimize
> queries by, for example, deciding to perform a join on a node on which
> both input tables are present, to minimize data movement.

I agree.  XL, Oracle Sharding, and possibly MySQL Cluster does that,
too.  It seems like a must-do thing.


> But even if
> we create something like this, I see it as fundamentally an extension
> of the FDW approach that would hopefully manage to reuse a lot of
> what's already been built there.  I don't think we need to (or should)
> throw away the work that's been done on FDW pushdown and start over --
> we should find a way to build on top of it and add ease-of-use and
> management features.

Agreed.  I think we should not write much code from scratch, too.  On
the other hand, if we have to support sharding natively without FDW, I
wonder if we can reuse the FDW artifact.  I mean, extracting necessary
logics from FDW into common functions, and native sharding code also
calls them.


> In fact, even if we said that we want a fully distributed database,
> we'd probably still need some kind of distributed table concept.
> Unless every node has a full copy of everything in the database, you
> still need to reason about which data is present on which nodes and
> optimize queries accordingly.

Then, how about building the cluster membership management first,
including node management and failure detection/failover?  I think
that node management is necessary anyway, and other developers can
experiment other things on that cluster infrastructure.  Do you think
it would be helpful or wasteful?  I'm finding what we can do for early
scaleout release.


> From the chart view, in February 2016, SQL Server was at 1150.227, and
> MongoDB was at 305.599.  Generally it looks like the "big three" --

Thank you for looking at the chart and telling me the figures.

> I
> think it's pretty clear that we need to both continue to improve some
> of these major new features we've added and at the same time keep
> introducing even more new things if we want to continue to gain market
> share and mind share.  I hope that features like scale-out and also
> zheap are going to help us continue to whittle away at the gap, and I
> look forward to seeing what else anyone may have in mind.

Definitely.  I couldn't agree more.


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Robert Haas

Date:

31 May 2018, 16:44:14

On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
> I anticipated a decision process at the unconference like this:
> "Do we want to build on shared everything architecture?"
> "No, because it limits scalability, requires expensive shared storage,
> and it won't run on many clouds."
> "Then do we want to employ a new architecture like AWS Aurora?"
> "That may be interesting.  But AWS could do it because they have an
> infinitely scalable storage layer which is built and used for a long
> time.  This architecture may not be our option.  But let's keep our
> eye on leveraging services of major cloud vendors just like Vertica
> does recently.  Cloud services are now like traditional
> vendor-specific hardware.  Maybe PostgreSQL should utilize them just
> like we use CPU-specific instructions now and GPU/persistent memory in
> the near future."
> "Then, it seems that we should go on the shared nothing architecture.
> Is it OK?"
> "Yes."

I think all of that except for the last two sentences accords with my
view, at least.  As for the last two sentences, I'm happy to see
someone develop some kind of cloud-native storage if they want, but I
have no immediate plans in that area myself.

>> The FDW approach, of which I have been a supporter for some years now,
>> is really aiming at a different target, which is to allow efficient
>> analytics queries across a multi-node cluster.
>
> Oh, I didn't know you support FDW approach mainly for analytics.  I
> guessed the first target was OLTP read-write scalability.

That seems like a harder target to me, because you will have an extra
hop involved -- SQL from the client to the first server, then via SQL
to a second server.  The work of parsing and planning also has to be
done twice, once for the foreign table and again for the table.  For
longer-running queries this overhead doesn't matter as much, but for
short-running queries it is significant.

> Agreed.  I think we should not write much code from scratch, too.  On
> the other hand, if we have to support sharding natively without FDW, I
> wonder if we can reuse the FDW artifact.  I mean, extracting necessary
> logics from FDW into common functions, and native sharding code also
> calls them.

I wouldn't do it that way.  I want to use postgres_fdw as a whole, not
break it up into pieces and reuse the individual bits of code.  We
should think also about whether other FDWs could be part of it,
although maybe that's not realistic.

>> In fact, even if we said that we want a fully distributed database,
>> we'd probably still need some kind of distributed table concept.
>> Unless every node has a full copy of everything in the database, you
>> still need to reason about which data is present on which nodes and
>> optimize queries accordingly.
>
> Then, how about building the cluster membership management first,
> including node management and failure detection/failover?  I think
> that node management is necessary anyway, and other developers can
> experiment other things on that cluster infrastructure.  Do you think
> it would be helpful or wasteful?  I'm finding what we can do for early
> scaleout release.

I don't know what "node management" and "failure dectection/failover"
mean specifically.  I'd like to hear proposals, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: I'd like to discuss scaleout at PGCon

From

MauMau

Date:

01 June 2018, 06:00:36

2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.

Yes, that extra hop and double parsing/planning were the killer for
our performance goal when we tried to meet our customer's scaleout
needs with XL.  The application executes 82 DML statements in one
transaction.  Those DMLs consist of INSERT, UPDATE and SELECT that
only accesses one row with a primary key.  The target tables are only
a few, so the application PREPAREs a few statements and EXECUTEs them
repeatedly.  We placed the coordinator node of XL on the same host as
the application, and data nodes and GTM on other individual nodes.

The response time of XL compared to PostgreSQL was 2.4 times, and the
throughput (tps) was 43%.  Interestingly, perf showed that
base_yyparse() was the top CPU consumer on both coordinator and data
node, while base_yyparse() appeared near the bottom of the ranking.
The SQL processor should be one layer, not two layers.

In the above benchmark, each transaction only accessed data on one
data node.  That's what sharding principles recommend.  The FDW
approach would be no problem as long as the application follows the
sharding recommendation.

But not all applications will/can follow the sharding recommendation.
The above application, which is migrated from a mainframe, uses
INSERTs to load data, inserting rows onto various nodes.  Considering
your concern of double parsing/planning for a local foreign table and
a remote real table, wouldn't the FDW approach hit the wall?

> I don't know what "node management" and "failure dectection/failover"
> mean specifically.  I'd like to hear proposals, though.

That's nothing special or new.  Things like:

* Define a set of nodes that can join the cluster.
* Initialize or configure a node according to its role in the cluster.
* Decommission a node from the cluster.
* Define a node group in which all member nodes have the same data set
for redundancy.
* One command to start and shutdown the entire cluster.
* System tables to display the member nodes and node groups.
* Each node's in-memory view of the current cluster state.
* How each node monitors which other nodes.
* Elect a new primary node within a node group when the current
primary node fails.
* Whether each node group should be configured with a master-slaves
replication topology, or a multi-master topology like MySQL Group
Replication

Some of the above may end up with XL's things like
pgxc_node/pgxc_group system tables, pgxc_ctl command, CREATE/DROP
NODE/NODE GROUP commands, etc.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

01 June 2018, 17:44:43

On Thu, May 31, 2018 at 11:00 PM, MauMau <maumau307@gmail.com> wrote:
> 2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
>> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>>> guessed the first target was OLTP read-write scalability.
>>
>> That seems like a harder target to me, because you will have an extra
>> hop involved -- SQL from the client to the first server, then via SQL
>> to a second server.  The work of parsing and planning also has to be
>> done twice, once for the foreign table and again for the table.  For
>> longer-running queries this overhead doesn't matter as much, but for
>> short-running queries it is significant.
>
> Yes, that extra hop and double parsing/planning were the killer for
> our performance goal when we tried to meet our customer's scaleout
> needs with XL.  The application executes 82 DML statements in one
> transaction.  Those DMLs consist of INSERT, UPDATE and SELECT that
> only accesses one row with a primary key.  The target tables are only
> a few, so the application PREPAREs a few statements and EXECUTEs them
> repeatedly.  We placed the coordinator node of XL on the same host as
> the application, and data nodes and GTM on other individual nodes.
>

I agree that there's double parsing happening, but I am hesitant to
agree with the double planning claim. We do plan, let's say a join
between two foreign tables, on the local server, but that's only to
decide whether it's efficient to join locally or on the foreign
server. That means we create foreign paths for scan on the foreign
tables, may be as many parameterized plans as the number of join
conditions, and one path for the join pushdown that's it. We then
create local join paths but we need those to decide whether it's
efficient to join locally and if yes, which way. But don't create
paths as to how the foreign server would plan that join. That's not
double planning since we do not create same paths locally and on the
foreign server.

In order to avoid double parsing, we might want to find a way to pass
a "normalized" parse tree down to the foreign server. We need to
normalize the OIDs in the parse tree since those may be different
across the nodes.

>
>
>> I don't know what "node management" and "failure dectection/failover"
>> mean specifically.  I'd like to hear proposals, though.
>
> That's nothing special or new.  Things like:

That's a good summary of what we need here. Thanks for the summary.

>
> * Define a set of nodes that can join the cluster.
> * Initialize or configure a node according to its role in the cluster.
> * Decommission a node from the cluster.
> * Define a node group in which all member nodes have the same data set
> for redundancy.
> * One command to start and shutdown the entire cluster.

Right.

> * System tables to display the member nodes and node groups.

I think we need system tables on each node to store the cluster
configuration as seen by that node not just display. But that's a bit
of a detail.

> * Each node's in-memory view of the current cluster state.
> * How each node monitors which other nodes.

That's where we can have multiple ways. It's either each node
monitoring other nodes or we have a kind of watch-dog or a central
place (not necessarily a node, a file shared across the nodes might as
well server that purpose) from where we can fetch the "last known"
state of the cluster. But that along with split brain problem is much
larger problem to solve. I agree we need some way to know the "last
known" status of the cluster and then correct it as the cluster has
conflicting experiences, and do all of that without having much
communication overhead.

> * Elect a new primary node within a node group when the current
> primary node fails.

Well, some configuration might want more than one primary or
coordinator nodes. Having a single primary in itself creates an SPOF,
which should be avoided. But then there will be some loads which will
be happy with a single primary and risks arising out of that.

> * Whether each node group should be configured with a master-slaves
> replication topology, or a multi-master topology like MySQL Group
> Replication

Instead of a master-slave configuration, we might want to use logical
replication or some such method to create replicas of tables on
multiple nodes in a cluster and let the optimizer take advantage of
that for join push-down or load balancing.

>
> Some of the above may end up with XL's things like
> pgxc_node/pgxc_group system tables, pgxc_ctl command, CREATE/DROP
> NODE/NODE GROUP commands, etc.
>

I agree.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Simon Riggs

Date:

01 June 2018, 18:02:35

On 1 June 2018 at 15:44, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, May 31, 2018 at 11:00 PM, MauMau <maumau307@gmail.com> wrote:
>> 2018-05-31 22:44 GMT+09:00, Robert Haas <robertmhaas@gmail.com>:
>>> On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>>>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>>>> guessed the first target was OLTP read-write scalability.
>>>
>>> That seems like a harder target to me, because you will have an extra
>>> hop involved -- SQL from the client to the first server, then via SQL
>>> to a second server.  The work of parsing and planning also has to be
>>> done twice, once for the foreign table and again for the table.  For
>>> longer-running queries this overhead doesn't matter as much, but for
>>> short-running queries it is significant.
>>
>> Yes, that extra hop and double parsing/planning were the killer for
>> our performance goal when we tried to meet our customer's scaleout
>> needs with XL.  The application executes 82 DML statements in one
>> transaction.  Those DMLs consist of INSERT, UPDATE and SELECT that
>> only accesses one row with a primary key.  The target tables are only
>> a few, so the application PREPAREs a few statements and EXECUTEs them
>> repeatedly.  We placed the coordinator node of XL on the same host as
>> the application, and data nodes and GTM on other individual nodes.
>>
>
> I agree that there's double parsing happening, but I am hesitant to
> agree with the double planning claim. We do plan, let's say a join
> between two foreign tables, on the local server, but that's only to
> decide whether it's efficient to join locally or on the foreign
> server. That means we create foreign paths for scan on the foreign
> tables, may be as many parameterized plans as the number of join
> conditions, and one path for the join pushdown that's it. We then
> create local join paths but we need those to decide whether it's
> efficient to join locally and if yes, which way. But don't create
> paths as to how the foreign server would plan that join. That's not
> double planning since we do not create same paths locally and on the
> foreign server.
>
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree since those may be different
> across the nodes.

Passing detailed info between servers is exactly what XL does.

It requires us to define a cluster, exactly as XL does.

And yes, its a good idea to replicate some tables to all nodes, as XL does.

So it seems we have at last some agreement that some of the things XL
does are the correct approaches.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

Simon Riggs

Date:

01 June 2018, 18:10:38

On 1 June 2018 at 04:00, MauMau <maumau307@gmail.com> wrote:

> The SQL processor should be one layer, not two layers.

For OLTP, that would be best. But it would be restricted to
single-node requests, leaving you the problem of how you know ahead of
time whether an SQL statement was single node or not.

Using a central coordinator node allows us to hide the decision of
single-node/multi-node from the user which seems essential for general
SQL. If you are able to restrict the types of requests users make then
we can do direct access to partitions - so there is scope for a
single-node API, as Mongo provides.

Using a central coordinator also allows multi-node transaction
control, global deadlock detection etc..

And that is why both XL and "FDW approach" rely on a central coordinator.

FDWs alone are not enough. It is clear that some more tight coupling
is required to get things to work well. For example, supporting SQL
query plans that allow for redistribution of data for joins.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

Tom Lane

Date:

01 June 2018, 18:27:14

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree since those may be different
> across the nodes.

I don't think this is a good idea at all.  It breaks any hope of
supporting remote servers that are not the identical version to the local
one (since their parsetrees might be different).  And "normalized OIDs"
sounds like "pie in the sky".  You might get away with asssuming that
built-in functions have stable OIDs, but you can't expect that for
functions in extensions.

            regards, tom lane

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

01 June 2018, 18:56:39

On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection etc..

But that becomes an SPOF and then we have to configure a standby for
that. I am not saying that that's a bad design but it's not very good
for many work-loads. But it would be good if we could avoid any
"central server" in this configuration.

>
> And that is why both XL and "FDW approach" rely on a central coordinator.

I don't think we ever specified that "FDW approach" "relies" on a
central coordinator. One could configure and setup a cluster with
multiple coordinators using FDWs.

>
> FDWs alone are not enough. It is clear that some more tight coupling
> is required to get things to work well. For example, supporting SQL
> query plans that allow for redistribution of data for joins.

I think partitioning + FDW provide basic infrastructure for
distributing data, planning queries working with such data. We need
more glue to support node management, cluster configuration. So, I
agree with your statement. But I think it was clear from the beginning
that we need more than FDW and partitioning.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

01 June 2018, 19:00:45

On Fri, Jun 1, 2018 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> In order to avoid double parsing, we might want to find a way to pass
>> a "normalized" parse tree down to the foreign server. We need to
>> normalize the OIDs in the parse tree since those may be different
>> across the nodes.
>
> I don't think this is a good idea at all.  It breaks any hope of
> supporting remote servers that are not the identical version to the local
> one (since their parsetrees might be different).  And "normalized OIDs"
> sounds like "pie in the sky".  You might get away with asssuming that
> built-in functions have stable OIDs, but you can't expect that for
> functions in extensions.

Sorry for confusing writeup. I didn't mean "normalized OIDs" as I
mentioned in my last sentence. I meant "normalized parse-tree" as in
the first sentence. In order to normalize parse trees, we need to at
least replace various OIDs in parse-tree with something that the
foreign server will understand correctly like table name on the
foreign table pointed to by local foreign table OR (schema qualified)
function names  and so on. There might be more things to "normalize"
in the parse tree other than OIDs, but I can't think of anything right
now.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Merlin Moncure

Date:

01 June 2018, 19:29:43

On Wed, May 30, 2018 at 9:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster.  I think we're getting
> pretty close to being able to do that -- IMHO, the last fundamental
> building block that we need is asynchronous execution, which Andres is
> working on.  After that, it's a matter of adding other features that
> people want (like cross-node MVCC) and improving the plans for queries
> that still don't perform well (like joins that involve redistributing
> one of the tables involved).

FWIW, Distributed analytical queries is the right market to be in.
This is the field in which I work, and this is where the action is at.
I am very, very, sure about this.  My view is that many of the
existing solutions to this problem (in particular hadoop class
soltuions) have major architectural downsides that make them
inappropriate in use cases that postgres really shines at; direct
hookups to low latency applications for example.  postgres is
fundamentally a more capable 'node' with its multiple man-millennia of
engineering behind it.  Unlimited vertical scaling (RAC etc) is
interesting too, but this is not the way the market is moving as
hardware advancements have reduced or eliminated the need for that in
many spheres.

The direction of the project is sound and we are on the cusp of the
point where multiple independent coalescing features (FDW, logical
replication, parallel query, executor enhancements) will open new
scaling avenues that will not require trading off the many other
benefits of SQL that competing contemporary solutions might.  The
broader development market is starting to realize this and that is a
major driver of the recent upswing in popularity.  This is benefiting
me tremendously personally due to having gone 'all-in' with postgres
almost 20 years ago :-D. (Time sure flies)    These are truly
wonderful times for the community.

merlin

Re: I'd like to discuss scaleout at PGCon

From

Simon Riggs

Date:

02 June 2018, 11:05:14

On 1 June 2018 at 16:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> Using a central coordinator also allows multi-node transaction
>> control, global deadlock detection etc..
>
> But that becomes an SPOF and then we have to configure a standby for
> that. I am not saying that that's a bad design but it's not very good
> for many work-loads. But it would be good if we could avoid any
> "central server" in this configuration.
>
>>
>> And that is why both XL and "FDW approach" rely on a central coordinator.
>
> I don't think we ever specified that "FDW approach" "relies" on a
> central coordinator. One could configure and setup a cluster with
> multiple coordinators using FDWs.

Yes, of course. You're just misunderstanding me. I'm talking about a
query coordinator "role". There can be many coordinator components and
they can be spread out in variours ways, but for any one SQL query
there needs to be one coordinator node. Not a SPOF.

>>
>> FDWs alone are not enough. It is clear that some more tight coupling
>> is required to get things to work well. For example, supporting SQL
>> query plans that allow for redistribution of data for joins.
>
> I think partitioning + FDW provide basic infrastructure for
> distributing data, planning queries working with such data. We need
> more glue to support node management, cluster configuration. So, I
> agree with your statement. But I think it was clear from the beginning
> that we need more than FDW and partitioning.

No, it wasn't clear. But I'm glad to hear it. It might actually work then.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

03 June 2018, 00:46:15

On Sat, Jun 2, 2018 at 4:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 1 June 2018 at 16:56, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>> Using a central coordinator also allows multi-node transaction
>>> control, global deadlock detection etc..
>>
>> But that becomes an SPOF and then we have to configure a standby for
>> that. I am not saying that that's a bad design but it's not very good
>> for many work-loads. But it would be good if we could avoid any
>> "central server" in this configuration.
>>
>>>
>>> And that is why both XL and "FDW approach" rely on a central coordinator.
>>
>> I don't think we ever specified that "FDW approach" "relies" on a
>> central coordinator. One could configure and setup a cluster with
>> multiple coordinators using FDWs.
>
> Yes, of course. You're just misunderstanding me. I'm talking about a
> query coordinator "role". There can be many coordinator components and
> they can be spread out in variours ways, but for any one SQL query
> there needs to be one coordinator node. Not a SPOF.

In your earlier mail, which is included above, you mentioned central
coordinator for multi-node transaction control and global deadlock
detection. That doesn't sound like a "query coordinator role". It
sounds more like GTM, which is an SPOF. Anyway I am happy to clarify
that "FDW approach" relies on a query coordinator, the server which
faces the client. But I don't think we have decided how would the
transaction management and deadlock detection work in the shared
nothing cluster of PostgreSQL servers. There was discussion in
developer unconference this year, but I was not part of that as I was
holding another session the same time. May be somebody who attended
that session can post a summary here or provide a link to the summary
written elsewhere.

>
>>>
>>> FDWs alone are not enough. It is clear that some more tight coupling
>>> is required to get things to work well. For example, supporting SQL
>>> query plans that allow for redistribution of data for joins.
>>
>> I think partitioning + FDW provide basic infrastructure for
>> distributing data, planning queries working with such data. We need
>> more glue to support node management, cluster configuration. So, I
>> agree with your statement. But I think it was clear from the beginning
>> that we need more than FDW and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work then.

Good to see some agreement.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Simon Riggs

Date:

03 June 2018, 09:00:41

On 2 June 2018 at 22:46, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

>>>> And that is why both XL and "FDW approach" rely on a central coordinator.
>>>
>>> I don't think we ever specified that "FDW approach" "relies" on a
>>> central coordinator. One could configure and setup a cluster with
>>> multiple coordinators using FDWs.
>>
>> Yes, of course. You're just misunderstanding me. I'm talking about a
>> query coordinator "role". There can be many coordinator components and
>> they can be spread out in variours ways, but for any one SQL query
>> there needs to be one coordinator node. Not a SPOF.
>
> In your earlier mail, which is included above, you mentioned central
> coordinator for multi-node transaction control and global deadlock
> detection. That doesn't sound like a "query coordinator role". It
> sounds more like GTM, which is an SPOF.

In XL, GTM is a singe component managing transaction ids. That has a
standby, so is not a SPOF.

But that is not what I mean. I don't believe that a GTM-style
component is necessary in a future in-core scalablility solution.

Each incoming query needs to be planned and executed from one
coordinator component, then the work performed across many workers on
different nodes (or just one). We could have coordinator components on
each worker node, or we could have a set of coordinator nodes and a
set of worker nodes.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

04 June 2018, 07:05:09

On Sun, Jun 3, 2018 at 2:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> In XL, GTM is a singe component managing transaction ids. That has a
> standby, so is not a SPOF.
>
> But that is not what I mean. I don't believe that a GTM-style
> component is necessary in a future in-core scalablility solution.
>

I agree. I think, a distributed algorithm, which does not need a
single GTM-style node, would be better. That automatically provides
high availability without configuring a standby.

> Each incoming query needs to be planned and executed from one
> coordinator component, then the work performed across many workers on
> different nodes (or just one).

Each node need to be confiugred and maintained. That requires efforts.
So we need to keep the number of nodes to a minimum. With a
coordinator and worker node segregation, we require at least two nodes
in a cluster and just that configuration doesn't provide much
scalability. With each node functioning as coordinator (facing
clients) and worker (facing other coordinators) keeps the number of
nodes to a minimum. It is good for HA.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 17:58:44

From: Simon Riggs
> Passing detailed info between servers is exactly what XL does.
>
> It requires us to define a cluster, exactly as XL does.
>
> And yes, its a good idea to replicate some tables to all nodes, as
XL does.
>
> So it seems we have at last some agreement that some of the things
XL
> does are the correct approaches.

Exactly.  I was impressed when I read the paper on XC for the first
time.  I respect what Suzuki-san and people from NTT, EnterpriseDB,
and 2ndQuadrant have done for XC/XL.  Just like some people believe we
should leverage the artifact for FDW, I think we should leverage the
code and idea of XC/XL.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 19:14:04

From: Robert Haas
On Thu, May 31, 2018 at 8:12 AM, MauMau <maumau307@gmail.com> wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an
extra
> hop involved -- SQL from the client to the first server, then via
SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.

From: Simon Riggs
On 1 June 2018 at 04:00, MauMau <maumau307@gmail.com> wrote:
>> The SQL processor should be one layer, not two layers.

> For OLTP, that would be best. But it would be restricted to
> single-node requests, leaving you the problem of how you know ahead
of
> time whether an SQL statement was single node or not.
>
> Using a central coordinator node allows us to hide the decision of
> single-node/multi-node from the user which seems essential for
general
> SQL. If you are able to restrict the types of requests users make
then
> we can do direct access to partitions - so there is scope for a
> single-node API, as Mongo provides.

I don't think an immediate server like the coordinators in XL is
necessary.  That extra hop can be eliminated by putting both the
coordinator and the data node roles in the same server process.  That
is, the node to which an application connects communicates with other
nodes only when it does not necessary data.

Furthermore, an extra hop and double parsing/planning could matter for
analytic queries, too.  For example, SAP HANA boasts of scanning 1
billion rows in one second.  In HANA's scaleout architecture, an
application can connect to any worker node and the node communicates
with other nodes only when necessary (there's one special node called
"master", but it manages the catalog and transactions; it's not an
extra hop like the coordinator in XL).  Vertica is an MPP analytics
database, but it doesn't have a node like the coordinator, either.  To
achieve maximum performance for real-time queries, the scaleout
architecture should avoid an extra hop when possible.

> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection etc..

VoltDB does not have an always-pass hop like the coordinator in XL.
Our proprietary RDBMS named Symfoware, which is not based on
PostgreSQL, also doesn't have an extra hop, and can handle distributed
transactions and deadlock detection/resolution without any special
node like GTM.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 19:34:20

From: Ashutosh Bapat
> In order to normalize parse trees, we need to at
> least replace various OIDs in parse-tree with something that the
> foreign server will understand correctly like table name on the
> foreign table pointed to by local foreign table OR (schema
qualified)
> function names  and so on.

Yes, that's the drawback of each node in the cluster having different
OIDs for the same object.  That applies to XL, too.  In XL, the data
node returns the data type names of the columns in the result set to
the coordinator.  Then the coordinator seemed to parse each data type
name with base_yyparse() to convert the name to its OID on the
coordinator.  That's why base_yyparse() appeared at the top in the
perf profile.  That repeated typename-to-OID conversion would be
reduced by caching the conversion result, like the logical replication
of PostgreSQL does.  But managing the catalog at one place and using
the same OID values seems to concise to me as a concept.


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 19:36:35

From: Ashutosh Bapat
> In order to normalize parse trees, we need to at
> least replace various OIDs in parse-tree with something that the
> foreign server will understand correctly like table name on the
> foreign table pointed to by local foreign table OR (schema
qualified)
> function names  and so on.

Yes, that's the drawback of each node in the cluster having different
OIDs for the same object.  That applies to XL, too.  In XL, the data
node returns the data type names of the columns in the result set to
the coordinator.  Then the coordinator seemed to parse each data type
name with base_yyparse() to convert the name to its OID on the
coordinator.  That's why base_yyparse() appeared at the top in the
perf profile.  That repeated typename-to-OID conversion would be
reduced by caching the conversion result, like the logical replication
of PostgreSQL does.  But managing the catalog at one place and using
the same OID values seems to concise to me as a concept.

Regards
MauMau

-----Original Message----- 
From: Ashutosh Bapat
Sent: Saturday, June 2, 2018 1:00 AM
To: Tom Lane
Cc: MauMau ; Robert Haas ; PostgreSQL Hackers
Subject: Re: I'd like to discuss scaleout at PGCon

On Fri, Jun 1, 2018 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> In order to avoid double parsing, we might want to find a way to
pass
>> a "normalized" parse tree down to the foreign server. We need to
>> normalize the OIDs in the parse tree since those may be different
>> across the nodes.
>
> I don't think this is a good idea at all.  It breaks any hope of
> supporting remote servers that are not the identical version to the
local
> one (since their parsetrees might be different).  And "normalized
OIDs"
> sounds like "pie in the sky".  You might get away with asssuming
that
> built-in functions have stable OIDs, but you can't expect that for
> functions in extensions.

Sorry for confusing writeup. I didn't mean "normalized OIDs" as I
mentioned in my last sentence. I meant "normalized parse-tree" as in
the first sentence. In order to normalize parse trees, we need to at
least replace various OIDs in parse-tree with something that the
foreign server will understand correctly like table name on the
foreign table pointed to by local foreign table OR (schema qualified)
function names  and so on. There might be more things to "normalize"
in the parse tree other than OIDs, but I can't think of anything right
now.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 19:48:47

From: Ashutosh Bapat
> Each node need to be confiugred and maintained. That requires
efforts.
> So we need to keep the number of nodes to a minimum. With a
> coordinator and worker node segregation, we require at least two
nodes
> in a cluster and just that configuration doesn't provide much
> scalability. With each node functioning as coordinator (facing
> clients) and worker (facing other coordinators) keeps the number of
> nodes to a minimum. It is good for HA.

I think so, too.  Every node should behave as both the coordinator and
the data node in XL parlance.  But I don't preclude a central node.
Some node needs to manage sequences, and it may as well manage the
system catalog.


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 20:17:07

From: Merlin Moncure
> FWIW, Distributed analytical queries is the right market to be in.
> This is the field in which I work, and this is where the action is
at.
> I am very, very, sure about this.  My view is that many of the
> existing solutions to this problem (in particular hadoop class
> soltuions) have major architectural downsides that make them
> inappropriate in use cases that postgres really shines at; direct
> hookups to low latency applications for example.  postgres is
> fundamentally a more capable 'node' with its multiple man-millennia
of
> engineering behind it.  Unlimited vertical scaling (RAC etc) is
> interesting too, but this is not the way the market is moving as
> hardware advancements have reduced or eliminated the need for that
in
> many spheres.

I'm feeling the same.  As the Moore's Law ceases to hold, software
needs to make most of the processor power.  Hadoop and Spark are
written in Java and Scala.  According to Google [1] (see Fig. 8), Java
is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
2.5x - 3.6x.

Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
someday, when PostgreSQL supports scaleout, in-memory database,
multi-model capability, and in-database filesystem?  That may be a
pipedream, but why do people have to tolerate the separation of the
relational-based data  warehouse and Hadoop-based data lake?


[1]    Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
Proceedings of Scala Days 2011

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

05 June 2018, 20:53:37

From: Simon Riggs
On 1 June 2018 at 16:56, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> I think partitioning + FDW provide basic infrastructure for
>> distributing data, planning queries working with such data. We need
>> more glue to support node management, cluster configuration. So, I
>> agree with your statement. But I think it was clear from the
beginning
>> that we need more than FDW and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work
then.

I found a possibly interesting description in the XL manual.  Although
XL performs various pushdowns like FDW, XL seems to perform some kind
of joins with datanode-to-datanode communication.  Doesn't this prove
that the FDW approach can't handle those joins optimally?  What kind
of joins use the shared queue?

https://www.postgres-xl.org/documentation/pg-xc-specifics.html
--------------------------------------------------
shared_queues (integer)
Datanode Only

For some joins that occur in queries, data from one Datanode may need
to be joined with data from another Datanode. Postgres-XL uses shared
queues for this purpose. During execution each Datanode knows if it
needs to produce or consume tuples, or both.

Note that there may be mulitple shared_queues used even for a single
query. So a value should be set taking into account the number of
connections it can accept and expected number of such joins occurring
simultaneously.
--------------------------------------------------

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Michael Paquier

Date:

06 June 2018, 06:06:08

On Wed, Jun 06, 2018 at 01:14:04AM +0900, MauMau wrote:
> I don't think an immediate server like the coordinators in XL is
> necessary.  That extra hop can be eliminated by putting both the
> coordinator and the data node roles in the same server process.  That
> is, the node to which an application connects communicates with other
> nodes only when it does not necessary data.

Yes, I agree with that.  This was actually a concern I had over the
original XC design after a couple of years working on it.  The less
nodes, the easier the HA, even if applying any PITR logic in N nodes
instead of N*2 nodes with 2PC checks and cleanups is far from trivial
either..  It happens that the code resulting in splitting coordinator
and datanode was simpler to maintain than merging both, at the cost of
operation maintenance and complexity in running the thing.

> Furthermore, an extra hop and double parsing/planning could matter for
> analytic queries, too.  For example, SAP HANA boasts of scanning 1
> billion rows in one second.  In HANA's scaleout architecture, an
> application can connect to any worker node and the node communicates
> with other nodes only when necessary (there's one special node called
> "master", but it manages the catalog and transactions; it's not an
> extra hop like the coordinator in XL).  Vertica is an MPP analytics
> database, but it doesn't have a node like the coordinator, either.  To
> achieve maximum performance for real-time queries, the scaleout
> architecture should avoid an extra hop when possible.

Greenplum's orca planner (and Citus?) have such facilities if I recall
correctly, just mentioning that pushing down directly to remote nodes
compiled plans ready for execution exists here and there (that's not the
case of XC/XL).  For queries whose planning time is way shorter than its
actual execution, like analytical work that would not matter much.  But
not for OLTP and short transaction workloads.

>> Using a central coordinator also allows multi-node transaction
>> control, global deadlock detection etc..
>
> VoltDB does not have an always-pass hop like the coordinator in XL.

Greenplum uses also a single-coordinator, multi-datanode instance.  That
looks similar, right?

> Our proprietary RDBMS named Symfoware, which is not based on
> PostgreSQL, also doesn't have an extra hop, and can handle distributed
> transactions and deadlock detection/resolution without any special
> node like GTM.

Interesting to know that.  This is an area with difficult problems.  At
the closer to merge with Postgres head, the more fun (?) you get into
trying to support new SQL features, and sometimes you finish with hard
ERRORs or extra GUC switches to prevent any kind of inconsistent
operations.
--
Michael

Attachment

signature.asc

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

06 June 2018, 06:48:33

On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote:
> From: Ashutosh Bapat
>> In order to normalize parse trees, we need to at
>> least replace various OIDs in parse-tree with something that the
>> foreign server will understand correctly like table name on the
>> foreign table pointed to by local foreign table OR (schema
> qualified)
>> function names  and so on.
>
> Yes, that's the drawback of each node in the cluster having different
> OIDs for the same object.  That applies to XL, too.

Keeping OIDs same across the nodes would require extra communication
between nodes to keep track of next OID, dropped OIDs etc. We need to
weigh the time spent in that communication and the time saved during
parsing.

>  In XL, the data
> node returns the data type names of the columns in the result set to
> the coordinator.  Then the coordinator seemed to parse each data type
> name with base_yyparse() to convert the name to its OID on the
> coordinator.  That's why base_yyparse() appeared at the top in the
> perf profile.

I do not understand, why do we need base_yyparse() to parse type name.
We already have functions specifically for parsing object names.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Simon Riggs

Date:

06 June 2018, 11:44:07

On 5 June 2018 at 17:14, MauMau <maumau307@gmail.com> wrote:

> Furthermore, an extra hop and double parsing/planning could matter for
> analytic queries, too.  For example, SAP HANA boasts of scanning 1
> billion rows in one second.  In HANA's scaleout architecture, an
> application can connect to any worker node and the node communicates
> with other nodes only when necessary (there's one special node called
> "master", but it manages the catalog and transactions; it's not an
> extra hop like the coordinator in XL).  Vertica is an MPP analytics
> database, but it doesn't have a node like the coordinator, either.  To
> achieve maximum performance for real-time queries, the scaleout
> architecture should avoid an extra hop when possible.

Agreed. When possible.

When is it possible?

Two ways I know of are:

1. have pre-knowledge in the client of where data is located
(difficult to scale)
2. you must put data in all places the client can connect to (i.e. multimaster)

Perhaps there are more?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

Konstantin Knizhnik

Date:

06 June 2018, 11:58:51

On 05.06.2018 20:17, MauMau wrote:
> From: Merlin Moncure
>> FWIW, Distributed analytical queries is the right market to be in.
>> This is the field in which I work, and this is where the action is
> at.
>> I am very, very, sure about this.  My view is that many of the
>> existing solutions to this problem (in particular hadoop class
>> soltuions) have major architectural downsides that make them
>> inappropriate in use cases that postgres really shines at; direct
>> hookups to low latency applications for example.  postgres is
>> fundamentally a more capable 'node' with its multiple man-millennia
> of
>> engineering behind it.  Unlimited vertical scaling (RAC etc) is
>> interesting too, but this is not the way the market is moving as
>> hardware advancements have reduced or eliminated the need for that
> in
>> many spheres.
> I'm feeling the same.  As the Moore's Law ceases to hold, software
> needs to make most of the processor power.  Hadoop and Spark are
> written in Java and Scala.  According to Google [1] (see Fig. 8), Java
> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
> 2.5x - 3.6x.
>
> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
> someday, when PostgreSQL supports scaleout, in-memory database,
> multi-model capability, and in-database filesystem?  That may be a
> pipedream, but why do people have to tolerate the separation of the
> relational-based data  warehouse and Hadoop-based data lake?
>
>
> [1]    Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
> Proceedings of Scala Days 2011
>
> Regards
> MauMau
>
>
I can not completely agree with it. I have done a lot of benchmarking of 
PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for 
TPC-H queries.
The picture is not so obvious... All this systems provides different 
scalability and so shows best performance at different hardware 
configurations.
Also Java JIT has made a good progress since 2011. Calculation intensive 
code (like matrix multiplication) implemented in Java is about 2 times 
slower than optimized C code.
But DBMSes are rarely CPU bounded. Even if all database fits in memory 
(which is not so common scenario for big data applications), speed of 
modern CPU is much higher than RAM access speed... Java application are 
slower than C/C++ mostly because of garbage collection. This is why 
SparkSQL is moving to off-heap approach when objects are allocated 
outside Java heap and so not affecting Java GC.  New versions of 
SparkSQL with off-heap memory and native code generation show very good 
performance. And high scalability always was one of the major features 
of SparkSQL.

So it is naive to expect that Postgres will be 4 times faster than 
SparkSQL on analytic queries just because it is written in C and 
SparkSQL - in Scala.
Postgres has made a very good progress in support of OLAP in last 
releases: it now supports parallel query execution, JIT, partitioning...
But still its scalability is very limited comparing with SparkSQL. I am 
not sure about GreenPlum with its sophisticated distributed query 
optimizer, but
most of other OLAP solutions for Postgres are not able to efficiently 
handle complex queries (with a lot of joins by non-partitioning keys).

I do not want to say that it is not possible to implement good analytic 
platform for OLAP on top of Postgres. But it is very challenged task.
And IMHO choice of programming language is not so important. What is 
more important is format of storing data. The bast systems for data 
analytic: Vartica, HyPer, KDB,...
are using vertical data mode. SparkSQL is also using Parquet file format 
which provides efficient extraction and processing of data.
With abstract storage API Postgres is also given a chance to implement 
efficient storage for OLAP data processing. But huge amount of work has 
to be done here.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: I'd like to discuss scaleout at PGCon

From

Pavel Stehule

Date:

06 June 2018, 12:11:02

2018-06-06 10:58 GMT+02:00 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:

On 05.06.2018 20:17, MauMau wrote:
From: Merlin Moncure
FWIW, Distributed analytical queries is the right market to be in.
This is the field in which I work, and this is where the action is
at.
I am very, very, sure about this. My view is that many of the
existing solutions to this problem (in particular hadoop class
soltuions) have major architectural downsides that make them
inappropriate in use cases that postgres really shines at; direct
hookups to low latency applications for example. postgres is
fundamentally a more capable 'node' with its multiple man-millennia
of
engineering behind it. Unlimited vertical scaling (RAC etc) is
interesting too, but this is not the way the market is moving as
hardware advancements have reduced or eliminated the need for that
in
many spheres.
I'm feeling the same. As the Moore's Law ceases to hold, software
needs to make most of the processor power. Hadoop and Spark are
written in Java and Scala. According to Google [1] (see Fig. 8), Java
is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
2.5x - 3.6x.

Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
someday, when PostgreSQL supports scaleout, in-memory database,
multi-model capability, and in-database filesystem? That may be a
pipedream, but why do people have to tolerate the separation of the
relational-based data warehouse and Hadoop-based data lake?

[1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
Proceedings of Scala Days 2011

Regards
MauMau

I can not completely agree with it. I have done a lot of benchmarking of PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H queries.
The picture is not so obvious... All this systems provides different scalability and so shows best performance at different hardware configurations.
Also Java JIT has made a good progress since 2011. Calculation intensive code (like matrix multiplication) implemented in Java is about 2 times slower than optimized C code.
But DBMSes are rarely CPU bounded. Even if all database fits in memory (which is not so common scenario for big data applications), speed of modern CPU is much higher than RAM access speed... Java application are slower than C/C++ mostly because of garbage collection. This is why SparkSQL is moving to off-heap approach when objects are allocated outside Java heap and so not affecting Java GC. New versions of SparkSQL with off-heap memory and native code generation show very good performance. And high scalability always was one of the major features of SparkSQL.

So it is naive to expect that Postgres will be 4 times faster than SparkSQL on analytic queries just because it is written in C and SparkSQL - in Scala.
Postgres has made a very good progress in support of OLAP in last releases: it now supports parallel query execution, JIT, partitioning...
But still its scalability is very limited comparing with SparkSQL. I am not sure about GreenPlum with its sophisticated distributed query optimizer, but
most of other OLAP solutions for Postgres are not able to efficiently handle complex queries (with a lot of joins by non-partitioning keys).

I do not want to say that it is not possible to implement good analytic platform for OLAP on top of Postgres. But it is very challenged task.
And IMHO choice of programming language is not so important. What is more important is format of storing data. The bast systems for data analytic: Vartica, HyPer, KDB,...
are using vertical data mode. SparkSQL is also using Parquet file format which provides efficient extraction and processing of data.
With abstract storage API Postgres is also given a chance to implement efficient storage for OLAP data processing. But huge amount of work has to be done here.

Unfortunately, storage is one factor. For good performance columnar storages needs different executor. Although smart columnar storage can get very good compress ratio, so can has sense self.

Regards

Pavel

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

06 June 2018, 17:46:22

From: Ashutosh Bapat
> Keeping OIDs same across the nodes would require extra communication
> between nodes to keep track of next OID, dropped OIDs etc. We need
to
> weigh the time spent in that communication and the time saved during
> parsing.

If we manage the system catalog for cluster-wide objects (tables,
indexes, databases, users, and almost all other objects) on one
central node, and separate the OID space for cluster-wide objects from
that for the node-local objects (temporary tables, tablespaces, and
what else?), we don't need to keep track of OIDs across nodes.


>>  In XL, the data
>> node returns the data type names of the columns in the result set
to
>> the coordinator.  Then the coordinator seemed to parse each data
type
>> name with base_yyparse() to convert the name to its OID on the
>> coordinator.  That's why base_yyparse() appeared at the top in the
>> perf profile.
>
> I do not understand, why do we need base_yyparse() to parse type
name.
> We already have functions specifically for parsing object names.

Looking at the XL source code, the following sequence of functions are
called when the coordinator handles the Row Description message ('T')
from the data node.  I guess the parsing is necessary to process type
names combined with type modifiers, e.g. "char(100)".

    create_tuple_desc
    parseTypeString
    typeStringToTypeName
    raw_parser


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

06 June 2018, 18:53:20

From: Michael Paquier
> Greenplum's orca planner (and Citus?) have such facilities if I
recall
> correctly, just mentioning that pushing down directly to remote
nodes
> compiled plans ready for execution exists here and there (that's not
the
> case of XC/XL).  For queries whose planning time is way shorter than
its
> actual execution, like analytical work that would not matter much.
But
> not for OLTP and short transaction workloads.

It seems that Greenplum does:

https://greenplum.org/docs/580/admin_guide/query/topics/parallel-proc.
html#topic1

"The master receives, parses, and optimizes the query. The resulting
query plan is either parallel or targeted. The master dispatches
parallel query plans to all segments,..."

while Citus doesn't:

https://docs.citusdata.com/en/v7.4/develop/reference_processing.html#c
itus-query-processing

"Next, the planner breaks the query into two parts - the coordinator
query which runs on the coordinator and the worker query fragments
which run on individual shards on the workers. The planner then
assigns these query fragments to the workers such that all their
resources are used efficiently. After this step, the distributed query
plan is passed on to the distributed executor for execution.
...
Once the distributed executor sends the query fragments to the
workers, they are processed like regular PostgreSQL queries. The
PostgreSQL planner on that worker chooses the most optimal plan for
executing that query locally on the corresponding shard table. The
PostgreSQL executor then runs that query and returns the query results
back to the distributed executor."



BTW, the above page states that worker nodes directly exchanges data
during query execution.  Greenplum also does so among segment nodes to
join tables which are distributed by different key columns.  XL seems
to do so, too.  If this type of interaction is necessary, how would
the FDW approach do that?  The remote servers need to interact with
each other.

"The task tracker executor is designed to efficiently handle complex
queries which require repartitioning and shuffling intermediate data
among workers."



> Greenplum uses also a single-coordinator, multi-datanode instance.
That
> looks similar, right?

Greenplum uses a single master and multiple workers.  That's similar
to Citus.  But Greenplum is not similar to VoltDB nor Vertica, since
those allow applications to connect to any node.


>> Our proprietary RDBMS named Symfoware, which is not based on
>> PostgreSQL, also doesn't have an extra hop, and can handle
distributed
>> transactions and deadlock detection/resolution without any special
>> node like GTM.
>
> Interesting to know that.  This is an area with difficult problems.
At
> the closer to merge with Postgres head, the more fun (?) you get
into
> trying to support new SQL features, and sometimes you finish with
hard
> ERRORs or extra GUC switches to prevent any kind of inconsistent
> operations.

Yes, I hope our deadlock detection/resolution can be ported to
PostgreSQL.  But I'm also concerned like you, because Symfoware is
locking-based, not MVCC-based.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

06 June 2018, 19:37:56

From: Simon Riggs
On 5 June 2018 at 17:14, MauMau <maumau307@gmail.com> wrote:

>> Furthermore, an extra hop and double parsing/planning could matter
for
>> analytic queries, too.  For example, SAP HANA boasts of scanning 1
>> billion rows in one second.  In HANA's scaleout architecture, an
>> application can connect to any worker node and the node
communicates
>> with other nodes only when necessary (there's one special node
called
>> "master", but it manages the catalog and transactions; it's not an
>> extra hop like the coordinator in XL).  Vertica is an MPP analytics
>> database, but it doesn't have a node like the coordinator, either.
To
>> achieve maximum performance for real-time queries, the scaleout
>> architecture should avoid an extra hop when possible.

> Agreed. When possible.
>
> When is it possible?
>
> Two ways I know of are:
>
> 1. have pre-knowledge in the client of where data is located
> (difficult to scale)
> 2. you must put data in all places the client can connect to (i.e.
multimaster)

Regarding 1, I understood you meant by "difficult to scale" that
whenever the user adds/removes a node from a cluster and data
placement changes, he has to change his application's connection
string to point to the node where necessary data resides.
Oracle Sharding provides a solution for that problem.  See "6.1 Direct
Routing to a Shard" in the following manual page:

https://docs.oracle.com/en/database/oracle/oracle-database/18/shard/sh
arding-data-routing.html#GUID-64CAD794-FAAA-406B-9E20-0C35E96D3FA8

[Excerpt]
"Oracle clients and connections pools are able to recognize sharding
keys specified in the connection string for high performance data
dependent routing. A shard routing cache in the connection layer is
used to route database requests directly to the shard where the data
resides."

> Perhaps there are more?

Please see 6.1.1 below.  The application can connect to multiple nodes
and distribute processing without an extra hop.  This is also an
interesting idea, isn't it?

https://docs.voltdb.com/UsingVoltDB/ChapAppDesign.php#DesignAppConnect
Multi

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Alvaro Herrera

Date:

06 June 2018, 21:16:23

On 2018-Jun-06, Ashutosh Bapat wrote:

> On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote:
> > From: Ashutosh Bapat
> >> In order to normalize parse trees, we need to at least replace
> >> various OIDs in parse-tree with something that the foreign server
> >> will understand correctly like table name on the foreign table
> >> pointed to by local foreign table OR (schema qualified) function
> >> names and so on.
> >
> > Yes, that's the drawback of each node in the cluster having
> > different OIDs for the same object.  That applies to XL, too.
> 
> Keeping OIDs same across the nodes would require extra communication
> between nodes to keep track of next OID, dropped OIDs etc. We need to
> weigh the time spent in that communication and the time saved during
> parsing.

We already have the ability to give objects some predetermined OID, for
pg_upgrade.

Maybe an easy (hah) thing to do is use 2PC for DDL, agree on a OID
that's free on every node, then create the object in all servers at the
same time.  We currently use the system-wide OID generator to assign the
OID, but seems an easy thing to change (much harder is to prevent
concurrent creation of objects using the arranged OID; maybe can reuse
speculative tokens in btrees for this).  Doing this imposes a cost at
DDL-execution-time only, which seems much better than imposing the cost
of translating name to OID on every server for every query.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

07 June 2018, 01:38:57

From: Alvaro Herrera
> Maybe an easy (hah) thing to do is use 2PC for DDL, agree on a OID
> that's free on every node, then create the object in all servers at
the
> same time.  We currently use the system-wide OID generator to assign
the
> OID, but seems an easy thing to change (much harder is to prevent
> concurrent creation of objects using the arranged OID; maybe can
reuse
> speculative tokens in btrees for this).

Would it mean that DDL execution requires all (primary) nodes to be
running?  I don't say it's an immediate problem yet, but FYI, Oracle
Sharding records all DDL statements and re-send them to the down nodes
later.


> Doing this imposes a cost at
> DDL-execution-time only, which seems much better than imposing the
cost
> of translating name to OID on every server for every query.

Agreed.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

07 June 2018, 01:39:34

From: Alvaro Herrera
> Maybe an easy (hah) thing to do is use 2PC for DDL, agree on a OID
> that's free on every node, then create the object in all servers at
the
> same time.  We currently use the system-wide OID generator to assign
the
> OID, but seems an easy thing to change (much harder is to prevent
> concurrent creation of objects using the arranged OID; maybe can
reuse
> speculative tokens in btrees for this).

Would it mean that DDL execution requires all (primary) nodes to be
running?  I don't say it's an immediate problem yet, but FYI, Oracle
Sharding records all DDL statements and re-send them to the down nodes
later.


> Doing this imposes a cost at
> DDL-execution-time only, which seems much better than imposing the
cost
> of translating name to OID on every server for every query.

Agreed.

Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

07 June 2018, 07:58:15

On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote:
> From: Ashutosh Bapat
>> Keeping OIDs same across the nodes would require extra communication
>> between nodes to keep track of next OID, dropped OIDs etc. We need
> to
>> weigh the time spent in that communication and the time saved during
>> parsing.
>
> If we manage the system catalog for cluster-wide objects (tables,
> indexes, databases, users, and almost all other objects) on one
> central node, and separate the OID space for cluster-wide objects from
> that for the node-local objects (temporary tables, tablespaces, and
> what else?), we don't need to keep track of OIDs across nodes.
>

I think that will lead to many more network trips to the central node
in order to fetch OIDs of various objects. We could reduce that by
using node local caches but still invalidation messages will eat some
bandwidth. The network trips in this case may be more than what are
required for keeping track of OID space across nodes. Do you see
something like this in similar other products?

>
>>>  In XL, the data
>>> node returns the data type names of the columns in the result set
> to
>>> the coordinator.  Then the coordinator seemed to parse each data
> type
>>> name with base_yyparse() to convert the name to its OID on the
>>> coordinator.  That's why base_yyparse() appeared at the top in the
>>> perf profile.
>>
>> I do not understand, why do we need base_yyparse() to parse type
> name.
>> We already have functions specifically for parsing object names.
>
> Looking at the XL source code, the following sequence of functions are
> called when the coordinator handles the Row Description message ('T')
> from the data node.  I guess the parsing is necessary to process type
> names combined with type modifiers, e.g. "char(100)".
>
>     create_tuple_desc
>     parseTypeString
>     typeStringToTypeName
>     raw_parser
>

Hmm, strange. I had not seen that in Postgres-XC.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

07 June 2018, 08:09:09

On Wed, Jun 6, 2018 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2018-Jun-06, Ashutosh Bapat wrote:
>
>> On Tue, Jun 5, 2018 at 10:04 PM, MauMau <maumau307@gmail.com> wrote:
>> > From: Ashutosh Bapat
>> >> In order to normalize parse trees, we need to at least replace
>> >> various OIDs in parse-tree with something that the foreign server
>> >> will understand correctly like table name on the foreign table
>> >> pointed to by local foreign table OR (schema qualified) function
>> >> names and so on.
>> >
>> > Yes, that's the drawback of each node in the cluster having
>> > different OIDs for the same object.  That applies to XL, too.
>>
>> Keeping OIDs same across the nodes would require extra communication
>> between nodes to keep track of next OID, dropped OIDs etc. We need to
>> weigh the time spent in that communication and the time saved during
>> parsing.
>
> We already have the ability to give objects some predetermined OID, for
> pg_upgrade.

True. But that's only for a database not in action. We are talking
about database in action. Assigning a predetermined OID is just one of
and possibly the smallest thing in the bigger picture.

>
> Maybe an easy (hah) thing to do is use 2PC for DDL, agree on a OID
> that's free on every node, then create the object in all servers at the
> same time.  We currently use the system-wide OID generator to assign the
> OID, but seems an easy thing to change (much harder is to prevent
> concurrent creation of objects using the arranged OID; maybe can reuse
> speculative tokens in btrees for this).  Doing this imposes a cost at
> DDL-execution-time only, which seems much better than imposing the cost
> of translating name to OID on every server for every query.

This works if we consider that all the nodes are up always. If a few
nodes are down, the rest of the nodes need to determine the OID and
communicate it to the failed nodes when they come up. That's easier
said than done. The moment we design something like that, we have to
deal with split brain problem. Two sets of nodes which think the other
set is down, will keep assigning OIDs that they think are OK and later
see the conflicts when communicating the assigned OIDs.

Not that we can not implement something like this, but it is a lot of
work. We will need to be careful to identify the cases where the
scheme will fail and plug all the holes.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: I'd like to discuss scaleout at PGCon

From

Michael Paquier

Date:

07 June 2018, 08:23:10

On Thu, Jun 07, 2018 at 10:28:15AM +0530, Ashutosh Bapat wrote:
> On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote:
>> Looking at the XL source code, the following sequence of functions are
>> called when the coordinator handles the Row Description message ('T')
>> from the data node.  I guess the parsing is necessary to process type
>> names combined with type modifiers, e.g. "char(100)".
>>
>>     create_tuple_desc
>>     parseTypeString
>>     typeStringToTypeName
>>     raw_parser
>
> Hmm, strange. I had not seen that in Postgres-XC.

Where is that located in the XC code?  If you point out to the location
it may revive some past memories from either Ashutosh or me.
--
Michael

Attachment

signature.asc

Re: I'd like to discuss scaleout at PGCon

From

Ashutosh Bapat

Date:

07 June 2018, 08:25:21

On Thu, Jun 7, 2018 at 10:53 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jun 07, 2018 at 10:28:15AM +0530, Ashutosh Bapat wrote:
>> On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote:
>>> Looking at the XL source code, the following sequence of functions are
>>> called when the coordinator handles the Row Description message ('T')
>>> from the data node.  I guess the parsing is necessary to process type
>>> names combined with type modifiers, e.g. "char(100)".
>>>
>>>     create_tuple_desc
>>>     parseTypeString
>>>     typeStringToTypeName
>>>     raw_parser
>>
>> Hmm, strange. I had not seen that in Postgres-XC.
>
> Where is that located in the XC code?  If you point out to the location
> it may revive some past memories from either Ashutosh or me.

I think the path will exist in XC as well since that's regular
PostgreSQL path, but I didn't see it being a hot path in XC. So, we
will need to find the workload/cases where it's a hot path.

> --
> Michael



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

RE: I'd like to discuss scaleout at PGCon

From

"Tsunakawa, Takayuki"

Date:

07 June 2018, 09:17:21

From: Michael Paquier [mailto:michael@paquier.xyz]
> > On Wed, Jun 6, 2018 at 8:16 PM, MauMau <maumau307@gmail.com> wrote:
> >> Looking at the XL source code, the following sequence of functions
> >> are called when the coordinator handles the Row Description message
> >> ('T') from the data node.  I guess the parsing is necessary to
> >> process type names combined with type modifiers, e.g. "char(100)".
> >>
> >>     create_tuple_desc
> >>     parseTypeString
> >>     typeStringToTypeName
> >>     raw_parser
> >
> > Hmm, strange. I had not seen that in Postgres-XC.
> 
> Where is that located in the XC code?  If you point out to the location
> it may revive some past memories from either Ashutosh or me.

The XL-specific code path is this in src/backend/pgxc/pool/execRemote.c (I don't know this also exists in XC):
handle_response -> HandleRowDescription -> create_tuple_desc

I think the rest exists in PostgreSQL:
parseTypeString -> typeStringToTypeName -> raw_parser

Regards
Takayuki Tsunakawa

RE: I'd like to discuss scaleout at PGCon

From

"Tsunakawa, Takayuki"

Date:

07 June 2018, 09:24:12

> From: Konstantin Knizhnik [mailto:k.knizhnik@postgrespro.ru]
> I can not completely agree with it. I have done a lot of benchmarking of
> PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for
> TPC-H queries.

Wow, you have an amazingly abundant experience.


> I do not want to say that it is not possible to implement good analytic
> platform for OLAP on top of Postgres. But it is very challenged task.
> And IMHO choice of programming language is not so important. What is
> more important is format of storing data. The bast systems for data
> analytic: Vartica, HyPer, KDB,...
> are using vertical data mode. SparkSQL is also using Parquet file format
> which provides efficient extraction and processing of data.
> With abstract storage API Postgres is also given a chance to implement
> efficient storage for OLAP data processing. But huge amount of work has
> to be done here.

Agreed on huge amount of work... And I must admit my dream is a pipedream now.


Regards
Takayuki Tsunakawa

Re: I'd like to discuss scaleout at PGCon

From

Bruce Momjian

Date:

22 June 2018, 20:34:17

On Fri, Jun  1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote:
> FWIW, Distributed analytical queries is the right market to be in.
> This is the field in which I work, and this is where the action is at.
> I am very, very, sure about this.  My view is that many of the
> existing solutions to this problem (in particular hadoop class
> soltuions) have major architectural downsides that make them
> inappropriate in use cases that postgres really shines at; direct
> hookups to low latency applications for example.  postgres is
> fundamentally a more capable 'node' with its multiple man-millennia of
> engineering behind it.  Unlimited vertical scaling (RAC etc) is
> interesting too, but this is not the way the market is moving as
> hardware advancements have reduced or eliminated the need for that in
> many spheres.
> 
> The direction of the project is sound and we are on the cusp of the
> point where multiple independent coalescing features (FDW, logical
> replication, parallel query, executor enhancements) will open new
> scaling avenues that will not require trading off the many other
> benefits of SQL that competing contemporary solutions might.  The
> broader development market is starting to realize this and that is a
> major driver of the recent upswing in popularity.  This is benefiting
> me tremendously personally due to having gone 'all-in' with postgres
> almost 20 years ago :-D. (Time sure flies)    These are truly
> wonderful times for the community.

I am coming in late, but I am glad we are having this conversation.  We
have made great strides toward sharding while adding minimal
sharding-specific code.  We can now see a time when we will complete the
the minimal sharding-specific code tasks.  Once we reach that point, we
will need to decide what sharding-specific code to add, and to do that,
we need to understand which direction to go in, and to do that, we need
to know the trade-offs.

While I am glad people know a lot about how other projects handle
sharding, these can be only guides to how Postgres will handle such
workloads.  I think we need to get to a point where we have all of the
minimal sharding-specific code features done, at least as
proof-of-concept, and then test Postgres with various workloads like
OLTP/OLAP and read-write/read-only.  This will tell us where
sharding-specific code will have the greatest impact.

What we don't want to do is to add a bunch of sharding-specific code
without knowing which workloads it benefits, and how many of our users
will actually use sharding.  Some projects have it done that, and it
didn't end well since they then had a lot of product complexity with
little user value.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: I'd like to discuss scaleout at PGCon

From

Merlin Moncure

Date:

22 June 2018, 21:28:58

On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Fri, Jun  1, 2018 at 11:29:43AM -0500, Merlin Moncure wrote:
> > FWIW, Distributed analytical queries is the right market to be in.
> > This is the field in which I work, and this is where the action is at.
> > I am very, very, sure about this.  My view is that many of the
> > existing solutions to this problem (in particular hadoop class
> > soltuions) have major architectural downsides that make them
> > inappropriate in use cases that postgres really shines at; direct
> > hookups to low latency applications for example.  postgres is
> > fundamentally a more capable 'node' with its multiple man-millennia of
> > engineering behind it.  Unlimited vertical scaling (RAC etc) is
> > interesting too, but this is not the way the market is moving as
> > hardware advancements have reduced or eliminated the need for that in
> > many spheres.
> >
> > The direction of the project is sound and we are on the cusp of the
> > point where multiple independent coalescing features (FDW, logical
> > replication, parallel query, executor enhancements) will open new
> > scaling avenues that will not require trading off the many other
> > benefits of SQL that competing contemporary solutions might.  The
> > broader development market is starting to realize this and that is a
> > major driver of the recent upswing in popularity.  This is benefiting
> > me tremendously personally due to having gone 'all-in' with postgres
> > almost 20 years ago :-D. (Time sure flies)    These are truly
> > wonderful times for the community.
>
> While I am glad people know a lot about how other projects handle
> sharding, these can be only guides to how Postgres will handle such
> workloads.  I think we need to get to a point where we have all of the
> minimal sharding-specific code features done, at least as
> proof-of-concept, and then test Postgres with various workloads like
> OLTP/OLAP and read-write/read-only.  This will tell us where
> sharding-specific code will have the greatest impact.
>
> What we don't want to do is to add a bunch of sharding-specific code
> without knowing which workloads it benefits, and how many of our users
> will actually use sharding.  Some projects have it done that, and it
> didn't end well since they then had a lot of product complexity with
> little user value.

Key features from my perspective:
*) fdw in parallel.  how do i do it today? ghetto implemented parallel
queries with asynchronous dblink

*) column store

*) automatic partition management through shards

probably some more, gotta run :-)

merlin

Re: I'd like to discuss scaleout at PGCon

From

"Joshua D. Drake"

Date:

22 June 2018, 21:36:08

On 06/22/2018 11:28 AM, Merlin Moncure wrote:
>
> Key features from my perspective:
> *) fdw in parallel.  how do i do it today? ghetto implemented parallel
> queries with asynchronous dblink
>
> *) column store

Although not in core, we do have this as an extension through Citus 
don't we?

JD

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****

Re: I'd like to discuss scaleout at PGCon

From

Bruce Momjian

Date:

23 June 2018, 05:17:41

On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
> On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > What we don't want to do is to add a bunch of sharding-specific code
> > without knowing which workloads it benefits, and how many of our users
> > will actually use sharding.  Some projects have it done that, and it
> > didn't end well since they then had a lot of product complexity with
> > little user value.
> 
> Key features from my perspective:
> *) fdw in parallel.  how do i do it today? ghetto implemented parallel
> queries with asynchronous dblink

Andres has outlined what needs to be done here:

    https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de

> *) column store

This could be part of the plugable storage engine.

> *) automatic partition management through shards

Yes, but I am afraid we need to know where we are going before we can
implement management.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: I'd like to discuss scaleout at PGCon

From

Haribabu Kommi

Date:

23 June 2018, 05:41:00

On Sat, Jun 23, 2018 at 12:17 PM Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
> On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > What we don't want to do is to add a bunch of sharding-specific code
> > without knowing which workloads it benefits, and how many of our users
> > will actually use sharding. Some projects have it done that, and it
> > didn't end well since they then had a lot of product complexity with
> > little user value.
>
> Key features from my perspective:
> *) fdw in parallel. how do i do it today? ghetto implemented parallel
> queries with asynchronous dblink

Andres has outlined what needs to be done here:

https://www.postgresql.org/message-id/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de

Sorry if this was already been discussed in up-thread.

Just I would like to bring out idea scale out by adding many instances that

can share the lock and buffer pool manager with all the instances with

the help of Remote direct memory access.

By adding pluggable buffer pool and lock manager, how about adding

many instances and all share the buffers using RDMA to provide

better scaling with shared everything.

Currently I didn't know have any idea whether is it possible or not and also

the problems in using RDMA.

Just want to check whether is it worth idea to consider in supporting scale out?

Regards,

Haribabu Kommi

Fujitsu Australia

Re: I'd like to discuss scaleout at PGCon

From

Bruce Momjian

Date:

23 June 2018, 06:03:37

On Sat, Jun 23, 2018 at 12:41:00PM +1000, Haribabu Kommi wrote:
> 
> On Sat, Jun 23, 2018 at 12:17 PM Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Fri, Jun 22, 2018 at 01:28:58PM -0500, Merlin Moncure wrote:
>     > On Fri, Jun 22, 2018 at 12:34 PM Bruce Momjian <bruce@momjian.us> wrote:
>     > >
>     > > What we don't want to do is to add a bunch of sharding-specific code
>     > > without knowing which workloads it benefits, and how many of our users
>     > > will actually use sharding.  Some projects have it done that, and it
>     > > didn't end well since they then had a lot of product complexity with
>     > > little user value.
>     >
>     > Key features from my perspective:
>     > *) fdw in parallel.  how do i do it today? ghetto implemented parallel
>     > queries with asynchronous dblink
> 
>     Andres has outlined what needs to be done here:
> 
>             https://www.postgresql.org/message-id/
>     20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de 
> 
> 
> Sorry if this was already been discussed in up-thread.
> 
> Just I would like to bring out idea scale out by adding many instances that
> can share the lock and buffer pool manager with all the instances with
> the help of Remote direct memory access.
> 
> By adding pluggable buffer pool and lock manager, how about adding
> many instances and all share the buffers using RDMA to provide
> better scaling with shared everything.
> 
> Currently I didn't know have any idea whether is it possible or not and also
> the problems in using RDMA. 
> 
> Just want to check whether is it worth idea to consider in supporting scale
> out?

Yes, Robert Haas did mention this.  It might be something we consider
much later.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

RE: I'd like to discuss scaleout at PGCon

From

"tsunakawa.takay@fujitsu.com"

Date:

16 June 2020, 08:01:21

Hello, hackers

I'm very sorry to have left this thread for a long time.  I've come out of hibernation, pushed by the need.  Please let
meresume the discussion on the scale-out design. 

I'll read this thread and related ones again to refresh my memory, and I'd like to continue to assemble ideas and
opinionsin the following wiki page: 

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design

I know there are lots of topics to decide in order to formulate the architectural design and functional specification.
Toget the most with minimal waste of efforts, I think we need to clarify what we want to achieve.  Some of the most
importanttopics are: 

* What workloads do we target?  Does Postgres give up on data warehousing capability that's comparable with existing
productsand cloud services? 

* What architecture(s) do we want?  Shared nothing, shared disk, a combination of them, or a totally new one.

* FDW or non-FDW approach? (as Simon and I mentioned in this thread, I don't think FDW is suitable for the scale-out,
althoughwe should try to reuse the code in postgres_fdw.) 

From: Bruce Momjian <bruce@momjian.us>
> On Sat, Jun 23, 2018 at 12:41:00PM +1000, Haribabu Kommi wrote:
> > Just I would like to bring out idea scale out by adding many instances that
> > can share the lock and buffer pool manager with all the instances with
> > the help of Remote direct memory access.
> >
> > By adding pluggable buffer pool and lock manager, how about adding
> > many instances and all share the buffers using RDMA to provide
> > better scaling with shared everything.
> >
> > Currently I didn't know have any idea whether is it possible or not and also
> > the problems in using RDMA.
> >
> > Just want to check whether is it worth idea to consider in supporting scale
> > out?
>
> Yes, Robert Haas did mention this.  It might be something we consider
> much later.

I said that we wouldn't need shared disk scale-out at or around PGCon developer unconference 2018.  But after that, I
realizedOracle RAC users want shared disk scale-out for Postgres. 

So, I wrote about the comparison of shared nothing and shared disk, and the rough design of shared disk scale-out as in
theattached PDF file (I also attached it in the above wiki.)  The attached file is what I used in PostgreSQL conference
Japan2019 to ask about how many users want shared disk scale-out.  25 out of 53 participants raised their hands to show
theirfeelings of "want it" or "good to have."  That was much more people than I had expected. 

On the other hand, as I questioned in the last slide of the presentation, I'm not sure if we really need shared disk
scale-outin this era that really powerful machines are available even on public clouds.  Should we skip shared disk
scale-outand just pursue single-node scale-up and shared nothing scale-out? 

Amazon Aurora as an Alternative to Oracle RAC
https://aws.amazon.com/jp/blogs/database/amazon-aurora-as-an-alternative-to-oracle-rac/

"Stepping back and looking at the bigger picture, Amazon Aurora introduces a simplified solution that can function as
anOracle RAC alternative for many typical OLTP applications that need high performance writes, scalable reads, very
highlevels of high availability with lower operational overhead." 

Migration Complete -- Amazon’s Consumer Business Just Turned off its Final Oracle Database | AWS News Blog

https://aws.amazon.com/jp/blogs/aws/migration-complete-amazons-consumer-business-just-turned-off-its-final-oracle-database/

Could you have a look at my presentation and give your opinion on whether we want shared disk scale-out for Postgres?

Regards
Takayuki Tsunakawa

Attachment

PostgreSQL_shared_disk_scaleout_EN.pdf

RE: I'd like to discuss scaleout at PGCon

From

"tsunakawa.takay@fujitsu.com"

Date:

18 June 2020, 04:38:06

Hello,

It seems you didn't include pgsql-hackers.

From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>

> I saw the presentation and it is great except that it seems to be unclear of both SD and SN if the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two work (ref below) has some information about the usefulness of this technique for scale out and so it would be an interesting addition to see if in the SN architecture that is being proposed could be modified to take care of this phenomenon and reap the gain.

Thanks. Separation of compute and storage is surely to be considered. Unlike the old days when the shared storage was considered to be a bottleneck with slow HDDs and FC-SAN, we could now expect high speed shared storage thanks to flash memory, NVMe-oF, and RDMA.

> 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A

> Transactional Record Manager for Shared Flash. In CIDR 2011.

This is interesting. I'll go into this. Do you know there's any product based on Hyder? OTOH, Hyder seems to require drastic changes when adopting for Postgres -- OCC, log-structured database, etc. I'd like to hear how feasible those are. However, its scale-out capability without the need for data or application partitioning appears appealing.

To explore another possibility that would have more affinity with the current Postgres, let me introduce our proprietary product called Symfoware. It's not based on Postgres.

It has shared nothing scale-out functionality with full ACID based on 2PC, conventional 2PL locking and distributed deadlock resolution. Despite being shared nothing, all the database files and transaction logs are stored on shared storage.

The database is divided into "log groups." Each log group has one transaction log and multiple tablespaces (it's called "database space" instead of tablespace.)

Each DB instance in the cluster owns multiple log groups, and handles reads/writes to the data in its owning log groups. When a DB instance fails, other surviving DB instances take over the log groups of the failed DB instance, recover the data using the transaction log of the log group, and accepts reads/writes to the data in the log group. The DBA configures which DB instance initially owns which log groups and which DB instances are candidates to take over which log groups.

This way, no server is idle as a standby. All DB instances work hard to process read-write transactions. This "no idle server for HA" is one of the things Oracle RAC users want in terms of cost.

However, it still requires data and application partitioning unlike Hyder. Does anyone think of a way to eliminate partitioning? Data and application partitioning is what Oracle RAC users want to avoid or cannot tolerate.

Ref: Introduction of the Symfoware shared nothing scale-out called "load share."

https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf

Regards

Takayuki Tsunakawa

--- below is Sumanta's original mail ---

From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>
Sent: Wednesday, June 17, 2020 5:34 PM
To: Tsunakawa, Takayuki/綱川貴之 <tsunakawa.takay@fujitsu.com>
Cc: Bruce Momjian <bruce@momjian.us>; Merlin Moncure <mmoncure@gmail.com>; Robert Haas <robertmhaas@gmail.com>; maumau307@gmail.com
Subject: Re: I'd like to discuss scaleout at PGCon

Hello,

I saw the presentation and it is great except that it seems to be unclear of both SD and SN if the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two work (ref below) has some information about the usefulness of this technique for scale out and so it would be an interesting addition to see if in the SN architecture that is being proposed could be modified to take care of this phenomenon and reap the gain.

1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
Transactional Record Manager for Shared Flash. In CIDR 2011.

2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb.
blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html.

With Regards,

Sumanta Mukherjee.

EnterpriseDB: http://www.enterprisedb.com

Re: I'd like to discuss scaleout at PGCon

From

Sumanta Mukherjee

Date:

22 June 2020, 04:53:33

Hi,

I read through the symfora paper and it is a nice technique. I am not very sure about where Hyder is used commercially but given that it has come out of Microsoft Research so some microsoft products might be using it/some of these concepts already.

With Regards,

Sumanta Mukherjee.

EnterpriseDB: http://www.enterprisedb.com

On Wed, Jun 17, 2020 at 9:38 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:

Hello,

It seems you didn't include pgsql-hackers.

From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>
> I saw the presentation and it is great except that it seems to be unclear of both SD and SN if the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two work (ref below) has some information about the usefulness of this technique for scale out and so it would be an interesting addition to see if in the SN architecture that is being proposed could be modified to take care of this phenomenon and reap the gain.

Thanks. Separation of compute and storage is surely to be considered. Unlike the old days when the shared storage was considered to be a bottleneck with slow HDDs and FC-SAN, we could now expect high speed shared storage thanks to flash memory, NVMe-oF, and RDMA.

> 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
> Transactional Record Manager for Shared Flash. In CIDR 2011.

This is interesting. I'll go into this. Do you know there's any product based on Hyder? OTOH, Hyder seems to require drastic changes when adopting for Postgres -- OCC, log-structured database, etc. I'd like to hear how feasible those are. However, its scale-out capability without the need for data or application partitioning appears appealing.

To explore another possibility that would have more affinity with the current Postgres, let me introduce our proprietary product called Symfoware. It's not based on Postgres.

It has shared nothing scale-out functionality with full ACID based on 2PC, conventional 2PL locking and distributed deadlock resolution. Despite being shared nothing, all the database files and transaction logs are stored on shared storage.

The database is divided into "log groups." Each log group has one transaction log and multiple tablespaces (it's called "database space" instead of tablespace.)

Each DB instance in the cluster owns multiple log groups, and handles reads/writes to the data in its owning log groups. When a DB instance fails, other surviving DB instances take over the log groups of the failed DB instance, recover the data using the transaction log of the log group, and accepts reads/writes to the data in the log group. The DBA configures which DB instance initially owns which log groups and which DB instances are candidates to take over which log groups.

This way, no server is idle as a standby. All DB instances work hard to process read-write transactions. This "no idle server for HA" is one of the things Oracle RAC users want in terms of cost.

However, it still requires data and application partitioning unlike Hyder. Does anyone think of a way to eliminate partitioning? Data and application partitioning is what Oracle RAC users want to avoid or cannot tolerate.

Ref: Introduction of the Symfoware shared nothing scale-out called "load share."
https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf

Regards
Takayuki Tsunakawa

--- below is Sumanta's original mail ---
From: Sumanta Mukherjee <sumanta.mukherjee@enterprisedb.com>
Sent: Wednesday, June 17, 2020 5:34 PM
To: Tsunakawa, Takayuki/綱川貴之 <tsunakawa.takay@fujitsu.com>
Cc: Bruce Momjian <bruce@momjian.us>; Merlin Moncure <mmoncure@gmail.com>; Robert Haas <robertmhaas@gmail.com>; maumau307@gmail.com
Subject: Re: I'd like to discuss scaleout at PGCon

Hello,

I saw the presentation and it is great except that it seems to be unclear of both SD and SN if the storage and the compute are being explicitly separated. Separation of storage and compute would have some cost advantages as per my understanding. The following two work (ref below) has some information about the usefulness of this technique for scale out and so it would be an interesting addition to see if in the SN architecture that is being proposed could be modified to take care of this phenomenon and reap the gain.

1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A
Transactional Record Manager for Shared Flash. In CIDR 2011.

2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb.
blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

15 September 2020, 04:37:32

Hello all,


I've overhauled the scaleout design wiki I presented at PGCon 2018
developer unconference and assembled the research of other DBMSs'
scale-out features.

Scaleout Design
https://wiki.postgresql.org/wiki/Scaleout_Design

I intentionally have put little conclusion on our specification and
design.  I'd like you to look at recent distributed databases, and
then think about and discuss what we want to aim for together.  I feel
it's better to separate a thread per topic or group of topics.

I'm sorry, but I'm not confident about the readability at all, because
I cannot draw figures due to my visual impairment, and the page is
full of text only.

What workload do you think we should focus on first, OLTP or
analytics?  I think OLTP, because open source Postgres probably has
been so far getting popular with OLTP.  Also, I don't expect many
people will use existing popular SaaS for data warehousing like Amazon
Redshift, Azure Synapse, Google BigQuery and Snowflake, rather than
build their analytics databases on public IaaS or on-premises.


Regards
MauMau

Re: I'd like to discuss scaleout at PGCon

From

"MauMau"

Date:

15 September 2020, 04:40:36

Hello all,

# I'm resending because some error occurred

I've overhauled the scaleout design wiki I presented at PGCon 2018
developer unconference and assembled the research of other DBMSs'
scale-out features.

Scaleout Design
https://wiki.postgresql.org/wiki/Scaleout_Design

I intentionally have put little conclusion on our specification and
design.  I'd like you to look at recent distributed databases, and
then think about and discuss what we want to aim for together.  I feel
it's better to separate a thread per topic or group of topics.

I'm sorry, but I'm not confident about the readability at all, because
I cannot draw figures due to my visual impairment, and the page is
full of text only.

What workload do you think we should focus on first, OLTP or
analytics?  I think OLTP, because open source Postgres probably has
been so far getting popular with OLTP.  Also, I don't expect many
people will use existing popular SaaS for data warehousing like Amazon
Redshift, Azure Synapse, Google BigQuery and Snowflake, rather than
build their analytics databases on public IaaS or on-premises.


Regards
MauMau

RE: I'd like to discuss scaleout at PGCon

From

"tsunakawa.takay@fujitsu.com"

Date:

24 September 2020, 09:24:41

From: MauMau <maumau307@gmail.com>
> I intentionally have put little conclusion on our specification and
> design.  I'd like you to look at recent distributed databases, and
> then think about and discuss what we want to aim for together.  I feel
> it's better to separate a thread per topic or group of topics.

Finally, MariaDB is now equiped with a scale-out feature.

How MariaDB achieves global scale with Xpand
https://www.infoworld.com/article/3574077/how-mariadb-achieves-global-scale-with-xpand.html

I haven't read its documentation, and am not planning to read it now, but the feature looks nice.  I hope this will
alsobe a good stimulus.  I believe we should be aware of competitiveness when designing -- "If we adopt this
architectureor design to simplify the first version, will it really naturally evolve to a competitive product in the
futurewithout distorting the concept, design, and interface?"
 


Regards
Takayuki Tsunakawa