Thread: Question about querying distributed databases

Question about querying distributed databases

From
me nefcanto
Date:

Hello

Consider this scenario:

  • 3 servers, 3 databases, each on a separate server:
    • Products database: Contains the Products table (with over 100,000 records).
    • Taxonomy database: Contains the Categories and ItemCategories (EAV) tables.
    • Attributes database: Contains the Attributes and ItemAttributes (EAV) tables.

How do you find products based on the following criteria?

  1. A search in the title (e.g., "awesome shirts").
  2. Selected categories (e.g., "casual" and "sports").
  3. Selected attributes (e.g., "color: blue" and "size: large")

Regards
Saeed

Re: Question about querying distributed databases

From
Adrian Klaver
Date:
On 3/4/25 20:40, me nefcanto wrote:
> Hello
> 
> Consider this scenario:
> 
>   * 3 servers, 3 databases, each on a separate server:
>       o *Products database*: Contains the *Products* table (with over
>         100,000 records).
>       o *Taxonomy database*: Contains the *Categories* and
>         *ItemCategories (EAV)* tables.
>       o *Attributes database*: Contains the *Attributes* and
>         *ItemAttributes (EAV)* tables.
> 
> How do you find products based on the following criteria?

https://www.postgresql.org/docs/current/postgres-fdw.html
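
For reference, a minimal postgres_fdw setup along the lines of that page might look like this (the server name, host, credentials, and schema names below are illustrative assumptions, not from the thread):

```sql
-- Minimal postgres_fdw sketch; repeat per remote database.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server definition per remote database.
CREATE SERVER taxonomy_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'taxonomy-host', dbname 'taxonomy', port '5432');

CREATE USER MAPPING FOR CURRENT_USER SERVER taxonomy_srv
    OPTIONS (user 'app_user', password 'secret');

-- Expose the remote tables locally under a schema of the same name.
CREATE SCHEMA IF NOT EXISTS taxonomy;
IMPORT FOREIGN SCHEMA public
    FROM SERVER taxonomy_srv INTO taxonomy;
```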

> 
>  1. A search in the title (e.g., "awesome shirts").
>  2. Selected categories (e.g., "casual" and "sports").
>  3. Selected attributes (e.g., "color: blue" and "size: large")
> 
> 
> Regards
> Saeed

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Question about querying distributed databases

From
me nefcanto
Date:
Adrian Klaver, thank you for the link. I asked the AI to create a query for me using FDW.

This is the sample query:

with filtered_products as (
    select p.product_id
    from products.product p
    where p.title ilike '%search_term%'
), category_filtered as (
    select ic.product_id
    from taxonomy.item_categories ic
    where ic.category_id = any(array['category_id_1', 'category_id_2'])
), attribute_filtered as (
    select ia.product_id
    from attributes.item_attributes ia
    where ia.attribute_id = any(array['attribute_id_1', 'attribute_id_2'])
), final_products as (
    select f.product_id
    from filtered_products f
    join category_filtered c on f.product_id = c.product_id
    join attribute_filtered a on f.product_id = a.product_id
    order by f.product_id -- replace with relevant sorting column
    limit 50 offset 0
)
select p.*
from products.product p
join final_products fp on p.product_id = fp.product_id;

The problem here is that it pulls all of the product_id values from the ItemCategories table. Even if each product is in only one category, that is 100,000 records in ItemCategories. So, to show a list of 20 products on the website, this query first fetches 100,000 product_id values from the remote server.

That's not scalable. Is there a workaround for this?

Thank you
Saeed


Re: Question about querying distributed databases

From
Laurenz Albe
Date:
On Wed, 2025-03-05 at 10:12 +0330, me nefcanto wrote:
> Adrian Klaver, thank you for the link. I asked the AI to create a query for me using FDW.
>
> The problem here is that it collects all of the product_id values from the ItemCategories table [...]
>
> That's not scalable. Is there a workaround for this?

Without having scrutinized the case in detail: if your data are organized in an
entity-attribute-value design and distributed across three databases, you cannot
expect to end up with efficient queries.

Perhaps you can extract the data and load them into a reasonably organized
single database.  Such an ETL process might make the task much easier.

Yours,
Laurenz Albe



Re: Question about querying distributed databases

From
me nefcanto
Date:
Laurenz Albe, thanks for your answer.

Right now this data is in MariaDB, in separate databases (schemas) on a single server. The solution in this situation is a cross-database query. (This is the status quo of our application.)

Now our team has decided to migrate to Postgres. However, we realized that Postgres does not support cross-database queries; to do that, we have to use FDW. So we thought that if we have to write more code anyway, we might as well put the databases on separate servers for scalability. That's the reason behind this question.

But we're stuck on performance. In SQL Server and MariaDB, cross-database queries allow for neat separation of data while delivering good performance in the orchestration layer. You have separate databases, which allows for fine-grained management (different backup schedules, index recalculation, deployment, etc.), but at the same time you can write a query in your app, or in an orchestrator database (let's call it All), that is fast enough for millions of records.

However, we're stuck on this in Postgres. What solutions exist for this problem?

Regards
Saeed



Re: Question about querying distributed databases

From
Laurenz Albe
Date:
On Wed, 2025-03-05 at 12:57 +0330, me nefcanto wrote:
> Right now this data is in MariaDB, on separate databases (schema) but on one
> server. The solution in this situation is to have a cross-database query.
> (this is the status quo of our application)
>
> Now our team has decided to migrate to Postgres. However, we realized that
> Postgres does not support cross-database queries. And if we want to do so,
> we should use FDW. So, we thought we might as well put databases on separate
> servers for scalability if we have to write more code. That's the reason
> behind this question.

In MySQL, the terms "database" and "schema" are used for the same thing.
Not so in PostgreSQL.  I think you should migrate the data into different
schemas in a single database, pretty much like you had it in MySQL.
Then you don't need a foreign data wrapper, and I bet the query can
perform as well as it did on MySQL.
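
A minimal sketch of that layout (the schema and table names follow the thread's examples; the column definitions are assumptions):

```sql
-- One database, one schema per former MariaDB "database".
CREATE SCHEMA products;
CREATE SCHEMA taxonomy;

CREATE TABLE products.product (
    product_id bigint PRIMARY KEY,
    title      text NOT NULL
);

CREATE TABLE taxonomy.item_categories (
    product_id  bigint NOT NULL,
    category_id bigint NOT NULL
);

-- Cross-schema joins are ordinary joins; the planner sees all the tables
-- and all their statistics, so no FDW round trips are involved.
SELECT p.product_id, p.title
FROM products.product p
JOIN taxonomy.item_categories ic USING (product_id)
WHERE p.title ILIKE '%shirt%'
  AND ic.category_id IN (1, 2);
```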

Yours,
Laurenz Albe




Re: Question about querying distributed databases

From
me nefcanto
Date:
Dear Laurenz, 

That means a solid monolithic database. We lose many goodies with that. As a real-world example, right now we can import a single database from production to development to test and troubleshoot data.

What if we host all databases on the same server and use FDW? What happens in that case? Does it return 100 thousand records and join them in memory?

Because in SQL Server, when you perform a cross-database query (not cross-server), the performance is extremely good, suggesting that it does not return 100 thousand ItemId values from Taxonomy.ItemCategories to join with ProductId.

Is that the same case with Postgres too, if the databases are located on one server?

Regards
Saeed

Re: Question about querying distributed databases

From
Laurenz Albe
Date:
On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> That means a solid monolith database. We lose many goodies with that.
> As a real-world example, right now we can import a single database
> from the production to the development to test and troubleshoot data.

Well, can't you import a single schema then?

> What if we host all databases on the same server and use FDW. What
> happens in that case? Does it return 100 thousand records and join
> in the memory?

It will do just the same thing.  The performance could be better
because of the reduced latency.

> Because in SQL Server, when you perform a cross-database query
> (not cross-server) the performance is extremely good, proving that
> it does not return 100 thousand ItemId from Taxonomy.ItemCategories
> to join with ProductId.
>
> Is that the same case with Postgres too, If databases are located
> on one server?

No, you cannot perform cross-database queries without a foreign
data wrapper.  I don't see a reason why the statement shouldn't
perform as well as in SQL Server if you use schemas instead of
databases.

Yours,
Laurenz Albe



Re: Question about querying distributed databases

From
me nefcanto
Date:
Dear Laurenz, the point is that I think if we put all databases into one database, then we have blocked our future growth. A monolithic database can be scaled only vertically. We have had huge headaches in the past with SQL Server on Windows and a single database. But when you divide bounded contexts into different databases, you have the chance to deploy each database on a separate physical machine. That means a lot in terms of performance. Please correct me if I am wrong.

Let's put this physical restriction on ourselves: we have different databases. What options do we have? One option that comes to mind is to store the IDs of the categories in the Products table. This means that I don't need FDW anymore, and the databases can be on separate machines. I first query the categories database, get the category IDs, and then add a where clause to limit the product search. That could be an option. The array data type in Postgres is something that I think other RDBMSs do not have. Will that work? And how about attributes? Because attributes are more than a single ID: I have to store the attribute key alongside its value. It's a key-value pair. What can I do for that?
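
As a rough sketch of that idea (the column names, types, and operators below are assumptions, not a recommendation from the thread): category IDs could be stored as an integer array with a GIN index, and key-value attributes could go into a jsonb column, which also supports GIN indexing and containment queries.

```sql
-- Denormalized sketch: categories as an array, attributes as jsonb.
ALTER TABLE products
    ADD COLUMN category_ids bigint[] NOT NULL DEFAULT '{}',
    ADD COLUMN attributes   jsonb    NOT NULL DEFAULT '{}';

CREATE INDEX products_category_ids_gin ON products USING gin (category_ids);
CREATE INDEX products_attributes_gin   ON products USING gin (attributes);

-- Products in category 7 or 9 that are blue and large.
SELECT product_id
FROM products
WHERE category_ids && ARRAY[7, 9]::bigint[]               -- array overlap
  AND attributes @> '{"color": "blue", "size": "large"}'; -- jsonb containment
```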

Thank you for sharing your time. I really appreciate it.
Saeed






Re: Question about querying distributed databases

From
Laurenz Albe
Date:
On Wed, 2025-03-05 at 15:45 +0330, me nefcanto wrote:
> Dear Laurenz, the point is that I think if we put all databases into one database,
> then we have blocked our growth in the future.

Hard to say.
If you want to shard for horizontal scaling, that usually only works well
if there are few interconnections between the different shards.  If you end
up joining data from different shards, you usually lose.

> A monolith database can be scaled only vertically. We have had huge headaches
> in the past with SQL Server on Windows and a single database.

Without knowing what the headaches were, it is hard to answer something here.

> But when you divide bounded contexts into different databases, then you have
> the chance to deploy each database on a separate physical machine. That means
> a lot in terms of performance. Please correct me if I am wrong.

I don't know if you are wrong.  But it seems like it is not working well,
is it?  Perhaps you can explain how splitting up the data might result in
better performance.  Is that just a guess or do you have a reason to think so?

> Let's put this physical restriction on ourselves that we have different
> databases. What options do we have?

I don't know.  Based on what you showed: perhaps a complete re-design?

Yours,
Laurenz Albe



Re: Question about querying distributed databases

From
Greg Sabino Mullane
Date:
On Wed, Mar 5, 2025 at 7:15 AM me nefcanto <sn.1361@gmail.com> wrote:
I think if we put all databases into one database, then we have blocked our growth in the future.

I think this is premature optimization. Your products table has 100,000 rows. That's very tiny for the year 2025. Try putting everything on one box with good indexes and you might be surprised at the performance.
 
A monolith database can be scaled only vertically.

Postgres scales well vertically. Plus, you can have streaming replicas to distribute the read queries (like the one given here) across many machines.
 
We have had huge headaches in the past with SQL Server on Windows and a single database.
But when you divide bounded contexts into different databases, then you have the chance to deploy each database on a separate physical machine. That means a lot in terms of performance.

I get your concern, but if the data is inter-related, it really is best to have them on the same server (and same database, and same schema). Then Postgres can devise a really efficient plan. You can also use Citus to start sharding things across multiple physical servers if your database gets very large.
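
For scale, the Citus route Greg mentions looks roughly like this (assuming the Citus extension is installed; the choice of product_id as the distribution column is an assumption):

```sql
CREATE EXTENSION citus;

-- Shard the tables across worker nodes by product_id, so rows for the
-- same product are co-located and joins on product_id stay shard-local.
SELECT create_distributed_table('products', 'product_id');
SELECT create_distributed_table('item_categories', 'product_id');
```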

Let's put this physical restriction on ourselves that we have different databases. What options do we have?

Your main option is FDW, which will never perform as well as a single server. Plus, you have the additional headache of trying to coordinate data updates atomically across different servers. The other option is to have the application do the work, e.g. pull a list of things from one server, use that to build a query against another one, etc. Definitely not ideal.
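
The "application does the work" option can be sketched as follows: fetch candidate product IDs from each server separately, then intersect and paginate in the application before resolving the final page against the products server. This is a minimal in-memory sketch; the fetch_* functions are stand-ins for real queries against the taxonomy and attributes servers.

```python
def fetch_category_matches():
    # Stand-in for: SELECT product_id FROM item_categories
    #               WHERE category_id = ANY(%s)  -- on the taxonomy server
    return {1, 2, 3, 5, 8}

def fetch_attribute_matches():
    # Stand-in for: SELECT product_id FROM item_attributes
    #               WHERE ...                    -- on the attributes server
    return {2, 3, 5, 13}

def search_products(page_size=2, offset=0):
    # Intersect candidate IDs in the application layer.
    candidates = fetch_category_matches() & fetch_attribute_matches()
    # The final page would be resolved on the products server with
    # WHERE product_id = ANY(%s) ORDER BY ... LIMIT/OFFSET.
    return sorted(candidates)[offset:offset + page_size]

print(search_products())          # first page of matching IDs
print(search_products(offset=2))  # second page
```

As Greg notes, this is definitely not ideal: every candidate set crosses the network, and consistency between the servers is the application's problem.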

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Question about querying distributed databases

From
Adrian Klaver
Date:
On 3/5/25 04:15, me nefcanto wrote:
> Dear Laurenz, the point is that I think if we put all databases into one 
> database, then we have blocked our growth in the future.

How?

> A monolith database can be scaled only vertically. We have had huge 
> headaches in the past with SQL Server on Windows and a single database.
> But when you divide bounded contexts into different databases, then you 
> have the chance to deploy each database on a separate physical machine. 
> That means a lot in terms of performance. Please correct me if I am wrong.

And you add the complexity of talking across machines, as well as 
maintaining separate machines.

> 
> Let's put this physical restriction on ourselves that we have different 
> databases. What options do we have? One option that comes to my mind, is 
> to store the ID of the categories in the Products table. This means that 
> I don't need FDW anymore. And databases can be on separate machines. I 
> first query the categories database first, get the category IDs, and 
> then add a where clause to limit the product search. That could be an 
> option. Array data type in Postgres is something that I think other 
> RDBMSs do not have. Will that work? And how about attributes? Because 
> attributes are more than a single ID. I should store the attribute key, 
> alongside its value. It's a key-value pair. What can I do for that?

You seem to be going out of the way to make your life more complicated.

The only way you are going to find an answer is set up test cases and 
experiment. My bet is a single server with a single database and 
multiple schemas is where you end up, after all that is where you are 
starting from.



-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Question about querying distributed databases

From
me nefcanto
Date:
I once worked with a monolithic SQL Server database with more than 10 billion records and about 8 Terabytes of data. A single backup took us more than 21 days. It was a nightmare. Almost everybody knows that scaling up has a ceiling, but scaling out has no boundaries.

Therefore I will never choose a monolithic database design unless it's a small project. But my examples are just examples. We predict 100 million records per year, so we have to design accordingly. And it's not just sales records. Many applications have data that is cheap but vast in volume. Consider a language-learning app that wants to store the known words of every learner. 10 thousand learners each knowing 2 thousand words means 20 million records. Change that to 100 thousand learners each knowing 7 thousand words and you almost have a billion records. Cheap, but necessary. Let's not even dive into telemetry or time-series data.

We initially chose to break the database into smaller databases, because it seemed natural for our modularized monolith architecture. And it worked great for SQL Server. If you're small, we host them all on one server. If you get bigger, we can put heavy databases on separate machines.

However, I don't have experience working with other types of database scaling. I have used table partitioning, but I have never used sharding.

Anyway, that's why I asked you guys. However, encouraging me to go back to a monolith without offering solutions on how to scale is not helping. To be honest, I'm somewhat disappointed that the most advanced open source database does not support cross-database querying the way SQL Server does. But if it doesn't, it doesn't. Our team should either drop it as a choice or find a way (by asking the experts who built it or use it) to design based on its features. That's why I'm asking.

One thing that comes to mind is to use custom types. Instead of storing data in ItemCategories and ItemAttributes, store them as arrays in the relevant tables in the same database. But then it seems to me that Mongo would become a better choice, because I somewhat lose the relational nature and normalization. What drawbacks have you experienced in that sense?

Regards
Saeed


Re: Question about querying distributed databases

From
Igor Korot
Date:
Hi,

On Wed, Mar 5, 2025, 8:44 PM me nefcanto <sn.1361@gmail.com> wrote:
I once worked with a monolithic SQL Server database with more than 10 billion records and about 8 Terabytes of data. A single backup took us more than 21 days. It was a nightmare. Almost everybody knows that scaling up has a ceiling, but scaling out has no boundaries.

But then you did the backups incrementally, correct?

That should not take the same amount of time...



Therefore I will never choose a monolithic database design unless it's a small project. But my examples are just examples. We predict 100 million records per year. So we have to design accordingly. And it's not just sales records. Many applications have requirements that are cheap data but vast in multitude. Consider a language-learning app that wants to store the known words of any learner. 10 thousand learners each knowing 2 thousand words means 20 million records. Convert that to 100 thousand learners each knowing 7 thousand words and now you almost have a billion records. Cheap, but necessary. Let's not dive into telemetry or time-series data.

Can you try and see if 1 server with 3 different databases will do?

Having 1 table per database per server is too ugly.

Also, please understand: every database is different, and so it works and operates differently. What works well in one may not work well in another...

Thank you.



Re: Question about querying distributed databases

From
Rob Sargent
Date:


On Mar 5, 2025, at 8:03 PM, Igor Korot wrote:

> [...] it worked great for SQL Server. If you're small, we host them all on one server. If you get bigger, we can put heavy databases on separate machines.
>
> However, I don't have experience working with other types of database scaling. I have used table partitioning, but I have never used sharding.
>
> Anyway, that's why I asked you guys. However, encouraging me to go back to a monolith without offering solutions on how to scale is not helping. To be honest, I'm somewhat disappointed that the most advanced open source database does not support cross-database querying the way SQL Server does. But if it doesn't, it doesn't. Our team should either drop it as a choice or find a way (by asking the experts who built it or use it) to design based on its features. That's why I'm asking.


Cross-database on MSSQL is identical to cross-schema on Postgres. If you truly need cross-server support (versus, say, beefier hardware), how did you come to choose Postgres? The numbers you present are impressive but not unheard of on this list.

Re: Question about querying distributed databases

From
Ron Johnson
Date:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <sn.1361@gmail.com> wrote:
I once worked with a monolithic SQL Server database with more than 10 billion records and about 8 Terabytes of data. A single backup took us more than 21 days. It was a nightmare.

25 years ago (meaning much slower hardware), I managed a 1TB database.  Backups took about 4 hours.  Could have gotten it down to two hours if I'd wanted to use more tape drives.

Right now, I manage a 5TB database. Backups take 110 minutes, and that's when using one channel for all IO, writing to a NAS that is not the fastest, with other 3+TB databases backing up to it at the same time.
 
Almost everybody knows that scaling up has a ceiling

And that ceiling is much, much higher than you think it is.
 
, but scaling out has no boundaries.

Except for complexity and fragility. I bet I could get good scaled-up performance out of the amount of hardware you're using to scale out.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Re: Quesion about querying distributed databases

From
me nefcanto
Date:
I appreciate your time guys. Thank you very much.

Having 1 table per database per server is too ugly.

Our databases are not one table per database. They map to DDD's bounded contexts, usually with one table per domain entity.
For example, we have these databases:

  • Contacts
  • Courses
  • Seo
  • Payment
  • Forms
  • Geo
  • Sales
  • Media
  • Taxonomy
  • ...
These are the tables we have in the Contacts database:

  • Addresses
  • AddressTypes
  • Attributes
  • BankAccounts
  • ContactContents
  • Contacts
  • Emails
  • Genders
  • JobTitles
  • JuridicalPersons
  • NaturalPersonRelations
  • NaturalPersons
  • Persons
  • Phones
  • PhoneTypes
  • Relations
  • RelationTypes
  • SocialNetworks
  • SocialProfiles
  • Titles
And, these are the tables we have in the Geo database:

  • AdministrativeDivisions
  • AdministrativeDivisionTypes
  • Cities
  • CityDivisions
  • Countries
  • Locations
  • SpatialDataItems
  • TelephonePrefixes
  • TimeZones
But we also do have databases that only have one table in them. The number of tables is not our criterion for splitting them; business semantics is.

Cross-database on MSSQL is identical to cross-schema on Postgres.

Cross-database queries in SQL Server are not equivalent to cross-schema queries in Postgres, because SQL Server also has the concept of schemas. In other words, both SQL Server and Postgres let you create databases, create schemas inside them, and create tables inside schemas. So it is SQL Server's cross-schema query that equals Postgres's cross-schema query.
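To make that concrete, here is a sketch of the same join written both ways; the schema and column names are illustrative, not our actual model:

```sql
-- SQL Server: a real cross-database query via three-part names
SELECT p.Title, c.Name
FROM Products.dbo.Products AS p
JOIN Taxonomy.dbo.ItemCategories AS ic ON ic.ProductId = p.ProductId
JOIN Taxonomy.dbo.Categories AS c ON c.CategoryId = ic.CategoryId;

-- Postgres: the same-server equivalent is cross-schema, two-part names
-- inside a single database (no FDW involved)
SELECT p.title, c.name
FROM products.products AS p
JOIN taxonomy.item_categories AS ic ON ic.product_id = p.product_id
JOIN taxonomy.categories AS c ON c.category_id = ic.category_id;
```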

If you truly need cross server support (versus say beefier hardware) how did you come to choose postgres?

We chose Postgres for these reasons that we did R&D about:

  • Native array per column support
  • Not having multiple storage engines like MariaDB to be confused about
  • Supporting expressions in unique constraints
  • It's usually considered one of the best when it comes to performance, especially in GIS, which we intend to develop more upon
  • As it claims on its website, it's the most advanced open-source database engine (but to be honest, we saw many serious drawbacks to that statement)
But here's the deal. We don't have only one project, and we don't need cross-server queries for all of our projects. But we tend to keep our architecture the same across projects as much as we can. We chose Postgres because we had experience with SQL Server and MariaDB and assumed that cross-database queries on the same server are something natural. Both of them support that, and both are very performant at it. On MariaDB all you have to do is use `db_name.table_name`, and on SQL Server all you have to do is use `database_name.schema_name.table_name`. So we thought that for projects that do not need more than one server, we keep the databases on the same server. When a project needs more resources, we start by taking the heavy databases onto their own servers, and we start implementing table partitioning on them.

But we have seen some amazing improvements in our initial tests too. For example, creating all databases, tables, and other database objects takes more than 400 seconds on MariaDB, while the same took 80 seconds on Postgres. Amazing DDL performance.
Also, bulk-inserting 1 million records takes about one-sixth to one-quarter of the time it takes on MariaDB. These are valuable numbers. They encouraged us to keep digging as much as we can to see whether we can perform this migration.

Regards
Saeed


Re: Quesion about querying distributed databases

From
Laurenz Albe
Date:
On Thu, 2025-03-06 at 06:13 +0330, me nefcanto wrote:
> I once worked with a monolithic SQL Server database with more than 10 billion
> records and about 8 Terabytes of data. A single backup took us more than 21 days.
> It was a nightmare. Almost everybody knows that scaling up has a ceiling, but
> scaling out has no boundaries.

I hear you, and I agree with that.


> We initially chose to break the database into smaller databases, because it
> seemed natural for our modularized monolith architecture. And it worked great
> for SQL Server. If you're small, we host them all on one server. If you get
> bigger, we can put heavy databases on separate machines.

So you mean that you had those databases on different servers?
How would a cross-database query work in that case?  It must be something
akin to foreign data in PostgreSQL.

If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.  Look at the execution plan you got on SQL Server
and see where PostgreSQL chooses a different plan.  Then try to improve that.
We can try to help if we see actual plans.
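(For reference, a minimal postgres_fdw setup for one remote database looks roughly like this; the host, credentials, and schema names are placeholders:)

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote Taxonomy server
CREATE SERVER taxonomy_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'taxonomy-host', port '5432', dbname 'taxonomy');

CREATE USER MAPPING FOR CURRENT_USER SERVER taxonomy_srv
    OPTIONS (user 'app', password 'secret');

-- Mirror the remote tables into a local schema; afterwards they can be
-- joined like ordinary tables, and EXPLAIN shows the resulting plan.
CREATE SCHEMA IF NOT EXISTS taxonomy;
IMPORT FOREIGN SCHEMA public FROM SERVER taxonomy_srv INTO taxonomy;
```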

> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.

Well, if you split the data into several databases, that *was* sharding.

> Anyway, that's why I asked you guys. However, encouraging me to go back to
> monolith without giving solutions on how to scale, is not helping. To be
> honest, I'm somehow disappointed by how the most advanced open source
> database does not support cross-database querying just like how SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built it or use it) how
> to design based on its features. That's why I'm asking.

Excluding options from the start is limiting yourself.  Consider using
other, better databases than PostgreSQL (if you can find them).

It is difficult to come up with a concrete design based on the information
you provided.  Perhaps you should get a consultant; the mailing list does
not seem to be the right format for that request.

Typically, you split the data in a way that leaves few interconnections,
for example per customer, so that you don't regularly end up joining data
from different databases (shards).

> One thing that comes to my mind, is to use custom types. Instead of storing
> data in ItemCategories and ItemAttributes, store them as arrays in the
> relevant tables in the same database.

Don't ever store arrays in the database.  It will be a nightmare.
You seem to be drawn to questionable data design...

Yours,
Laurenz Albe



Re: Quesion about querying distributed databases

From
Achilleas Mantzios - cloud
Date:
On 3/5/25 11:55, Laurenz Albe wrote:
> On Wed, 2025-03-05 at 12:57 +0330, me nefcanto wrote:
>> Right now this data is in MariaDB, on separate databases (schema) but on one
>> server. The solution in this situation is to have a cross-database query.
>> (this is the status quo of our application)
>>
>> Now our team has decided to migrate to Postgres. However, we realized that
>> Postgres does not support cross-database queries. And if we want to do so,
>> we should use FDW. So, we thought we might as well put databases on separate
>> servers for scalability if we have to write more code. That's the reason
>> behind this question.
> In MySQL, the terms "database" and "schema" are used for the same thing.
> Not so in PostgreSQL.  I think you should migrate the data into different
> schemas in a single database, pretty much like you had it in MySQL.
> Then you don't need a foreign data wrapper, and I bet the query can
> perform as well as it did on MySQL.
Same thing in MS SQL, as of version 2017 or so.



Re: Quesion about querying distributed databases

From
me nefcanto
Date:
Dear Laurenz

I hear you, and I agree with that.

Thank you. Such a relief.

If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.

You're right. We had problems with cross-server queries on SQL Server and MariaDB too; it seems no engine has really solved cross-server queries. But we had no problem with cross-database queries. That's where it worked well on both SQL Server and MariaDB. It seems that for cross-database queries, Postgres returns the entire result set from the other database and then performs the joins locally, and that it makes no difference to Postgres whether the foreign database is on the same machine or on another one. I say this only from reading the queries and asking questions about them; I have not run a test yet.
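(One way to verify this rather than guess: EXPLAIN VERBOSE on a query that touches a foreign table prints the Remote SQL that postgres_fdw ships to the other server. If the WHERE clause shows up there, the filter was pushed down instead of being applied after pulling everything across. A sketch, assuming the foreign tables discussed earlier in the thread:)

```sql
-- Look for the "Remote SQL:" line under the Foreign Scan node.
EXPLAIN (VERBOSE, ANALYZE)
SELECT ic.product_id
FROM taxonomy.item_categories AS ic
WHERE ic.category_id IN (1, 2);
```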

Well, if you split the data into several databases, that *was* sharding.

The way I understood it, sharding is when you split the database by rows, not by tables. Examples choose a column such as Tenant, User, or Date as the basis for sharding. Never have I seen an example that stores Orders in one database and Customers in another and calls it sharding. I don't know, but we might call that distributed databases.

Consider using other, better databases than PostgreSQL (if you can find them).

That's the point here. If we can't design a good thing on Postgres, then we stick back to MariaDB. That's why we're researching and testing. As I mentioned above, Postgres is amazing at some points but lacks some simple things that other engines expose out of the box.

Perhaps you should get a consultant; the mailing list does not seem to be the right format for that request.

We have done that over the last decade, for SQL Server and then for MariaDB. We have come up with some very practical and useful designs: separating CLOBs from main tables, storing only UUIDs as file names to match the cloud storage, storing date-times as UTC, using bigint everywhere (even for small tables) for consistency, denormalizing enum storage (storing text instead of a numeric value) even in large tables, and so on.

But to choose a technology, we do have enough literacy and experience. It's just some simple questions and answers. If I know that FDW works differently for same-server databases, then I know that we will migrate.

Don't ever store arrays in the database.  It will be a nightmare.

This is a very interesting claim. May I ask you to share its problems and your experience?



Re: Quesion about querying distributed databases

From
Igor Korot
Date:
Hi,

On Thu, Mar 6, 2025, 1:44 AM me nefcanto <sn.1361@gmail.com> wrote:
But here's the deal. We don't have only one project, and we don't need cross-server queries for all of our projects. But we tend to keep our architecture the same across projects as much as we can. We chose Postgres because we had experience with SQL Server and MariaDB and assumed that cross-database queries on the same server are something natural. Both of them support that, and both are very performant at it. On MariaDB all you have to do is use `db_name.table_name`, and on SQL Server all you have to do is use `database_name.schema_name.table_name`. So we thought that for projects that do not need more than one server, we keep the databases on the same server. When a project needs more resources, we start by taking the heavy databases onto their own servers, and we start implementing table partitioning on them.

But why?
Remember: multiple servers means more traffic, which can hurt performance.
Especially if those servers are located on different hardware, as your OP implied.

Thank you. 



Re: Quesion about querying distributed databases

From
Laurenz Albe
Date:
On Thu, 2025-03-06 at 12:15 +0330, me nefcanto wrote:
> We had problems with cross-server queries on SQL Server and MariaDB too.
> It seems that cross-server queries are not solved by any engine. But we
> had no problem with cross-database queries. That's where it worked well
> both on SQL Server and MariaDB.

But then you always worked with a monolithic system.  Splitting over several
"databases" (we call them schemas) does not allow you to scale horizontally.

> > Well, if you split the data into several databases, that *was* sharding.
>
> The way I understood it, sharding is when you split the database by rows,
> not by tables. Examples choose a column like Tenant or User or Date as the
> base of sharding. Never have I seen an example that stores Orders on one
> database and Customers on another database and call it sharding.

Right.  And I don't think that your data model is good.  It won't scale
well, because you don't get more tables as you get more data.


> But to choose a technology, we do have enough literacy and experience.
> It's just some simple questions and answers. If I know that FDW works
> differently for same-server databases, then I know that we will migrate.

It doesn't work any differently.

> > Don't ever store arrays in the database.  It will be a nightmare.
>
> This is a very interesting claim. May I ask you to share its problems
> and your experience?

If you store foreign keys in arrays, you have no referential integrity,
and any join over the array will be complicated and slow.

If you store many values in an array and you want to update one of these
values, you have to retrieve the whole array, construct a new array and
store that.  You cannot modify individual array elements, only the whole
thing.
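As a sketch of that difference (table and column names hypothetical):

```sql
-- Array variant: no per-element foreign key; replacing one category id
-- means rewriting the whole array value.
CREATE TABLE products_arr (
    product_id   bigint PRIMARY KEY,
    category_ids bigint[]
);
UPDATE products_arr
SET category_ids = array_replace(category_ids, 7, 9)
WHERE product_id = 1;

-- Junction-table variant: referential integrity per row,
-- and a one-row UPDATE instead of an array rewrite.
CREATE TABLE item_categories (
    product_id  bigint NOT NULL,
    category_id bigint NOT NULL,
    PRIMARY KEY (product_id, category_id)
);
UPDATE item_categories
SET category_id = 9
WHERE product_id = 1 AND category_id = 7;
```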

Of course there are ways to use arrays that are not harmful.  It's just
that almost always when I see someone use an array as table column, it
is an abuse of technology.

Yours,
Laurenz Albe



Re: Quesion about querying distributed databases

From
Greg Sabino Mullane
Date:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <sn.1361@gmail.com> wrote:
Anyway, that's why I asked you guys. However, encouraging me to go back to monolith without giving solutions on how to scale, is not helping.

We did. In addition to the ongoing FDW discussion, I mentioned read-only replicas and Citus. As far as *how* to scale vertically, we can offer general advice (more hardware resources, ramdisks for temp stuff, OS-level tuning, separate disk mounts). But a lot of it is tuning Postgres for your specific situation and your specific bottlenecks. Which we are happy to help with. Once we convince you to not throw the baby out with the bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days

Something was fundamentally wrong there. 

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

Re: Quesion about querying distributed databases

From
Igor Korot
Date:
Hi,

On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <sn.1361@gmail.com> wrote:
Anyway, that's why I asked you guys. However, encouraging me to go back to monolith without giving solutions on how to scale, is not helping.

We did. In addition to the ongoing FDW discussion, I mentioned read-only replicas and Citus. As far as *how* to scale vertically, we can offer general advice (more hardware resources, ramdisks for temp stuff, OS-level tuning, separate disk mounts). But a lot of it is tuning Postgres for your specific situation and your specific bottlenecks. Which we are happy to help with. Once we convince you to not throw the baby out with the bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days

Something was fundamentally wrong there. 

It could happen on an old and drained hardware... 😀

Thank you.



Re: Quesion about querying distributed databases

From
Ron Johnson
Date:
On Thu, Mar 6, 2025 at 10:47 AM Igor Korot <ikorot01@gmail.com> wrote:
Hi,

On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <sn.1361@gmail.com> wrote:
Anyway, that's why I asked you guys. However, encouraging me to go back to monolith without giving solutions on how to scale, is not helping.

We did. In addition to the ongoing FDW discussion, I mentioned read-only replicas and Citus. As far as *how* to scale vertically, we can offer general advice (more hardware resources, ramdisks for temp stuff, OS-level tuning, separate disk mounts). But a lot of it is tuning Postgres for your specific situation and your specific bottlenecks. Which we are happy to help with. Once we convince you to not throw the baby out with the bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days

Something was fundamentally wrong there. 

It could happen on an old and drained hardware... 😀

8TB databases existed 20+ years ago.  Like always, the hardware must fit the application.

21 days to backup a database absolutely means many things were improperly sized and configured.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Re: Quesion about querying distributed databases

From
Kevin Stephenson
Date:
Bumping this old thread to clarify a few points.

As an initial note on terminology, a "server" can be called a server, an instance, a server instance, or in PostgreSQL's (PG hereafter) case, a "DB Cluster." They all are used interchangeably in the wild to mean the same thing, with perhaps "server" being the most ambiguous. And that thing is a running process that is listening to (typically) a single port and handing connections off to individual threads or processes (or some other construct). The comments below use the word "instance" for consistency and brevity.

MariaDB
As mentioned by others, MariaDB (and MySQL proper, both hereafter referred to as MySQL) instances only support a single database. The CREATE DATABASE statement is a de facto synonym for CREATE SCHEMA (it is disappointing that MySQL and MariaDB docs still do not mention this significant fact). To disabuse those who believe MySQL supports multiple databases: query any of the INFORMATION_SCHEMA tables that have a *_CATALOG column (CATALOG is a synonym for DATABASE) and notice that every row shows "def", as in default, in that column. Further, the identifier used in CREATE DATABASE will show up in INFORMATION_SCHEMA tables in the *_SCHEMA columns.
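You can check this on any MySQL or MariaDB instance:

```sql
-- Every "database" created with CREATE DATABASE appears as a schema;
-- the catalog column is always 'def'.
SELECT DISTINCT table_catalog, table_schema
FROM information_schema.tables;
```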

MySQL supports one- and two-part object naming for DML, and it supports foreign data access using the Federated (deprecated in MariaDB) and FederatedX (MariaDB) storage engines. One-part names use the currently selected schema (via the USE statement) along with the <object_name>. Two-part names are <schema_name>.<object_name>. Foreign data (e.g. tables) is also accessed using two-part names, and there are limitations on what queries can be used with it (I believe things can be organized so that one-part names would work as well). I'm unaware of how advanced the pushdown optimizations have become, but the concept of pushdown is critical to understand, both for the MySQL storage engines and for the Federated* "foreign data wrappers" (FDW).

To summarize, all MySQL instances have a single database called "def" and there is no way to create another database on a single instance at this time (MySQL 9.2, MariaDB 11.8). To have more than one database currently requires running additional instances of MySQL on a different port. The idea that there are "high performance cross-database queries" in MySQL is simply incorrect at this time. There are high-performance cross-schema queries in MySQL, as there are in Microsoft SQL Server (MSSQL hereafter) and PG.

Note: MariaDB is planning to support real databases in 12.0, and they will be called CATALOGs, as in some other RDBMSs. See:

It's a big TBD how well this will perform with InnoDB and other storage engines for cross-catalog queries (if that will even be possible).

MSSQL (On-prem, VM, Azure SQL Managed Instance)
MSSQL supports one, two, three and four-part object naming.

One-part names are just the <object_name> and use the user's currently selected database (via the USE <database_name> command or as set in the connection string), along with the user's default schema, as prefixes to resolve object names (which behind the scenes are always three-part names: <database_name>.<schema_name>.<object_name>). That schema prefix is typically dbo (similar to public in PG). The default schema can be changed, similar to changing the PG search path.

Two-part names are <schema_name>.<object_name> and again use the user's currently selected database to get the fully qualified three-part name.

Where MSSQL differs from both MySQL and PG is in its support for three-part names. These are <database_name>.<schema_name>.<object_name>, and you can in fact run real "cross-database" MSSQL queries without FDW trickery. Historically, I believe there were some light limitations on their usage, such as in foreign keys and triggers, but that may no longer be the case.
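For example (database, table, and column names hypothetical):

```sql
-- T-SQL: a genuine cross-database join on one instance,
-- no linked server or FDW involved
SELECT o.OrderId, p.FirstName
FROM Sales.dbo.Orders AS o
JOIN Contacts.dbo.Persons AS p ON p.PersonId = o.CustomerId;
```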

Linked Servers are the MSSQL near equivalent of FDW and are used in four-part names: <linked server name>.<database_name>.<schema_name>.<object_name>.

Two other MSSQL concepts are CONTAINED DATABASES and Azure SQL Database (with "Elastic Queries" as FDW, shockingly in "preview" for about a decade now). Those are beyond the scope of this email.

PG
On the PG side (which I know the least about), it does have real databases with schemas, and it does support FDW. But since it uses a shared WAL (MSSQL at least separates logs per database; unclear on InnoDB et al.) and a shared security system, I don't understand why it doesn't support cross-database queries. But that question would be best asked in a separate thread (and maybe it's already in the archives anyway). Putting all relevant objects in a single database with different schemas has already been mentioned as a workaround.

As far as some of the other back and forth on database (and system) architecture, that's really an enormous (and specialty) topic. Having previously worked on one of the largest MSSQL installations in the world, I have not seen anything in PG that would prevent a proper scaling strategy, as long as "DB as API" is not attempted but rather microservices or some other pattern is used, and frequent massive cross-instance or cross-server (actual servers) JOINs are avoided. All three platforms discussed here will have performance problems with that.

Perhaps the original poster would benefit from reviewing one of the gold standards on this topic, "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems".

Kevin



Re: Quesion about querying distributed databases

From
Adrian Klaver
Date:
On 3/29/25 02:15, Kevin Stephenson wrote:
> Bumping this old thread to clarify a few points.
> 
> As an initial note on terminology, a "server" can be called a server, an 
> instance, a server instance, or in PostgreSQL's (PG hereafter) case, a 
> "DB Cluster." They all are used interchangeably in the wild to mean the 
> same thing, with perhaps "server" being the most ambiguous. And that 
> thing is a running process that is listening to (typically) a single 
> port and handing connections off to individual threads or processes (or 
> some other construct). The comments below use the word "instance" for 
> consistency and brevity.
> 
> MariaDB
> As mentioned by others, MariaDB (and MySQL proper, both hereafter 
> referred to as MySQL) instances only support a single database. The 
> CREATE DATABASE statement is a de facto synonym for CREATE SCHEMA (it is 
> disappointing that MySQL and MariaDB docs still do not mention this 
> significant fact). To disabuse those who believe MySQL supports multiple

Actually they do:

https://dev.mysql.com/doc/refman/9.2/en/create-database.html

"CREATE DATABASE creates a database with the given name. To use this 
statement, you need the CREATE privilege for the database. CREATE SCHEMA 
is a synonym for CREATE DATABASE."

https://mariadb.com/kb/en/create-database/

"CREATE DATABASE creates a database with the given name. To use this 
statement, you need the CREATE privilege for the database. CREATE SCHEMA 
is a synonym for CREATE DATABASE."

They don't expound on what that means in real terms.


> databases, query any of the INFORMATION_SCHEMA tables and for all those 
> that have the *_CATALOG column (CATALOG is a synonym for DATABASE), 
> notice every row in every table that shows a catalog column, they all 
> say "def", as in default. Further, the identifier used in CREATE 
> DATABASE will show up in INFORMATION_SCHEMA tables in *_SCHEMA columns.


-- 
Adrian Klaver
adrian.klaver@aklaver.com