Re: Storing thousands of csv files in postgresql - Mailing list pgsql-sql

From Steve Midgley
Subject Re: Storing thousands of csv files in postgresql
Date
Msg-id CAJexoSJwEKFLLuD=LHfHX-+yiaOBD3a8MWG3f969T_FWirj36A@mail.gmail.com
In response to Re: Storing thousands of csv files in postgresql  (Ion Alberdi <ion.alberdi@pricemoov.com>)
Responses Re: Storing thousands of csv files in postgresql  (Ion Alberdi <ion.alberdi@pricemoov.com>)
Re: Storing thousands of csv files in postgresql  (Rob Sargent <robjsargent@gmail.com>)
List pgsql-sql


On Tue, Feb 15, 2022 at 11:38 AM Ion Alberdi <ion.alberdi@pricemoov.com> wrote:
Thanks for these precious insights Steve!
>No matter what answer the community gives you about a reasonable number of tables per DB, if your project is successful you'll probably exceed that limit eventually.
Indeed!

>Why not plan for federation at the start (a little, at low cost) by including the PG server URL and DB name where the CSV is stored in your CSV table schema store?
So far I'd hoped that https://www.citusdata.com/ would have features to do so. Reading the docs, they do not seem to provide such federation, though;
I'll send them an email to be sure. Thanks again!

On Tue, Feb 15, 2022 at 5:20 PM Steve Midgley <science@misuse.org> wrote:


On Tue, Feb 15, 2022 at 12:15 AM Ion Alberdi <ion.alberdi@pricemoov.com> wrote:
Hello to all,

One of the use cases we need to implement requires
storing and querying thousands (and more as the product grows) of CSV files
that have different schemas (by schema we mean the column names and their types).

These CSVs would then need to be maintained with operations like the following (sketched in SQL right after this list):
- add column,
- add row,
- delete row,
- read: filter/sort/paginate,
- write: edit column values.
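
For illustration, assuming one imported file lands in its own table (a hypothetical csv_42 with sku and price columns), these operations map onto plain SQL along these lines:

  -- hypothetical per-file table; all names are illustrative only
  ALTER TABLE csv_42 ADD COLUMN discount numeric;          -- add column
  INSERT INTO csv_42 (sku, price) VALUES ('A-1', 9.99);    -- add row
  DELETE FROM csv_42 WHERE sku = 'A-1';                    -- delete row
  SELECT * FROM csv_42                                     -- read: filter/sort/paginate
  WHERE price > 5
  ORDER BY price DESC
  LIMIT 50 OFFSET 0;
  UPDATE csv_42 SET price = 10.49 WHERE sku = 'A-1';       -- write: edit column values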

Let's assume that we store the definition of each schema in a dedicated table,
with the schema defined in a json column. With this schema we'll be able to translate the read/write/update operations on these imported CSV files into the corresponding SQL queries.
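
A minimal sketch of such a registry (hypothetical table and column names, and one possible shape for the json column):

  -- one row per imported CSV; the columns field drives query generation
  CREATE TABLE csv_schema (
      id          bigserial PRIMARY KEY,
      name        text NOT NULL,
      -- e.g. [{"name": "sku", "type": "text"}, {"name": "price", "type": "numeric"}]
      columns     jsonb NOT NULL,
      created_at  timestamptz NOT NULL DEFAULT now()
  );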

The remaining question is how to store the data of each file in the DB.

As suggested by https://www.postgresql.org/docs/10/sql-copy.html there is a way to import a CSV into its own table. Using this approach for each CSV (a sketch follows the pro/con list below), we see:

Pros:
- All PostgreSQL types available: https://www.postgresql.org/docs/9.5/datatype.html,
- Constraints on columns, among others uniqueness constraints,
  which let the DB guarantee that rows will not be duplicated (relevant to the add-row use case),
- Debuggability: standard SQL can be used to browse the CSV data,
- Existing libraries can be reused to generate dynamic SQL queries [1].

Cons:
- Need as many tables as there are distinct schemas.
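
As a sketch of that per-table import (hypothetical names and paths), each file would get a table generated from its registry entry and then be loaded with COPY:

  -- table generated from the schema registry entry for file 42
  CREATE TABLE csv_42 (
      sku    text,
      price  numeric,
      UNIQUE (sku)        -- uniqueness constraint taken from the schema definition
  );
  -- server-side load; from a client, psql's \copy works the same way
  COPY csv_42 FROM '/data/imports/42.csv' WITH (FORMAT csv, HEADER true);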

Another solution could consist of implementing a document store in PostgreSQL,
by storing all the columns of a row in a single jsonb column (sketched after the pro/con list below).

Pros:
- Single table to store all the different imported CSVs.

Cons:
- Fewer types available: https://www.postgresql.org/docs/9.4/datatype-json.html,
- No constraints on columns (no uniqueness or data-validation constraints;
these would have to be delegated to the application),
- Ramp-up on the json* functions (and I wonder whether there are libraries
to safely generate dynamic SQL queries on json columns),
(- Debuggability: this is not such a big con, as json_to_record enables
going back to a standard SQL experience.)
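
A sketch of that single-table variant (hypothetical names), including the *_to_record escape hatch back to a tabular view:

  -- one shared table for all imported CSVs
  CREATE TABLE csv_row (
      schema_id  bigint NOT NULL,     -- points at the schema registry entry
      doc        jsonb  NOT NULL
  );

  INSERT INTO csv_row (schema_id, doc)
  VALUES (42, '{"sku": "A-1", "price": 9.99}');

  -- back to standard SQL for one schema (column list taken from the registry)
  SELECT r.sku, r.price
  FROM csv_row c,
       jsonb_to_record(c.doc) AS r(sku text, price numeric)
  WHERE c.schema_id = 42;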

Based on this first pro/con list, we're wondering about the scalability
limits faced by PostgreSQL instances as the number of tables in a given DB grows.

Browsing the web, we saw two main issues:
- One related to the OS: "you may see some performance degradation associated
  with databases containing many tables. PostgreSQL may use a large number of
  files for storing the table data, and performance may suffer if the operating
  system does not cope well with many files in a single directory." [1]
- Related to that, the fact that some operations like autovacuum are O(N) in the number of tables [3]
  (a quick way to count relations is sketched after this list).
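
For instance, a rough way to see how many relations (and hence roughly how many on-disk files, since each table, index and TOAST table gets at least one) a database has accumulated:

  SELECT relkind, count(*)
  FROM pg_class
  GROUP BY relkind
  ORDER BY count(*) DESC;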

On the other hand, reading timescaledb's architecture https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks
("Each chunk is implemented using a standard database table."),
it seems that their platform took this direction, which may demonstrate the scalability of such an approach.

My question is thus the following:
how many such tables can a single PostgreSQL instance handle without trouble [4]?

Any challenge or addition to the pro/con list described above would be very welcome too.

 
No matter what answer the community gives you about a reasonable number of tables per DB, if your project is successful you'll probably exceed that limit eventually. Why not plan for federation at the start (a little, at low cost) by including the PG server URL and DB name where the CSV is stored in your CSV table schema store? That way, for now, you just store CSVs in the current PG server/DB, and should it get overwhelmed, it's relatively easy to point accessors to a different server and/or DB in the future for some CSV resources. The main upfront cost is that you'll need to solve credential management for the various PG servers. If you're in AWS, "Secrets Manager" would work - but there are lots of equivalent solutions out there.
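
Concretely, that could be as little as a couple of extra columns on the schema registry (a hedged sketch, reusing the hypothetical csv_schema table from above):

  -- where each CSV's data actually lives; defaults to the current server/DB
  ALTER TABLE csv_schema
      ADD COLUMN pg_server_url text NOT NULL DEFAULT 'postgresql://localhost:5432',
      ADD COLUMN db_name       text NOT NULL DEFAULT current_database(),
      ADD COLUMN table_name    text;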

FWIW, I think your analysis of the pros and cons of tables vs documents is excellent but slightly incomplete. In my experience with document DBs, you only postpone all the downsides in order to get the immediate benefits (kind of like a "sugar high"). Eventually you have to solve everything you would solve with the table-based approach - you just don't have to solve it upfront. And, at least on the project where I got bitten by a document-DB architecture, many of those problems are a lot harder to solve later in the project, so it's better to build with structured tables up front for a project with meaningful structure and lots of data.


I don't think you need a "federated" Postgres network like Citus at all - I think that solves a different use case. For your design problem, having a bunch of independent Pg servers would be fine - as long as you don't need to run searches across CSV tables stored in different databases (in which case you do need index/search federation of some kind).

Regarding Erik Brandsberg's point about XFS, I think this is a useful alternative approach, if I understand the idea. Instead of storing your CSV files in Postgres, just store them as CSV files on the file system. You can still store the schemas in Pg, but each schema would simply point to a file in the file system, and you'd manipulate the files using whatever language is appropriate (I find Ruby to be excellent for managing CSV files). If you need to index those files to run searches against them, I'd direct your attention to https://prestodb.io/, which is the core technology behind Amazon Athena. It allows you to search CSV files with various schemas (among other data bindings). So you might end up with Pg as your schema storage, XFS (or any modern FS) to store large numbers of CSV files, Presto/Athena to index/search those files, and some CSV management language (like Ruby or something even higher level) to manage the data.
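
As a rough illustration of that last piece (hypothetical bucket and column names; Athena-style Hive DDL, for which Presto's Hive connector has an equivalent CREATE TABLE ... WITH (...) form):

  -- one external table per CSV prefix; the data itself stays in the file store
  CREATE EXTERNAL TABLE file_42 (
      sku    string,
      price  string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://my-bucket/csv/42/'
  TBLPROPERTIES ('skip.header.line.count' = '1');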

I think if I were dealing with fewer than 10k CSV files (and therefore Pg tables), I might use Pg, and if I were dealing with 10k+ files, I'd start looking at file systems + Presto. But that's a WAG.

Steve

 

