Re: Storing thousands of csv files in postgresql - Mailing list pgsql-sql

From Erik Brandsberg
Subject Re: Storing thousands of csv files in postgresql
Date
Msg-id CAFcck8EFY0R0hLHy8VPa7xYqre3EH_p13c0+dKb+cuzsqwWuvA@mail.gmail.com
In response to Re: Storing thousands of csv files in postgresql  (Ion Alberdi <ion.alberdi@pricemoov.com>)
List pgsql-sql
I was just about to point out in a follow-up that you may not have control over the filesystem if you are in, say, RDS. In that case, another question: is there a core set of columns that will be used for all tables, and then a variable set for each? It may make sense to use one table with a "table id" column and the common columns, then just use json storage for the variable columns. More information on the nature of the data may help elicit a better answer, however.
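A rough sketch of that layout (all names here are purely illustrative), with the common columns as ordinary typed columns and the per-csv extras in jsonb:

    -- common columns are typed and constrained as usual;
    -- csv-specific columns live in the jsonb "extra" document
    CREATE TABLE dataset_row (
        dataset_id  bigint      NOT NULL,   -- the "table id"
        row_id      bigserial,
        created_at  timestamptz NOT NULL DEFAULT now(),
        extra       jsonb       NOT NULL DEFAULT '{}',
        PRIMARY KEY (dataset_id, row_id)
    );

A GIN index on "extra" (CREATE INDEX ON dataset_row USING gin (extra)) would then help with containment-style filtering on the variable columns.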

On Tue, Feb 15, 2022 at 3:40 PM Ion Alberdi <ion.alberdi@pricemoov.com> wrote:
>"What filesystem is best suited to having many files in the same directory" and let the technology deal with the problem.  Postgres is just dependent on the filesystem for this behavior.  And to that answer, I believe, is XFS. 
Given that we use AWS RDS instances, I don't think we have the option to choose the filesystem (at least there is no such info at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/)
Still, we'll keep that in mind thanks Erik!



On Tue, Feb 15, 2022 at 9:11 PM Erik Brandsberg <erik@heimdalldata.com> wrote:
I'm going to challenge that: the question is not really one for Postgres. You should be asking "What filesystem is best suited to having many files in the same directory?" and let the technology deal with the problem. Postgres is just dependent on the filesystem for this behavior. And the answer to that, I believe, is XFS.

On Tue, Feb 15, 2022 at 3:15 AM Ion Alberdi <ion.alberdi@pricemoov.com> wrote:
Hello to all,

One of the use cases we need to implement requires
storing and querying thousands (and more as the product grows) of csv files
that have different schemas (by schema we mean column names and their types).

These csv files would then need to be maintained with operations like:
- add column,
- add row,
- delete row,
- read: filter/sort/paginate,
- write: edit column values.

Let's assume that we store the definition of each schema in a dedicated table,
with the schema defined in a json column. With this schema we'll be able to translate the read/write/update queries on these imported csv files into the related SQL queries.
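For instance, such a schema table could look roughly like this (a sketch with illustrative names):

    -- one row per imported csv, with its column names and types stored as json
    CREATE TABLE csv_schema (
        schema_id  bigserial PRIMARY KEY,
        name       text  NOT NULL,
        columns    jsonb NOT NULL  -- e.g. [{"name": "sku", "type": "text"}, {"name": "price", "type": "numeric"}]
    );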

The remaining question is how to store the data of each file in the DB.

As suggested by https://www.postgresql.org/docs/10/sql-copy.html, there is a way to import a csv into its own table (see the sketch after the pros/cons below). By using this approach for each csv we see:

Pros:
- All postgresql types available: https://www.postgresql.org/docs/9.5/datatype.html,
- Constraints on columns, among others uniqueness constraints,
  which make the DB guarantee that rows will not be duplicated (relevant to the add-row use case),
- Debuggability: enables using standard SQL to browse csv data,
- Can reuse existing libraries to generate dynamic SQL queries [1]

Cons:
- Need to have as many tables as different schemas.
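A minimal sketch of this per-csv approach (the table, columns and file path are hypothetical):

    -- one dedicated table per imported csv, typed according to its schema definition
    CREATE TABLE csv_1234 (
        sku    text UNIQUE,   -- uniqueness guaranteed by the DB
        price  numeric
    );

    -- server-side COPY shown here; from a client (e.g. against RDS) one would use
    -- psql's \copy or COPY ... FROM STDIN instead
    COPY csv_1234 (sku, price) FROM '/path/to/file_1234.csv' WITH (FORMAT csv, HEADER true);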

Another solution could consist of implementing a document store in postgresql,
by storing all columns of a row in a single jsonb column (see the sketch after the pros/cons below).

Pros:
- Single table to store all the different imported csv files.

Cons:
- Fewer types available: https://www.postgresql.org/docs/9.4/datatype-json.html,
- No constraints on columns (no uniqueness or data validation constraints;
  those would have to be delegated to the application),
- Ramp-up on the json* functions (and I wonder whether there are libraries
  to safely generate dynamic SQL queries on json columns),
(- Debuggability: this is not such a big con, as json_to_record enables
  going back to a standard SQL experience)
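A minimal sketch of this document-store variant (again with illustrative names), including how jsonb_to_record (the jsonb counterpart of json_to_record) brings back typed columns:

    -- a single table for all imported csv files; each row holds one csv row
    CREATE TABLE document_row (
        schema_id  bigint NOT NULL,   -- references the schema-definition table
        doc        jsonb  NOT NULL
    );

    -- read back with real column types, using the column list kept in the schema table
    SELECT r.sku, r.price
    FROM document_row d,
         jsonb_to_record(d.doc) AS r(sku text, price numeric)
    WHERE d.schema_id = 42;

Filtering or sorting can also be done directly on the jsonb column, e.g. ORDER BY (doc->>'price')::numeric.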

Based on this first pro/con list, we're wondering about the scalability
limits faced by postgresql instances getting more tables in a given DB.

Browsing the web, we saw two main issues:
- One related to the OS: "you may see some performance degradation associated
  with databases containing many tables. PostgreSQL may use a large number of
  files for storing the table data, and performance may suffer if the operating
  system does not cope well with many files in a single directory." [2]
- Related to that, the fact that some operations like autovacuum are O(N) on the number of tables [3]

On the other hand, reading timescaledb's architecture https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks
"Each chunk is implemented using a standard database table."
it seems that their platform took this direction, which may have proven that such an approach can scale.

My question is thus the following:
how many such tables can a single postgresql instance handle without trouble [4]?

Any challenge/addition to the pro/cons list described above would be very welcome too.

Best regards,
Ion

[1]: Like https://www.psycopg.org/docs/sql.html
[2]: https://link.springer.com/content/pdf/bbm%3A978-1-4302-0018-5%2F1.pdf
[3]: https://stackoverflow.com/questions/22395883/postgresql-what-is-the-maximum-number-of-tables-can-store-in-postgresql-databas
[4]: We use RDS instances in AWS
