Storing thousands of csv files in postgresql - Mailing list pgsql-sql

From Ion Alberdi
Subject Storing thousands of csv files in postgresql
Date
Msg-id CANbgw4BfmAUsMTsh0z2Y-97jusZObaC+Ez7Mm=papd-WUDd89A@mail.gmail.com
Responses Re: Storing thousands of csv files in postgresql  (Steve Midgley <science@misuse.org>)
Re: Storing thousands of csv files in postgresql  (Erik Brandsberg <erik@heimdalldata.com>)
List pgsql-sql
Hello to all,

One of the use cases we need to implement requires
storing and querying thousands (and more as the product grows) of CSV files
that each have a different schema (by schema we mean the column names and their types).

These CSVs would then need to be maintained with operations like:
- add column,
- add row,
- delete row,
- read: filter/sort/paginate,
- write: edit column values.

Let's assume that we store the definition of each schema in a dedicated table,
with the schema defined in a json column. With this schema we'll be able to translate the read/write/update queries on these imported CSV files into the related SQL queries.
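To make this concrete, here is a minimal sketch of what we have in mind; the table name, column names and json layout are only assumptions for illustration:

    -- Hypothetical schema registry: one row per imported CSV file.
    CREATE TABLE csv_schema (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        file_name text NOT NULL,
        -- e.g. [{"name": "amount", "type": "numeric"}, {"name": "created", "type": "date"}]
        columns   jsonb NOT NULL
    );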

The remaining question is how to store the data of each file in the DB.

As suggested by https://www.postgresql.org/docs/10/sql-copy.html, each CSV can be imported into its own table (a sketch follows the pro/con list below). By using this approach for each CSV we see:

Pros:
- All PostgreSQL types are available: https://www.postgresql.org/docs/9.5/datatype.html,
- Constraints on columns, among others uniqueness constraints,
  which make the DB guarantee that rows will not be duplicated (relevant to the add-row use case),
- Debuggability: standard SQL can be used to browse the CSV data,
- Existing libraries can be reused to generate dynamic SQL queries [1].

Cons:
- Need to have as many tables as there are different schemas.
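Here is the announced sketch of the table-per-file approach; the table name, columns and file path are made up for the example:

    -- Hypothetical table for one imported CSV (say, the file registered with id 42).
    CREATE TABLE csv_42 (
        name    text NOT NULL,
        amount  numeric,
        created date,
        UNIQUE (name, created)   -- example of the uniqueness constraint mentioned above
    );

    -- Server-side import; from a client (the only option on RDS) psql's
    -- "\copy csv_42 FROM 'file_42.csv' WITH (FORMAT csv, HEADER)" does the same.
    COPY csv_42 (name, amount, created)
    FROM '/path/to/file_42.csv'
    WITH (FORMAT csv, HEADER);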

Another solution could consist of implementing a document store in PostgreSQL,
by storing all the columns of a row in a single jsonb column.
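Again as a minimal sketch (names are assumptions), with one row per imported CSV line, tagged with the file it belongs to:

    CREATE TABLE csv_row (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        file_id bigint NOT NULL,   -- points to the schema registry sketched above
        data    jsonb NOT NULL
    );

    -- read: filter/sort/paginate one file's rows through the jsonb column
    SELECT data
    FROM csv_row
    WHERE file_id = 42
      AND data->>'name' = 'foo'
    ORDER BY (data->>'created')::date
    LIMIT 50;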

Pros:
- A single table stores all the different imported CSVs.

Cons:
- Fewer types available: https://www.postgresql.org/docs/9.4/datatype-json.html,
- No constraints on columns (no uniqueness or data validation constraints;
  these have to be delegated to the application),
- Ramp-up on the json* functions (and I wonder whether there are libraries
  to safely generate dynamic SQL queries on json columns),
(- Debuggability: this is not such a big con, as json_to_record enables
  going back to a standard SQL experience.)
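On that last point, something like the following would give back a typed, tabular view (the column names and types are again assumptions; on a jsonb column the function is jsonb_to_record rather than json_to_record):

    SELECT r.*
    FROM csv_row t,
         jsonb_to_record(t.data) AS r(name text, amount numeric, created date)
    WHERE t.file_id = 42;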

Based on this first pro/con list, we're wondering about the scalability
limits that PostgreSQL instances face as the number of tables in a given DB grows.

Browsing the web, we saw two main issues:
- One related to the OS: "you may see some performance degradation associated
  with databases containing many tables. PostgreSQL may use a large number of
  files for storing the table data, and performance may suffer if the operating
  system does not cope well with many files in a single directory." [2]
- Related to that, the fact that some operations like autovacuum are O(N) in the number of tables [3].
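If it helps the discussion, we would keep an eye on something like the following on our instances (each user table also carries indexes and possibly a TOAST table, i.e. more files on disk):

    -- rough count of the relations (and thus data files) an instance has to manage
    SELECT count(*)
    FROM pg_class
    WHERE relkind IN ('r', 'i', 't');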

On the other hand, reading TimescaleDB's architecture documentation, https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks:
"Each chunk is implemented using a standard database table."
it seems that their platform took this direction, which may demonstrate that such an approach can scale.

My question is thus the following:
how many such tables can a single PostgreSQL instance handle without trouble [4]?

Any challenge/addition to the pro/con list described above would be very welcome too.

Best regards,
Ion

[1]: Like https://www.psycopg.org/docs/sql.html
[2]: https://link.springer.com/content/pdf/bbm%3A978-1-4302-0018-5%2F1.pdf
[3]: https://stackoverflow.com/questions/22395883/postgresql-what-is-the-maximum-number-of-tables-can-store-in-postgresql-databas
[4]: We use RDS instances in AWS
