Re: Storing thousands of csv files in postgresql - Mailing list pgsql-sql

From Steve Midgley
Subject Re: Storing thousands of csv files in postgresql
Date
Msg-id CAJexoSL-RADZAFFtW6NgGNg+SiomNwvcApWpogwuW3EsdGHJXA@mail.gmail.com
In response to Storing thousands of csv files in postgresql  (Ion Alberdi <ion.alberdi@pricemoov.com>)
Responses Re: Storing thousands of csv files in postgresql  (Ion Alberdi <ion.alberdi@pricemoov.com>)
List pgsql-sql


On Tue, Feb 15, 2022 at 12:15 AM Ion Alberdi <ion.alberdi@pricemoov.com> wrote:
Hello to all,

One of the use cases we need to implement requires
storing and querying thousands (and more as the product grows) of csv files
that have different schemas (by schema we mean column names and their types).

These csvs would then need to be maintained with operations like:
- add column,
- add row,
- delete row,
- read: filter/sort/paginate,
- write: edit column values.

Let's assume that we store the definition of each schema in a dedicated table,
with the schema defined in a json column. With this schema we'll be able to translate the read/write/update queries on these imported csv files into the related SQL queries.
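
For illustration, a minimal sketch of what such a registry table could look like (table and column names here are just examples, not a recommendation):

    -- registry of imported csv files and their declared schemas (illustrative)
    CREATE TABLE csv_schema (
        id         bigserial PRIMARY KEY,
        name       text NOT NULL UNIQUE,   -- logical name of the imported csv
        columns    jsonb NOT NULL,         -- e.g. [{"name": "price", "type": "numeric"}, ...]
        created_at timestamptz NOT NULL DEFAULT now()
    );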

The remaining question is how to store the data of each file in the DB.

As suggested by https://www.postgresql.org/docs/10/sql-copy.html, there is a way to import a csv into its own table (a sketch follows the pro/con list below). By using this approach for each csv we see:

Pros:
- All postgresql types available: https://www.postgresql.org/docs/9.5/datatype.html,
- Constraints on columns, among others uniqueness constraints,
  which make the DB guarantee that rows will not be duplicated (relevant to the add row use case),
- Debuggability: enables using standard SQL to browse csv data,
- Can reuse existing libraries to generate dynamic SQL queries [1]

Cons:
- Need to have as many tables as different schemas.
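
As a rough sketch of this first option (table, column and file names are invented for the example), each import boils down to one CREATE TABLE plus one COPY:

    -- one table per imported csv, typed according to its declared schema (illustrative)
    CREATE TABLE csv_1234 (
        sku   text,
        price numeric,
        qty   integer,
        UNIQUE (sku)   -- uniqueness constraint, relevant to the "add row" use case
    );

    -- server-side import; \copy from psql works too
    COPY csv_1234 (sku, price, qty)
    FROM '/path/to/file_1234.csv'
    WITH (FORMAT csv, HEADER true);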

Another solution could consist of implementing a document store in postgresql,
by storing all columns of a row in a single jsonb column.

Pros:
- Single table to store all the different imported csvs.

Cons:
- Fewer types available: https://www.postgresql.org/docs/9.4/datatype-json.html,
- No constraints on columns (uniqueness and data validation constraints
  would have to be delegated to the application),
- Ramp-up on the json* functions (and I wonder whether there are libraries
  to safely generate dynamic SQL queries on json columns),
(- Debuggability: this is not such a big con, as json_to_record enables
  going back to a standard SQL experience; see the sketch below.)
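
To make this second option concrete, here is a minimal sketch (table and column names are again just examples), using jsonb_to_record to get back to a typed SQL experience:

    -- single table holding the rows of every imported csv (illustrative)
    CREATE TABLE csv_row (
        csv_id bigint NOT NULL,   -- references the schema registry
        data   jsonb  NOT NULL
    );

    -- read the rows of one csv back with explicit column types
    SELECT r.sku, r.price, r.qty
    FROM csv_row c,
         jsonb_to_record(c.data) AS r(sku text, price numeric, qty integer)
    WHERE c.csv_id = 1234;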

Based on this first pro/con list, we're wondering about the scalability
limits faced by postgresql instances as the number of tables in a given DB grows.

Browsing the web, we saw two main issues:
- One related to the OS "you may see some performance degradation associated
  with databases containing many tables. PostgreSQL may use a large number of
  files for storing the table data, and performance may suffer if the operating
  system does not cope well with many files in a single directory." [2]
- Related to that, the fact that some operations like autovacuum are O(N) on the number of tables. [3]

On the other hand, reading about timescaledb's architecture, https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks:
"Each chunk is implemented using a standard database table."
it seems that their platform took such a direction, which may have proven the scalability of such an approach.

My question is thus the following:
how many such tables can a single postgresql instance handle without trouble [4]?

Any challenge/addition to the pro/cons list described above would be very welcome too.

 
Whatever answer the community gives you about a reasonable number of tables per DB, if your project is successful you'll probably exceed that limit eventually. Why not plan for federation from the start (a little, at low cost) by including the PG server URL and DB name where each CSV is stored in your CSV schema store? That way, for now, you just store CSVs in the current PG server/DB, and should it get overwhelmed, it's relatively easy to point accessors at a different server and/or DB for some CSV resources in the future. The main upfront cost is that you'll need to solve credential management for the various PG servers. If you're in AWS, "Secrets Manager" would work - but there are lots of equivalent solutions out there.
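
Concretely, that could be as little as two extra columns on the schema table you described (just a sketch; names and defaults are assumptions on my part):

    -- where each csv's data physically lives (illustrative; credentials kept
    -- in a secrets manager, not in this table)
    ALTER TABLE csv_schema
        ADD COLUMN server_url text NOT NULL DEFAULT 'postgres://localhost:5432',
        ADD COLUMN db_name    text NOT NULL DEFAULT current_database();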

FWIW, I think your analysis of the pros and cons of tables vs documents is excellent but slightly incomplete. In my experience with document DBs, you only postpone all the downsides in order to get the immediate benefits (kind of like a "sugar high"). Eventually you have to solve everything you would have solved with the table-based approach; you just don't have to solve it upfront. And, at least on the project where I got bit by a document db architecture, many of these problems are a lot harder to solve later in a project, so for a project with meaningful structure and lots of data it's better to build on structured tables up front.

Steve
