Re: Storing thousands of csv files in postgresql - Mailing list pgsql-sql

From Ion Alberdi
Subject Re: Storing thousands of csv files in postgresql
Date
Msg-id CANbgw4BVLcWUefih_mes6xwnojzWVHtd-iUnuhYOAW08+yiEDw@mail.gmail.com
In response to Re: Storing thousands of csv files in postgresql  (Rob Sargent <robjsargent@gmail.com>)
List pgsql-sql
>I think the add/remove column requirement alone justifies NOT using files. The CSV approach will tempt the system to handle some versioning nonsense. Using tables also provides some protection against the inevitable garbage data in the CSVs.

Indeed, in addition to that, all the features related to types/constraints and links to other tables* available in PostgreSQL would be lost (and would need to be re-implemented in the app layer).
Thanks Rob!

*: among the pros, I forgot to mention "on delete cascade" on foreign keys, which is very welcome too.
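To make the lost features concrete, here is a minimal sketch (table and column names are invented, just to show the kind of checks that would otherwise move into the app layer):

-- One imported CSV modelled as a table: typed columns, a check
-- constraint, and a foreign key with ON DELETE CASCADE.
CREATE TABLE dataset (
    id   serial PRIMARY KEY,
    name text   NOT NULL UNIQUE
);

CREATE TABLE dataset_row (
    dataset_id  integer NOT NULL
                REFERENCES dataset (id) ON DELETE CASCADE, -- rows disappear with their dataset
    recorded_at date    NOT NULL,                          -- typed, not free-form text
    amount      numeric CHECK (amount >= 0)                -- garbage values rejected at insert time
);

-- The add/remove column requirement is plain DDL:
ALTER TABLE dataset_row ADD COLUMN comment text;
ALTER TABLE dataset_row DROP COLUMN comment;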

On Tue, Feb 15, 2022 at 22:13, Rob Sargent <robjsargent@gmail.com> wrote:

I don't think you need a "federated" postgres network like Citus at all - I think this solves a different use case. For your design problem, I think that having a bunch of independent Pg servers would be fine - as long as you don't need to run searches across CSV tables stored across different databases (in which case you do need index/search federation of some kind).

Regarding Erik Brandsberg's point about XFS, I think this is a useful alternative approach, if I understand the idea. Instead of storing your CSV files in Postgres, just store them as CSV files on the file system. You can still store the schemas in Pg, but each schema would just point to a file in the file system, and you'd manipulate the files in the filesystem using whatever language is appropriate (I find Ruby to be excellent for managing CSV files). If you need to index those files to run searches against them, I'd direct your attention to https://prestodb.io/, which is the core technology that runs Amazon Athena. It allows you to search CSV files with various schemas (among other data bindings). So you might end up with Pg as your schema storage, XFS (or any modern FS) to store large numbers of CSV files, Presto/Athena to index/search those files, and some CSV management language (like Ruby or something even higher level) to manage the data.
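To make that concrete, a rough sketch (table names, the file path, and the Presto catalog/schema are invented; the Presto part assumes the Hive connector, whose CSV format stores all columns as varchar):

-- In Postgres: a registry table whose rows just point at files on the file system.
CREATE TABLE csv_file (
    id      serial PRIMARY KEY,
    path    text   NOT NULL UNIQUE, -- e.g. '/data/csv/customers_2022.csv'
    columns jsonb  NOT NULL         -- the user-declared column names/types
);

-- In Presto (Hive connector): an external table over a directory of CSVs,
-- so the files stay on the file system / object store and are only queried here.
CREATE TABLE hive.csv.customers_2022 (
    customer_name varchar,
    signed_up     varchar
)
WITH (
    format = 'CSV',
    external_location = 's3://my-bucket/csv/customers_2022/'
);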

I think if I were dealing with less than 10k CSV files (and therefore Pg tables), I might use Pg, and if I were dealing with 10k+ files, I'd start looking at file systems + Presto. But that's a WAG.

Steve

 

I think the add/remove column requirement alone justifies NOT using files. The CSV approach will tempt the system to handle some versioning nonsense. Using tables also provides some protection against the inevitable garbage data in the CSVs.
