Is there a way to speed up Spark ingestion (postgres -> spark -> snowflake) that avoids issues such as psql's alignment overhead or an inefficient JSON file format? I favor spark since it simplifies the ingest path of postgres -> local json.gz -> s3 -> snowflake stage -> snowflake table. Hope this makes sense.
On Wed, 31 May 2023 at 16:26, Lian Jiang <jiangok2006@gmail.com> wrote:
> I am using psql to periodically dump the postgres tables into json files which are then imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table does not seem to work either: the postgres read-only replica is the bottleneck, so the spark cluster never uses more than one worker node, and that worker either times out or runs out of memory.
>
> Will vertically scaling the postgres db speed up psql? Or is there any thread-related parameter of psql that can help? Thanks for any hints.
This is probably slow due to psql's column alignment. psql needs to read the entire result before it can output anything, so that it knows how much whitespace to pad each column with. Naturally, that requires quite a bit of memory when large tables are queried, and also a lot of length checking.
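If you did want to keep using a plain SELECT through psql, switching off alignment and fetching in batches would sidestep most of that. A rough sketch, where the table, column and database names are just placeholders:

psql -At -v FETCH_COUNT=10000 -c "select col1,col2 from your_table" <database name> > your_table.txt

Here -A disables the aligned output, -t drops the header and row-count footer, and FETCH_COUNT makes psql pull the result through a cursor in batches instead of buffering every row in memory first.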
As mentioned by Rob, you'll probably want to use COPY, or you could look at using pg_dump. Something like:
pg_dump --inserts --rows-per-insert=100 --table=<name of your table to export> --data-only <database name>
should give you something that you might have some luck importing directly into Snowflake without needing any transformation tool.
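Since pg_dump writes plain-format output to stdout when no output file is given, you could also pipe it straight into gzip to keep your existing compress-then-stage flow. The output file name here is just an example:

pg_dump --inserts --rows-per-insert=100 --table=<name of your table to export> --data-only <database name> | gzip > your_table.sql.gz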
However, if you do need to do some transformation before loading, then you might want to do that in PostgreSQL. For that, something like:
psql -c "copy (select col1,col2,col3+col4 from your_table) to stdout" <name of database>
would allow you to run a query in which you could do your transformations before importing into Snowflake.
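If it helps to keep the rest of your pipeline unchanged, the same style of COPY can also produce compressed CSV ready for an S3 stage. A sketch, assuming you're happy loading CSV rather than JSON into the Snowflake stage (bucket and file names are made up):

psql -c "copy (select col1,col2,col3+col4 from your_table) to stdout with (format csv, header)" <name of database> | gzip > your_table.csv.gz
aws s3 cp your_table.csv.gz s3://your-bucket/stage/your_table.csv.gz

COPY ... TO STDOUT streams rows as the server produces them, so it avoids the buffering that aligned psql output needs, and Snowflake's COPY INTO can read gzip-compressed CSV from the stage directly.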