Re: Breaking up a PostgreSQL COPY command into chunks? - Mailing list pgsql-general

From Francisco Olarte
Subject Re: Breaking up a PostgreSQL COPY command into chunks?
Date
Msg-id CA+bJJbxo33FkPwBbifhDyRyH1+OYY+vkdcY6UmBWp1cCzeC9aQ@mail.gmail.com
In response to Re: Breaking up a PostgreSQL COPY command into chunks?  (Victor Hooi <victorhooi@yahoo.com>)
List pgsql-general
Hi Victor:

> You're right, our internet connection is going to be the limiting factor.
...

Good to know. So you have one restriction: your bandwidth.

> Essentially, the PostgreSQL server is in a datacentre, the server we're
> dumping to is in the office.
> Running a script on the PostgreSQL server in the datacentre is going to be
> tricky (not so much technically, just from a procedures/security point of
> view).

... another restriction, you cannot run code on the server.

An aside here: you know Postgres can do a server-side COPY, and even
pipe the results to a program, so if you have 600 MB of spare disk (or
about 150 MB and access to gzip), or access to an ssh CLIENT on the
server and an ssh SERVER in the office, you could transfer the file
easily. I.e., if you have a directory which you can access from the
server account, you could do

    COPY the_table_or_query TO '/the/dir/the/filename' WITH (FORMAT ...)

or

    COPY the_table_or_query TO PROGRAM
        'gzip -c > /the/dir/the/filename.gz' WITH (FORMAT ...)

and then transfer the appropriate file to the office at your leisure,
using whichever method you are used to.

In fact, if you have an ssh server in your office accessible from the
database server you could transfer it there directly, or use a
listening socket in your program and netcat, but in that case you'll
run into the same timeout problems (except that you can compress first
and so reduce the transfer time).
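
For illustration, the direct-to-office variant could look roughly like
this (the table name, host name and paths are made up, and it assumes
the postgres OS user on the database server has an ssh key authorized
on the office machine, plus the superuser rights COPY ... TO PROGRAM
requires):

    COPY (SELECT * FROM the_big_table)
        TO PROGRAM 'gzip -c | ssh office.example.com "cat > /data/dump.csv.gz"'
        WITH (FORMAT csv);

Whether that is acceptable is of course a procedures/security question
on your side.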

> Dumping to a spare table seems like an interesting point - so we'd just
> create the table, COPY the results to that table, then use LIMIT/OFFSET to
> paginate through that, then drop the table afterwards?

That is more or less it. You do not COPY, you just create the table
with the proper data, i.e., instead of "COPY (my query) ..." you do a
"CREATE TABLE tmpxxx AS SELECT ...".
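
Something along these lines, with made-up table names:

    CREATE TABLE dump_snapshot AS
        SELECT * FROM the_big_table;  -- or any SELECT producing the data you want

The statement runs entirely on the server and sends no rows to the
client, so the slow office link never enters the picture.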

> Currently, I'm doing a quick hack where we download an ordered list of the
> ids (auto-incrementing integer) into Python, chunk it up into groups of ids,
> then use a WHERE IN clause to download each chunk via COPY.

This has a problem: you need to wrap everything in a single
transaction to avoid the update / delete / insert in the middle
problem you commented on previously, so your transaction time is going
to be even bigger. Also, if the integers are auto-incrementing, you
can do better. First get the min and max (I do not remember whether
Postgres already optimizes those or whether you'll need the 'ORDER BY
... LIMIT 1' trick). Then just loop in appropriately sized steps (I
suppose the table is dense, but you can go to the stats, and you
should know your data patterns).
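
A rough sketch of that loop, assuming the key column is called id and
picking an arbitrary chunk size of 100000:

    SELECT min(id), max(id) FROM the_big_table;

    -- one chunk; the client shifts the range and repeats until past max(id)
    COPY (SELECT * FROM the_big_table
           WHERE id >= 0 AND id < 100000
           ORDER BY id)
        TO STDOUT WITH (FORMAT csv);

Each COPY is short, so if the connection drops you only lose, and
retry, the chunk in flight.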

> Would dumping to a spare table and paginating a better approach? Reasons?
> (Not challenging it, I just want to understand everything).

The table is created in a single operation in a single transaction,
which has no delay problems as it does not send any output to the
client.

From there on the table is read-only, so you can use a transaction for
each chunk, and you know the ids do not change. Even if a chunk copy
fails due to the internet connection farting a bit, you just retry it.

If you are going to use LIMIT/OFFSET you'll need an index on the field
anyway, but you can build the index in a separate transaction after
copying ( so you do not interfere with anyone ).
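
In SQL terms, roughly (again with placeholder names and a made-up page
size):

    CREATE INDEX ON dump_snapshot (id);

    -- one page; the client bumps the OFFSET for each request
    COPY (SELECT * FROM dump_snapshot
           ORDER BY id
           LIMIT 100000 OFFSET 200000)
        TO STDOUT WITH (FORMAT csv);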

You can even do everything in autocommit mode, saving server roundtrips.

Anyway, this can be done, but if you can do the server-side COPY and
ssh thing, that is much better. I do not know how to do it on other
OSs, but on unix, if you have an ssh-accessible account with enough
quota on the server, you could do something like: create a directory
in the user's home, give the server user adequate access to it, and
let the server dump the data there.

Francisco Olarte.

