Thread: customizing pg_dump together with copy.c's DoCopy function

customizing pg_dump together with copy.c's DoCopy function

From

"lynnsettle@yahoo.com"

Date:

11 July 2006, 10:55:18

My first ever newsgroup PostgreSQL question... I want to move data
between some very large databases (100+ gb) of different schema at our
customer sites. I cannot expect there to be much working partition
space, so the databases cannot exist simultaneously. I am also
restricted to hours, not days, to move the data.

I felt that pg_dump/pg_restore with the compressed format would do the
job for me. I was able to create a modified pg_dump program without any
difficulty. But I need to customize the speedy COPY FROM and COPY TO
commands to perform my necessary schema and data content changes. I
tried copying /src/backend/copy.c/h to my customized pg_dump project,
renamed them and their DoCopy function, and added it to my makefile.
This created conflicts between libpq-fe.h and libpq.h. For example:

postgresql-7.4.13/src/interfaces/libpq/libpq-fe.h:191: error:
conflicting types for `PQArgBlock'
postgresql-7.4.13/src/include/libpq/libpq.h:39: error: previous
declaration of `PQArgBlock'

Is it possible to compile-link together frontend pg_dump code with
backend code from copy.c? If not, how can I go about customizing
pg_dump to have low-level control over the speedy but inflexible COPY
TO/FROM commands? I already tried all this with regular sql calls and
it was unnacceptably far too slow.

thanks -Lynn

Re: customizing pg_dump together with copy.c's DoCopy function

From

"lynnsettle@yahoo.com"

Date:

11 July 2006, 10:55:23

After further reading, I'm wondering if I should instead try to use
libpq calls like PQgetCopyData, PQputCopyData, and PQputCopyEnd.

Would they be a viable alternative to provide both the speed and
flexibility I'm looking for?

-Lynn

Re: customizing pg_dump together with copy.c's DoCopy function

From

Tom Lane

Date:

11 July 2006, 11:17:50

"lynnsettle@yahoo.com" <lynnsettle@yahoo.com> writes:
> Is it possible to compile-link together frontend pg_dump code with
> backend code from copy.c?

No.  Why do you think you need to modify pg_dump at all?

            regards, tom lane

Re: customizing pg_dump together with copy.c's DoCopy function

From

"lynnsettle@yahoo.com"

Date:

11 July 2006, 19:35:12

> > Is it possible to compile-link together frontend pg_dump code with
> > backend code from copy.c?
>
> No.  Why do you think you need to modify pg_dump at all?
>

pg_dump and pg_restore provide important advantages for upgrading a
customer's database on site:

They are fast. I want to minimize downtime.
They allow compression. I often will have relatively little free disk
space to work with.
My concept is "customized dump", drop database, create new schema
database, "customized restore".

My upgrade requires many schema and data content changes. I've tried
using standard SQL statements in perl scripts to do all of it, but even
with no indexes on inserts, later creating indexes for the lookup work,
and every other optimization I know of, a 100gb database requires
several days to turn our old database into a new one. I was hoping that
I could modify the speedy pg_dump/pg_restore utilities to make these
changes "on the fly". It gets tricky because I have to restore some of
the data to different tables having varying schema and also change the
table linking. But this is all doable as long as I can "massage" the
SQL statements and data both when it goes into the dump file and when
it is getting restored back out.

Or am I trying to do the impossible?
-Lynn

Re: customizing pg_dump together with copy.c's DoCopy function

From

"Brian Mathis"

Date:

16 July 2006, 17:17:25

pg_dump by default dumps to STDOUT, which you should use in a pipeline to perform any modifications. To me this seems pretty tricky, but should be doable. Modifying pg_dump really strikes me as the wrong way to go about it. Pipelines operate in memory, and should be very fast, depending on how you write the filtering program. You would need to dump the data without compression, then compress it coming out the other end (maybe split it up too). Something like this:
pg_dump | myfilter | gzip | split --bytes=2000M - mydump.

Also, you can't expect to have speed if you have no disk space. Reading/writing to the same disk will kill you. If you could set up some temp space over NFS on the local network, that should gain you some speed.

On 11 Jul 2006 08:43:17 -0700, lynnsettle@yahoo.com <lynnsettle@yahoo.com > wrote:

> > Is it possible to compile-link together frontend pg_dump code with
> > backend code from copy.c?
>
> No. Why do you think you need to modify pg_dump at all?
>

pg_dump and pg_restore provide important advantages for upgrading a
customer's database on site:

They are fast. I want to minimize downtime.
They allow compression. I often will have relatively little free disk
space to work with.
My concept is "customized dump", drop database, create new schema
database, "customized restore".

My upgrade requires many schema and data content changes. I've tried
using standard SQL statements in perl scripts to do all of it, but even
with no indexes on inserts, later creating indexes for the lookup work,
and every other optimization I know of, a 100gb database requires
several days to turn our old database into a new one. I was hoping that
I could modify the speedy pg_dump/pg_restore utilities to make these
changes "on the fly". It gets tricky because I have to restore some of
the data to different tables having varying schema and also change the
table linking. But this is all doable as long as I can "massage" the
SQL statements and data both when it goes into the dump file and when
it is getting restored back out.

Or am I trying to do the impossible?
-Lynn

Re: customizing pg_dump together with copy.c's DoCopy function

From

"lynnsettle@yahoo.com"

Date:

20 July 2006, 16:54:27

Brian,

Those are very interesting ideas. Thanks. I've been playing around with
pg_dump. Modifying it to selectively dump/restore tables and columns is
pretty easy. But as you say, changing the data content within the data
buffers to reflect varying column values, changed column types, and new
columns seems tricky. I wonder if anyone knows of any example code
around somewhere...
-Lynn

"Brian Mathis" wrote:
> pg_dump by default dumps to STDOUT, which you should use in a pipeline to
> perform any modifications.  To me this seems pretty tricky, but should be
> doable.  Modifying pg_dump really strikes me as the wrong way to go about
> it.  Pipelines operate in memory, and should be very fast, depending on how
> you write the filtering program.  You would need to dump the data without
> compression, then compress it coming out the other end (maybe split it up
> too).  Something like this:
>     pg_dump | myfilter | gzip | split --bytes=2000M - mydump.
>
> Also, you can't expect to have speed if you have no disk space.
> Reading/writing to the same disk will kill you.  If you could set up some
> temp space over NFS on the local network, that should gain you some speed.
>
>