Thread: Modifying COPY TO

Modifying COPY TO

From
"Dave Held"
Date:
<p><font face="Arial" size="2">I am interested in hacking COPY TO such that one can specify that</font><br /><font
face="Arial"size="2">rows are copied in a certain index order.  I got as far as </font><br /><font face="Arial"
size="2">src/backend/commands/copy.c:CopyTo(),and it looks like I would need</font><br /><font face="Arial" size="2">to
modifythe call to heap_beginscan() so that it uses a key.  However,</font><br /><font face="Arial" size="2">I couldn't
figureout how to provide one, or if I'm even looking at the</font><br /><font face="Arial" size="2">right area. 
Ideally,this behavior would be specified with a flag,</font><br /><font face="Arial" size="2">perhaps: "WITH INDEX
<index_name>"or "WITH PRIMARY KEY"</font><br /><font face="Arial" size="2">or something similar.</font><p><font
face="Arial"size="2">The motivation for this change is as follows.  I have a fairly large</font><br /><font
face="Arial"size="2">database (10 million+ records) that mirrors the data in a proprietary</font><br /><font
face="Arial"size="2">system.  The only access to that data is through exported flat files.</font><br /><font
face="Arial"size="2">Currently, those flat files are copied directly into a staging area in the </font><br /><font
face="Arial"size="2">db via a COPY FROM, the actual tables are truncated, and the</font><br /><font face="Arial"
size="2">stagingdata is inserted into the live tables.  Since the data is read-only,</font><br /><font face="Arial"
size="2">itdoesn't matter that it is recreated every day.  However, as you</font><br /><font face="Arial" size="2">can
imagine,the import process takes quite a while (several hours).</font><br /><font face="Arial" size="2">Also,
rebuildingthe db from scratch every day loses any statistical</font><br /><font face="Arial" size="2">information
gatheredfrom the execution of queries during the day.</font><p><font face="Arial" size="2">A possibility that I would
liketo pursue is to keep the staging data</font><br /><font face="Arial" size="2">from the previous day, do a COPY TO,
importthe new data into</font><br /><font face="Arial" size="2">another staging table with a COPY FROM, then export the
fresh</font><br/><font face="Arial" size="2">data with another COPY TO.  Then, I can write a fast C/C++</font><br
/><fontface="Arial" size="2">program to do a line-by-line comparison of each record, isolating</font><br /><font
face="Arial"size="2">the ones that have changed from the previous day.  I can then</font><br /><font face="Arial"
size="2">emitthose records in a change file that should be relatively small</font><br /><font face="Arial" size="2">and
easyto update.  Of course, this scheme can only work if</font><br /><font face="Arial" size="2">COPY TO emits the
recordsin a reliable order.</font><p><font face="Arial" size="2">Any assistance on this project would be greatly
appreciated. The</font><br /><font face="Arial" size="2">best I can see, I'm stuck on line 1053 from
copy.c:</font><p><fontface="Arial" size="2">    scandesc = heap_beginscan(rel, mySnapshot, 0, NULL);</font><p><font
face="Arial"size="2">I suspect that I want it to look like this:</font><p><font face="Arial" size="2">    scandesc =
heap_beginscan(rel,mySnapshot, 1, key);</font><p><font face="Arial" size="2">where 'key' is an appropriately
constructedScanKey.  It looks</font><br /><font face="Arial" size="2">like I want to call ScanKeyEntryInitialize(), but
I'mnot sure what</font><br /><font face="Arial" size="2">parameters I need to pass to it to get an index or the
primary</font><br/><font face="Arial" size="2">key.  I mostly need help building the ScanKey object.  I think I
</font><br/><font face="Arial" size="2">can figure out how to hack the custom option, etc.  I should </font><br /><font
face="Arial"size="2">mention that I am using the 7.4.7 codebase on Linux 2.4.</font><p><font face="Courier New"
size="2">__</font><br/><font face="Courier New" size="2">David B. Held</font><br /><font face="Courier New"
size="2">SoftwareEngineer/Array Services Group</font><br /><font face="Courier New" size="2">200 14th Ave. East, 
Sartell,MN 56377</font><br /><font face="Courier New" size="2">320.534.3637 320.253.7800 800.752.8129</font><br /> 

Re: Modifying COPY TO

From
Tom Lane
Date:
"Dave Held" <dave.held@arrayservicesgrp.com> writes:
> I am interested in hacking COPY TO such that one can specify that
> rows are copied in a certain index order.  I got as far as=20
> src/backend/commands/copy.c:CopyTo(), and it looks like I would need
> to modify the call to heap_beginscan() so that it uses a key.

Actually you'd need to change it to be an index_beginscan call.

Offhand I don't think you need a scan key; the point of a scan key is to
filter the returned rows, and it doesn't sound like that's what you
want.  So it should work to do
scandesc = index_beginscan(rel, idx, mySnapshot, 0, NULL);

(plus appropriate changes to the usage of the scandesc)
which leaves only the problem of opening the proper index.
        regards, tom lane


Re: Modifying COPY TO

From
Stephen Frost
Date:
* Dave Held (dave.held@arrayservicesgrp.com) wrote:
> A possibility that I would like to pursue is to keep the staging data
> from the previous day, do a COPY TO, import the new data into
> another staging table with a COPY FROM, then export the fresh
> data with another COPY TO.  Then, I can write a fast C/C++
> program to do a line-by-line comparison of each record, isolating
> the ones that have changed from the previous day.  I can then
> emit those records in a change file that should be relatively small
> and easy to update.  Of course, this scheme can only work if
> COPY TO emits the records in a reliable order.

sort -u old new > changes

?
Stephen

Re: Modifying COPY TO

From
"Jim C. Nasby"
Date:
Instead of just being able to specify an index to use, how hard would it
be to allow COPY TO to work from a generic query instead of just a
table? That way you wouldn't need to do a diff, you could just export
only new records. ISTM there's far more use for copying from a query
result that just copying by index order.
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Modifying COPY TO

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> Instead of just being able to specify an index to use, how hard would it
> be to allow COPY TO to work from a generic query instead of just a
> table? That way you wouldn't need to do a diff, you could just export
> only new records. ISTM there's far more use for copying from a query
> result that just copying by index order.

The reason we don't support queries with COPY is because COPY reads from
the raw heap rather than going through the executor.  We could
special-case it so queries go through the executor but would take a
little work.

The only trick I can think of for now is to use SELECT ... INTO
TEMPORARY tab ... oRDER BY and then use COPY to dump the table.  It will
then dump in the order of the ORDER BY.


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Modifying COPY TO

From
Andrew - Supernews
Date:
On 2005-02-25, "Dave Held" <dave.held@arrayservicesgrp.com> wrote:
> A possibility that I would like to pursue is to keep the staging data
> from the previous day, do a COPY TO, import the new data into
> another staging table with a COPY FROM, then export the fresh
> data with another COPY TO.  Then, I can write a fast C/C++
> program to do a line-by-line comparison of each record, isolating
> the ones that have changed from the previous day.  I can then
> emit those records in a change file that should be relatively small
> and easy to update.

I have an application that does something like this, but rather than use an
external program, I do the comparison in the database itself:
 - import data from external system into a temporary table - compare the temporary table against the live data (a full
outerjoin   is a convenient way of doing this - I create an index on the temp table   first) - perform
insert/update/deletefor each record that was added, changed   or removed
 

In my case the compare/update is in a pl/pgsql function. My data is only
2-3 million rows, a bit smaller than yours, but I have to update hourly,
not daily, and spend no more than 5-10 minutes on each update (currently
I can do it in 5: 2 to load the data, 3 to do the compare/update).

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: Modifying COPY TO

From
Tom Lane
Date:
"Jim C. Nasby" <decibel@decibel.org> writes:
> ... ISTM there's far more use for copying from a query
> result that just copying by index order.

Yeah.  The other point is that it's entirely likely that an external
sort will be faster than using an indexscan to produce the sorted order.
If you instead create a command like COPY FROM (SELECT whatever FROM foo ORDER BY something)
then you give the optimizer a chance at deciding what to do ... not that
I promise it will get it right, but a fixed choice is certain to be
wrong.
        regards, tom lane