Thread: Modifying COPY TO
<p><font face="Arial" size="2">I am interested in hacking COPY TO such that one can specify that</font><br /><font face="Arial"size="2">rows are copied in a certain index order. I got as far as </font><br /><font face="Arial" size="2">src/backend/commands/copy.c:CopyTo(),and it looks like I would need</font><br /><font face="Arial" size="2">to modifythe call to heap_beginscan() so that it uses a key. However,</font><br /><font face="Arial" size="2">I couldn't figureout how to provide one, or if I'm even looking at the</font><br /><font face="Arial" size="2">right area. Ideally,this behavior would be specified with a flag,</font><br /><font face="Arial" size="2">perhaps: "WITH INDEX <index_name>"or "WITH PRIMARY KEY"</font><br /><font face="Arial" size="2">or something similar.</font><p><font face="Arial"size="2">The motivation for this change is as follows. I have a fairly large</font><br /><font face="Arial"size="2">database (10 million+ records) that mirrors the data in a proprietary</font><br /><font face="Arial"size="2">system. The only access to that data is through exported flat files.</font><br /><font face="Arial"size="2">Currently, those flat files are copied directly into a staging area in the </font><br /><font face="Arial"size="2">db via a COPY FROM, the actual tables are truncated, and the</font><br /><font face="Arial" size="2">stagingdata is inserted into the live tables. Since the data is read-only,</font><br /><font face="Arial" size="2">itdoesn't matter that it is recreated every day. However, as you</font><br /><font face="Arial" size="2">can imagine,the import process takes quite a while (several hours).</font><br /><font face="Arial" size="2">Also, rebuildingthe db from scratch every day loses any statistical</font><br /><font face="Arial" size="2">information gatheredfrom the execution of queries during the day.</font><p><font face="Arial" size="2">A possibility that I would liketo pursue is to keep the staging data</font><br /><font face="Arial" size="2">from the previous day, do a COPY TO, importthe new data into</font><br /><font face="Arial" size="2">another staging table with a COPY FROM, then export the fresh</font><br/><font face="Arial" size="2">data with another COPY TO. Then, I can write a fast C/C++</font><br /><fontface="Arial" size="2">program to do a line-by-line comparison of each record, isolating</font><br /><font face="Arial"size="2">the ones that have changed from the previous day. I can then</font><br /><font face="Arial" size="2">emitthose records in a change file that should be relatively small</font><br /><font face="Arial" size="2">and easyto update. Of course, this scheme can only work if</font><br /><font face="Arial" size="2">COPY TO emits the recordsin a reliable order.</font><p><font face="Arial" size="2">Any assistance on this project would be greatly appreciated. The</font><br /><font face="Arial" size="2">best I can see, I'm stuck on line 1053 from copy.c:</font><p><fontface="Arial" size="2"> scandesc = heap_beginscan(rel, mySnapshot, 0, NULL);</font><p><font face="Arial"size="2">I suspect that I want it to look like this:</font><p><font face="Arial" size="2"> scandesc = heap_beginscan(rel,mySnapshot, 1, key);</font><p><font face="Arial" size="2">where 'key' is an appropriately constructedScanKey. It looks</font><br /><font face="Arial" size="2">like I want to call ScanKeyEntryInitialize(), but I'mnot sure what</font><br /><font face="Arial" size="2">parameters I need to pass to it to get an index or the primary</font><br/><font face="Arial" size="2">key. I mostly need help building the ScanKey object. I think I </font><br/><font face="Arial" size="2">can figure out how to hack the custom option, etc. I should </font><br /><font face="Arial"size="2">mention that I am using the 7.4.7 codebase on Linux 2.4.</font><p><font face="Courier New" size="2">__</font><br/><font face="Courier New" size="2">David B. Held</font><br /><font face="Courier New" size="2">SoftwareEngineer/Array Services Group</font><br /><font face="Courier New" size="2">200 14th Ave. East, Sartell,MN 56377</font><br /><font face="Courier New" size="2">320.534.3637 320.253.7800 800.752.8129</font><br />
"Dave Held" <dave.held@arrayservicesgrp.com> writes: > I am interested in hacking COPY TO such that one can specify that > rows are copied in a certain index order. I got as far as=20 > src/backend/commands/copy.c:CopyTo(), and it looks like I would need > to modify the call to heap_beginscan() so that it uses a key. Actually you'd need to change it to be an index_beginscan call. Offhand I don't think you need a scan key; the point of a scan key is to filter the returned rows, and it doesn't sound like that's what you want. So it should work to do scandesc = index_beginscan(rel, idx, mySnapshot, 0, NULL); (plus appropriate changes to the usage of the scandesc) which leaves only the problem of opening the proper index. regards, tom lane
* Dave Held (dave.held@arrayservicesgrp.com) wrote: > A possibility that I would like to pursue is to keep the staging data > from the previous day, do a COPY TO, import the new data into > another staging table with a COPY FROM, then export the fresh > data with another COPY TO. Then, I can write a fast C/C++ > program to do a line-by-line comparison of each record, isolating > the ones that have changed from the previous day. I can then > emit those records in a change file that should be relatively small > and easy to update. Of course, this scheme can only work if > COPY TO emits the records in a reliable order. sort -u old new > changes ? Stephen
Instead of just being able to specify an index to use, how hard would it be to allow COPY TO to work from a generic query instead of just a table? That way you wouldn't need to do a diff, you could just export only new records. ISTM there's far more use for copying from a query result that just copying by index order. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote: > Instead of just being able to specify an index to use, how hard would it > be to allow COPY TO to work from a generic query instead of just a > table? That way you wouldn't need to do a diff, you could just export > only new records. ISTM there's far more use for copying from a query > result that just copying by index order. The reason we don't support queries with COPY is because COPY reads from the raw heap rather than going through the executor. We could special-case it so queries go through the executor but would take a little work. The only trick I can think of for now is to use SELECT ... INTO TEMPORARY tab ... oRDER BY and then use COPY to dump the table. It will then dump in the order of the ORDER BY. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On 2005-02-25, "Dave Held" <dave.held@arrayservicesgrp.com> wrote: > A possibility that I would like to pursue is to keep the staging data > from the previous day, do a COPY TO, import the new data into > another staging table with a COPY FROM, then export the fresh > data with another COPY TO. Then, I can write a fast C/C++ > program to do a line-by-line comparison of each record, isolating > the ones that have changed from the previous day. I can then > emit those records in a change file that should be relatively small > and easy to update. I have an application that does something like this, but rather than use an external program, I do the comparison in the database itself: - import data from external system into a temporary table - compare the temporary table against the live data (a full outerjoin is a convenient way of doing this - I create an index on the temp table first) - perform insert/update/deletefor each record that was added, changed or removed In my case the compare/update is in a pl/pgsql function. My data is only 2-3 million rows, a bit smaller than yours, but I have to update hourly, not daily, and spend no more than 5-10 minutes on each update (currently I can do it in 5: 2 to load the data, 3 to do the compare/update). -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
"Jim C. Nasby" <decibel@decibel.org> writes: > ... ISTM there's far more use for copying from a query > result that just copying by index order. Yeah. The other point is that it's entirely likely that an external sort will be faster than using an indexscan to produce the sorted order. If you instead create a command like COPY FROM (SELECT whatever FROM foo ORDER BY something) then you give the optimizer a chance at deciding what to do ... not that I promise it will get it right, but a fixed choice is certain to be wrong. regards, tom lane