Thread: SQL Diff?
Hi List;

I have a very large table (52 million rows). I'm creating a copy of it to rid
it of 35G worth of dead space; then I'll do a sync, drop the original table,
and rename table2.

Once I have table2 as a copy of table1, what's the best way to select all rows
that have been changed or modified in table1 since the initial load from
table1 into table2?

Also, I'll need to delete any rows in table2 that no longer remain in table1.

There is no change-date column. I could do something like

    select * from table1
    where col1 || col2 || col3 etc
      not in (select col1 || col2 || col3 etc from table2)

but this would be inefficient and slow.

Anyone have a suggestion to do this in an efficient manner?

Thanks in advance
/Kevin
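For concreteness, the naive approach sketched above would look roughly like the
following (a sketch only: col1..col3 are placeholders, and concatenating
columns both defeats any indexes and misbehaves when a column is NULL, since
the whole concatenation becomes NULL and NOT IN never matches it):

    -- rows in table1 whose concatenated values don't appear in table2
    SELECT *
    FROM table1 t1
    WHERE t1.col1 || t1.col2 || t1.col3
          NOT IN (SELECT t2.col1 || t2.col2 || t2.col3 FROM table2 t2);

    -- rows still in table2 that no longer exist in table1 (same caveats)
    DELETE FROM table2
    WHERE col1 || col2 || col3
          NOT IN (SELECT col1 || col2 || col3 FROM table1);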
On 08/25/07 21:51, Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52 million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.

What is your definition of "dead space"?

Bad rows, duplicate rows, old rows? Something else?

> Once I have table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial load from
> table1 into table2?
>
> Also I'll need to delete any rows in table2 that no longer remain in table1.
>
> There is no change date column
> I could do something like select * from table1 where col1 || col2 || col3 etc
> not in (select col1 || col2 || col3 etc from table2)
>
> but this would be inefficient & slow.
>
> Anyone have a suggestion to do this in an efficient manner?
>
> Thanks in advance

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> On 08/25/07 21:51, Kevin Kempter wrote:
> > Hi List;
> >
> > I have a very large table (52 million rows) - I'm creating a copy of it to
> > rid it of 35G worth of dead space, then I'll do a sync, drop the original
> > table and rename table2.
>
> What is your definition of "dead space"?
>
> Bad rows, duplicate rows, old rows? Something else?

Deleted rows that should have been cleaned up with vacuum. The problem is the
client let it go so long that now I can't get a vacuum to finish because it
impacts the day-to-day operations too much. Long story; see my recent
questions on the performance list for more info.

> > Once I have table2 as a copy of table1 what's the best way to select
> > all rows that have been changed, modified in table1 since the initial
> > load from table1 into table2?
> >
> > Also I'll need to delete any rows in table2 that no longer remain in
> > table1.
> >
> > There is no change date column
> > I could do something like select * from table1 where col1 || col2 || col3
> > etc not in (select col1 || col2 || col3 etc from table2)
> >
> > but this would be inefficient & slow.
> >
> > Anyone have a suggestion to do this in an efficient manner?
> >
> > Thanks in advance
Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52 million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.
>
> Once I have table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial load from
> table1 into table2?
>
> Also I'll need to delete any rows in table2 that no longer remain in table1.

Set log_statement to 'mod' (or 'ddl' if you expect any changes to the schema
itself) in postgresql.conf. Then it's just a matter of parsing out the mods in
the log.
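A minimal illustration of that setting (the log_line_prefix line is an
assumption added for readability, not something the poster specified; note
that 'mod' logs all DDL plus the data-modifying statements, while 'ddl' logs
DDL only):

    # postgresql.conf
    log_statement = 'mod'          # logs all DDL plus INSERT, UPDATE, DELETE, TRUNCATE and COPY FROM
    log_line_prefix = '%t [%p] '   # (assumption) timestamp and pid make the log easier to replay in order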
On 08/25/07 22:21, Kevin Kempter wrote:
> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
>> On 08/25/07 21:51, Kevin Kempter wrote:
>>> Hi List;
>>>
>>> I have a very large table (52 million rows) - I'm creating a copy of it to
>>> rid it of 35G worth of dead space, then I'll do a sync, drop the original
>>> table and rename table2.
>>
>> What is your definition of "dead space"?
>>
>> Bad rows, duplicate rows, old rows? Something else?
>
> Deleted rows that should have been cleaned up with vacuum. The problem is the
> client let it go so long that now I can't get a vacuum to finish because it
> impacts the day-to-day operations too much. Long story; see my recent
> questions on the performance list for more info.

OK.

>>> Once I have table2 as a copy of table1 what's the best way to select
>>> all rows that have been changed, modified in table1 since the initial
>>> load from table1 into table2?

Is this a 24x7 database?

>>> Also I'll need to delete any rows in table2 that no longer remain in
>>> table1.
>>>
>>> There is no change date column
>>> I could do something like select * from table1 where col1 || col2 || col3
>>> etc not in (select col1 || col2 || col3 etc from table2)
>>>
>>> but this would be inefficient & slow.
>>>
>>> Anyone have a suggestion to do this in an efficient manner?
>>>
>>> Thanks in advance

--
Ron Johnson, Jr.
Jefferson LA USA
On 8/26/07, Kevin Kempter <kevin@kevinkempterllc.com> wrote:
> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> > On 08/25/07 21:51, Kevin Kempter wrote:
> > > Hi List;
> > >
> > > I have a very large table (52 million rows) - I'm creating a copy of it to
> > > rid it of 35G worth of dead space, then I'll do a sync, drop the original
> > > table and rename table2.
> >
> > What is your definition of "dead space"?
> >
> > Bad rows, duplicate rows, old rows? Something else?
>
> Deleted rows that should have been cleaned up with vacuum. The problem is the
> client let it go so long that now I can't get a vacuum to finish because it
> impacts the day-to-day operations too much. Long story; see my recent
> questions on the performance list for more info.

In your place I would do something like Slony-I does when it replicates
tables. Create insert/update/delete triggers on table1 which will log
operations on table1 to some table1_log table. Then copy table1 to table2.
Then replay table1_log on table2, then:

    BEGIN; LOCK table1; finish replaying the log; DROP TABLE table1;
    ALTER TABLE ... RENAME ...; COMMIT;

Or perhaps actually use Slony-I for the above steps? Should work quite
nicely... Or perhaps use SkyTools for it (I've never used it)?

Regards,
Dawid
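A minimal sketch of that trigger-based approach, under some assumptions not
stated in the thread: the table has a single-column primary key called id, the
table1_log layout below is invented for illustration, and plpgsql is already
installed in the database. Replaying then means applying each logged id
against table2 (delete the row, re-insert it from table1 if it still exists),
finishing the tail of the log under the LOCK before the drop and rename.

    -- log of what happens to table1 while table2 is being built
    CREATE TABLE table1_log (
        op        char(1)     NOT NULL,            -- 'I', 'U' or 'D'
        id        bigint      NOT NULL,            -- primary key of the affected row
        logged_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION table1_log_fn() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO table1_log (op, id) VALUES ('D', OLD.id);
            RETURN OLD;
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO table1_log (op, id) VALUES ('U', NEW.id);
            RETURN NEW;
        ELSE
            INSERT INTO table1_log (op, id) VALUES ('I', NEW.id);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER table1_log_trg
        AFTER INSERT OR UPDATE OR DELETE ON table1
        FOR EACH ROW EXECUTE PROCEDURE table1_log_fn();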
On Saturday 25 August 2007 23:49:39 Ron Johnson wrote:
> On 08/25/07 22:21, Kevin Kempter wrote:
> > On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> >> On 08/25/07 21:51, Kevin Kempter wrote:
> >>> Hi List;
> >>>
> >>> I have a very large table (52 million rows) - I'm creating a copy of it
> >>> to rid it of 35G worth of dead space, then I'll do a sync, drop the
> >>> original table and rename table2.
> >>
> >> What is your definition of "dead space"?
> >>
> >> Bad rows, duplicate rows, old rows? Something else?
> >
> > Deleted rows that should have been cleaned up with vacuum. The problem is
> > the client let it go so long that now I can't get a vacuum to finish
> > because it impacts the day-to-day operations too much. Long story; see my
> > recent questions on the performance list for more info.
>
> OK.
>
> >>> Once I have table2 as a copy of table1 what's the best way to
> >>> select all rows that have been changed, modified in table1 since the
> >>> initial load from table1 into table2?
>
> Is this a 24x7 database?

Yes, with little room for extra overhead.

> >>> Also I'll need to delete any rows in table2 that no longer remain in
> >>> table1.
> >>>
> >>> There is no change date column
> >>> I could do something like select * from table1 where col1 || col2 ||
> >>> col3 etc not in (select col1 || col2 || col3 etc from table2)
> >>>
> >>> but this would be inefficient & slow.
> >>>
> >>> Anyone have a suggestion to do this in an efficient manner?
> >>>
> >>> Thanks in advance
On Aug 26, 2007, at 9:02 AM, Dawid Kuroczko wrote:
> On 8/26/07, Kevin Kempter <kevin@kevinkempterllc.com> wrote:
>> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
>>> On 08/25/07 21:51, Kevin Kempter wrote:
>>>> Hi List;
>>>>
>>>> I have a very large table (52 million rows) - I'm creating a copy of it to
>>>> rid it of 35G worth of dead space, then I'll do a sync, drop the original
>>>> table and rename table2.
>>>
>>> What is your definition of "dead space"?
>>>
>>> Bad rows, duplicate rows, old rows? Something else?
>>
>> Deleted rows that should have been cleaned up with vacuum. The problem is
>> the client let it go so long that now I can't get a vacuum to finish
>> because it impacts the day-to-day operations too much. Long story; see my
>> recent questions on the performance list for more info.
>
> In your place I would do something like Slony-I does when it replicates
> tables. Create insert/update/delete triggers on table1 which will log
> operations on table1 to some table1_log table. Then copy table1 to table2.
> Then replay table1_log on table2, then:
>
>     BEGIN; LOCK table1; finish replaying the log; DROP TABLE table1;
>     ALTER TABLE ... RENAME ...; COMMIT;
>
> Or perhaps actually use Slony-I for the above steps? Should work quite
> nicely... Or perhaps use SkyTools for it (I've never used it)?

Yeah, for trigger-based replication it'd be simpler to just use Slony-I or
Skytools. However, if you're on 8.2, with row-wise comparisons, you could do
something like:

    begin;
    lock table1;
    insert into table2 select * from table1 where id not in (select id from table2);
    drop table table1;
    alter table table2 rename to table1;
    commit;

Here id is your primary key. Note that if your ids are generated by a sequence
you'll need to use setval on the sequence to get it "caught up" before that
commit or you'll get duplicate key errors immediately.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
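For the sequence catch-up mentioned above, something along these lines should
do it (the sequence name table1_id_seq is an assumption; use whatever sequence
actually feeds the id column):

    -- bring the sequence in line with the highest id already present,
    -- so inserts after the rename don't reuse keys that already exist
    SELECT setval('table1_id_seq', (SELECT max(id) FROM table1));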
Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52 million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.
>
> Once I have table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial load from
> table1 into table2?

I think you could get smart by having a few rules for insert/update/delete on
table1 that "keep track" of what happens during your work on table2.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
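A sketch of what such rules might look like, purely for illustration: the
table1_changes table, its columns, and the assumption of an id primary key are
not from the thread, and the trigger-based variant suggested earlier is
usually the more robust choice.

    CREATE TABLE table1_changes (
        op         char(1)     NOT NULL,
        id         bigint      NOT NULL,
        changed_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE RULE table1_track_ins AS ON INSERT TO table1
        DO ALSO INSERT INTO table1_changes (op, id) VALUES ('I', NEW.id);

    CREATE RULE table1_track_upd AS ON UPDATE TO table1
        DO ALSO INSERT INTO table1_changes (op, id) VALUES ('U', NEW.id);

    CREATE RULE table1_track_del AS ON DELETE TO table1
        DO ALSO INSERT INTO table1_changes (op, id) VALUES ('D', OLD.id);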