Thread: Adding new and changed data

From: Hendrik Visage
Date:
Hi there,

Below is a snippet of the summary data (24-hour intervals, i.e. ~86400
seconds) and detail data (~300-second intervals) that gets generated
every day and then posted to a central server, which needs to load
these data into a (preferably PostgreSQL) database.

Now, the complication is that most of the older data (everything except
the first entry) is repeated across the files of different days, as the
extraction is a tad brain-dead/difficult to restrict to just the
previous day, and we prefer to keep those extra data just in case.

My problem now is importing the data in bulk while keeping the version
of each record that has the longest interval value (the fourth field in
the CSV below). Refer to the 03/29 entries of the *.gs files. The
*.gd files show the same pattern; there would, however, be a single
entry that might be shorter, though we'll also be adding a full day's
worth of extra ~5-minute-interval data.

If I set a unique key on (date, time, interval_length) (or
(secondsSince1970, interval_length)), I can reject the identical
entries, but then I'm left with duplicates on (date, time) (or
secondsSince1970) and would need to clean up afterwards with
duplicate-detection SQL; I'd prefer to prevent that at load time.

Any advice/ideas on the K.I.S.S. way to implement insert-or-update
instead of doing the select/delete search for duplicates?


(Columns: Date, Time, secondsSince1970, interval_length in seconds,
performance data...)
[visagehe@tsysl01 capacity]$ tail -n4  daily/dnba01-201203*gs
==> daily/dnba01-20120329.gs <==
03/26/2012,00:00:00,1332720000,86411.1,  4.10,  0.18, 53.36,         93,
03/27/2012,00:00:00,1332806400,86406.9,  4.23,  0.21, 53.63,         96,
03/28/2012,00:00:00,1332892800,86409.0,  4.15,  0.18, 53.94,         93,
03/29/2012,00:00:00,1332979200,57606.0,  4.17,  0.17, 54.19,         78,

==> daily/dnba01-20120330.gs <==
03/27/2012,00:00:00,1332806400,86406.9,  4.23,  0.21, 53.63,         96,
03/28/2012,00:00:00,1332892800,86409.0,  4.15,  0.18, 53.94,         93,
03/29/2012,00:00:00,1332979200,86409.5,  4.16,  0.20, 54.18,         93,
03/30/2012,00:00:00,1333065600, 300.4, 14.73,  5.22, 54.64,         79,
[visagehe@tsysl01 capacity]$

[visagehe@tsysl01 capacity]$ tail -n4  daily/dnba01-201203*gd
==> daily/dnba01-20120329.gd <==
03/29/2012,15:40:00,1333035600, 300.1,  4.35,  0.06, 54.12,         77,
03/29/2012,15:45:00,1333035900, 300.0,  3.62,  0.18, 54.14,         80,
03/29/2012,15:50:00,1333036200, 300.0,  3.67,  0.16, 54.17,         79,
03/29/2012,15:55:00,1333036500, 299.0,  5.27,  0.39, 54.54,         82,

==> daily/dnba01-20120330.gd <==
03/29/2012,23:45:00,1333064700, 300.1,  3.60,  0.03, 54.29,         73,
03/29/2012,23:50:00,1333065000, 300.0,  3.71,  0.04, 54.30,         77,
03/29/2012,23:55:00,1333065300, 300.0,  5.17,  0.05, 54.68,         76,
03/30/2012,00:00:00,1333065600, 300.4, 14.73,  5.22, 54.64,         79,
[visagehe@tsysl01 capacity]$
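The situation above can be reproduced in miniature. The following is a
hypothetical sketch (table and column names invented, SQLite standing in
for PostgreSQL, and `INSERT OR IGNORE` standing in for rejecting the
constraint violation): a unique key on (secondsSince1970,
interval_length) silently drops the exact repeats between daily files,
yet the 03/29 timestamp still ends up with two rows, one per interval
length:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical summary table with the unique key described above.
cur.execute("CREATE TABLE gs (epoch INTEGER, interval_length REAL, "
            "load1 REAL, UNIQUE (epoch, interval_length))")

# 03/28 appears identically in both daily files; 03/29 appears with a
# partial-day interval in one file and a full-day interval in the next.
for row in [(1332892800, 86409.0, 4.15),   # from 20120329.gs
            (1332979200, 57606.0, 4.17),   # from 20120329.gs (partial day)
            (1332892800, 86409.0, 4.15),   # from 20120330.gs (exact repeat)
            (1332979200, 86409.5, 4.16)]:  # from 20120330.gs (full day)
    cur.execute("INSERT OR IGNORE INTO gs VALUES (?,?,?)", row)

counts = cur.execute("SELECT epoch, COUNT(*) FROM gs GROUP BY epoch "
                     "ORDER BY epoch").fetchall()
print(counts)  # 1332979200 keeps two rows with different intervals
```

The exact repeat is rejected by the unique key, but the constraint
cannot see that the 57606.0 and 86409.5 rows describe the same day, which
is precisely the duplicate that then needs cleaning up.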

Re: Adding new and changed data

From: "Greg Sabino Mullane"
Date:

> Now my problem is importing the data in bulk, and to keep the version
> of the record that has the longest interval value (the fourth field in
> the CSV below). Refer to the entries of 03/29 of the *.gs files. The
...
> Any advice/ideas as to the K.I.S.S. to use/implement  insert/update
> instead of doing the select/delete search for duplicates?

One approach would be to load into a temporary table, add an index
on (date, time, interval_length), and then insert back into the main
table with a GROUP BY:

INSERT INTO realtable
  SELECT DISTINCT t1.*
  FROM temptable t1
  JOIN (SELECT date, time, MAX(interval_length) AS imax
        FROM temptable
        GROUP BY date, time
       ) AS t2
    ON t1.date = t2.date
   AND t1.time = t2.time
   AND t1.interval_length = t2.imax;
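End to end, the staged load behaves like the following sketch
(hypothetical table and column names; SQLite standing in for PostgreSQL,
with the same JOIN-on-MAX logic as the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE realtable (date TEXT, time TEXT, epoch INTEGER,"
            " interval_length REAL, load1 REAL)")
cur.execute("CREATE TABLE temptable (date TEXT, time TEXT, epoch INTEGER,"
            " interval_length REAL, load1 REAL)")

# Bulk-load both daily files into the staging table: 03/29 arrives
# twice, once as a partial day and once with the full ~86400 s interval.
rows = [
    ("03/29/2012", "00:00:00", 1332979200, 57606.0, 4.17),
    ("03/29/2012", "00:00:00", 1332979200, 86409.5, 4.16),
    ("03/30/2012", "00:00:00", 1333065600, 300.4, 14.73),
]
cur.executemany("INSERT INTO temptable VALUES (?,?,?,?,?)", rows)

# Keep only the row with the longest interval per (date, time).
cur.execute("""
    INSERT INTO realtable
    SELECT DISTINCT t1.*
    FROM temptable t1
    JOIN (SELECT date, time, MAX(interval_length) AS imax
          FROM temptable GROUP BY date, time) m
      ON t1.date = m.date AND t1.time = m.time
     AND t1.interval_length = m.imax
""")

kept = cur.execute("SELECT date, interval_length FROM realtable "
                   "ORDER BY epoch").fetchall()
print(kept)  # the 57606.0 partial-day row for 03/29 has been dropped
```

One caveat: DISTINCT only removes exact duplicate rows, so if two rows
ever tie on the maximum interval_length but differ in their performance
columns, both would survive the insert.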

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201204032320
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8