Thread: COPY command details
Hi,

If I have a PostgreSQL table with records and indexes already created, and I use COPY to load additional data, does COPY update the indexes during the load, afterwards, or not at all?

Benjamin
On Thu, 29 Mar 2007 at 00:13:09 -0700, Benjamin Arai wrote:
> Hi,
>
> If I have a PostgreSQL table with records and indexes already created,
> and I use COPY to load additional data, does COPY update the indexes
> during the load, afterwards, or not at all?

After, I think.

test=# create table foo (id int primary key, name text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1    a
>> 2    b
>> 2    c
>> \.
ERROR:  duplicate key violates unique constraint "foo_pkey"
CONTEXT:  COPY foo, line 3: "2 c"
test=*#

I can type the wrong key and the error only shows up later, with the final \.

Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG-ID: 0x3FFF606C, private 0x7F4584DA   http://wwwkeys.de.pgp.net
"A. Kretschmer" <andreas.kretschmer@schollglas.com> writes: > am Thu, dem 29.03.2007, um 0:13:09 -0700 mailte Benjamin Arai folgendes: >> If I have a PostgreSQL table with records and logical indexes already >> created, if I use COPY to load additional data, does the COPY update >> the indexes during, after, or not at all? > after, i think. > test=# create table foo (id int primary key, name text); > NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo" > CREATE TABLE > test=*# copy foo from stdin; > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself. > 1 a > 2 b > 2 c > \. > ERROR: duplicate key violates unique constraint "foo_pkey" > CONTEXT: COPY foo, line 3: "2 c" > test=*# > I can type the wrong key and the error occurs later with the finaly \. No, "during" is the right answer. The above only demonstrates that libpq buffers COPY data in larger-than-one-line units --- once the data gets to the backend it's inserted and checked a row at a time. regards, tom lane
So, is there a way to defer the index updating until a later point in time? More specifically, I would like to do several COPY loads into a running database, then afterwards force an update of the index via a vacuum or something similar.

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:
> After, I think.
>
> [psql example snipped]
On Thu, 29 Mar 2007 at 10:02:49 -0700, Benjamin Arai wrote:
> So, is there a way to defer the index updating until a later point in
> time? More specifically, I would like to do several COPY loads into a
> running database, then afterwards force an update of the index via a
> vacuum or something similar.

Drop the index(es), do the COPYs, recreate the index(es). You can do this within a transaction; a minimal sketch follows below.

Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG-ID: 0x3FFF606C, private 0x7F4584DA   http://wwwkeys.de.pgp.net
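A minimal sketch of that pattern, with made-up table, index, and file names (and note that an index backing a PRIMARY KEY would require dropping the constraint itself, not just the index):

    BEGIN;
    DROP INDEX foo_name_idx;                  -- hypothetical non-unique index on foo(name)
    COPY foo FROM '/data/batch1.dat';         -- server-side file paths; superuser COPY
    COPY foo FROM '/data/batch2.dat';
    CREATE INDEX foo_name_idx ON foo (name);  -- rebuilt once, after all the loads
    COMMIT;

Because DROP INDEX and CREATE INDEX are transactional, a failure anywhere in the block leaves the original index in place.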
Benjamin Arai wrote:
> So, is there a way to defer the index updating until a later point in
> time? More specifically, I would like to do several COPY loads into a
> running database, then afterwards force an update of the index via a
> vacuum or something similar.

Sure, drop the index, do the COPY, and then recreate the index.  That is done often.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
I would prefer not to drop the index because the database is several hundred gigs. I would prefer to incrementally add to the index.

Benjamin

Bruce Momjian wrote:
> Sure, drop the index, do the COPY, and then recreate the index.  That is
> done often.
On 03/29/07 14:41, Bruce Momjian wrote:
> Benjamin Arai wrote:
>> So, is there a way to defer the index updating until a later point in
>> time? More specifically, I would like to do several COPY loads into a
>> running database, then afterwards force an update of the index via a
>> vacuum or something similar.
>
> Sure, drop the index, do the COPY, and then recreate the index.  That is
> done often.

Something like "ALTER INDEX xxxx MAINTENANCE IS DISABLED" would be helpful, so that you don't have to drag (possibly changing) SQL around "everywhere".

An example from an evening load job would then be:

ALTER INDEX foobar_idx MAINTENANCE IS DISABLED;
COPY foo FROM 'blarg';
REINDEX INDEX foobar_idx;

That way, if the DBA decides that foobar_idx needs different fields, you don't have to edit the evening load job when the index changes.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
Benjamin Arai wrote:
> I would prefer not to drop the index because the database is several
> hundred gigs. I would prefer to incrementally add to the index.

I know of no way to do that in a batch, unless you go with partitioned tables.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On 03/29/07 14:51, Benjamin Arai wrote:
> I would prefer not to drop the index because the database is several
> hundred gigs. I would prefer to incrementally add to the index.

Some RDBMSs (well, one that I know of) have the ability to defer index updates during data load, and it actually works very well. The downside is that if there is a unique-constraint failure, you don't know which record it failed on.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
Benjamin Arai <benjamin@araisoft.com> writes:
> I would prefer not to drop the index because the database is several
> hundred gigs. I would prefer to incrementally add to the index.

This may well be false economy.  I don't have numbers at hand, but a
full rebuild can be substantially faster than adding a large number
of rows to the index incrementally.  Also, you don't end up with a
physically disordered index, so there is some ongoing performance
benefit.

			regards, tom lane
On 03/29/07 18:35, Tom Lane wrote:
> This may well be false economy.  I don't have numbers at hand, but a
> full rebuild can be substantially faster than adding a large number
> of rows to the index incrementally.  Also, you don't end up with a
> physically disordered index, so there is some ongoing performance
> benefit.

But deferring the index updates allows you to play games with the index input data, such as presorting it in order to take advantage of locality of data.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
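One way to approximate that in stock PostgreSQL, without deferred index maintenance, is to COPY into an unindexed staging table and then insert into the indexed table in key order. This is only a sketch; the table name foo, its id column, and the file path are hypothetical:

    BEGIN;
    CREATE TEMP TABLE staging (LIKE foo);               -- no indexes on the staging table
    COPY staging FROM '/data/batch.dat';                -- bulk load with no index overhead here
    INSERT INTO foo SELECT * FROM staging ORDER BY id;  -- feed rows to foo's index in key order
    COMMIT;

Only the index on the sort key really benefits from the ordering; with several indexes on the table, the drop-and-recreate approach above is still the simpler win.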
I agree, this is true if I cannot defer index updates. But if it is possible to defer index updates until the end, then I should be able to achieve some sort of speedup.

Rebuilding an index can't be the PostgreSQL solution for all cases. I am dealing with databases in the hundreds-of-gigs range and I am adding about 10 gigs of data a week. At some point it's going to take longer than a week to rebuild all of the indexes in the database.

On the other hand, if I am to partition the data into several tables then it might not be such a big deal, since I am only adding and never deleting... This makes it a little more of a pain in the ass.

Benjamin

Tom Lane wrote:
> This may well be false economy.  I don't have numbers at hand, but a
> full rebuild can be substantially faster than adding a large number
> of rows to the index incrementally.  Also, you don't end up with a
> physically disordered index, so there is some ongoing performance
> benefit.
I have one system in which I have used partitioning. For this particular case I have tons of data over about 50 years. What I did was write a small loader that breaks the data into tables based on date, so I have tables like abc_2000, abc_2001, etc. The loading script is only a couple hundred lines of code. The only part that was a little bit of work was to allow easy access to the data for the devs. I did this by writing a few PL functions to automatically union the tables and produce results. So a function like getData(startdate, enddate) would run a union query for the respective date ranges.

Benjamin

P.S. If I am doing anything that can be improved for the data-access portion, please let me know and feel free to voice your opinions. I am always looking for new ways to make this particular database faster.

Gerald Timothy G Quimpo wrote:
> I am leaning toward a policy of always partitioning large tables. I
> haven't found the time to do it properly yet; I'm thinking about it,
> hoping that someone who's done it will chime in with their rules of thumb.
> [snip]
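A rough sketch of such a union helper, not Benjamin's actual code: the abc_<year> naming is borrowed from his description, while the obs_date column, the assumption that all yearly tables share one row layout, and the RETURN NEXT style (for PL/pgSQL of that era) are guesses for illustration:

    CREATE OR REPLACE FUNCTION getData(startdate date, enddate date)
    RETURNS SETOF abc_2000 AS $$
    DECLARE
        y1  integer := EXTRACT(year FROM startdate)::int;
        y2  integer := EXTRACT(year FROM enddate)::int;
        yr  integer;
        rec abc_2000%ROWTYPE;
    BEGIN
        -- walk the per-year tables covered by the requested range
        FOR yr IN y1 .. y2 LOOP
            FOR rec IN EXECUTE
                'SELECT * FROM abc_' || yr ||
                ' WHERE obs_date BETWEEN ' || quote_literal(startdate::text) ||
                ' AND ' || quote_literal(enddate::text)
            LOOP
                RETURN NEXT rec;
            END LOOP;
        END LOOP;
        RETURN;
    END;
    $$ LANGUAGE plpgsql;

As the following messages point out, inheritance-based partitioning with constraint_exclusion makes this kind of helper unnecessary, since a plain query on the parent table already covers the children.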
On Thu, 2007-03-29 at 21:30 -0700, Benjamin Arai wrote:
> Rebuilding an index can't be the PostgreSQL solution for all cases. I am
> dealing with databases in the hundreds-of-gigs range and I am adding
> about 10 gigs of data a week. At some point it's going to take longer
> than a week to rebuild all of the indexes in the database.
>
> On the other hand, if I am to partition the data into several tables
> then it might not be such a big deal, since I am only adding and never
> deleting... This makes it a little more of a pain in the ass.

I am leaning toward a policy of always partitioning large tables. I haven't found the time to do it properly yet; I'm thinking about it, hoping that someone who's done it will chime in with their rules of thumb.

Like Benjamin, I have a database that is close to 600GB for 2.25 years of data, and if I were to add the other 4 years of data that we have archived away, it will easily go into the terabyte range. There are a few individual tables which approach 100GB all by themselves.

As it is, I can't afford to do reindex or even backup (pg_dump or any other method) or other administrative tasks on those tables, since the processes take too long. (There are workarounds: I could back up single tables at slack times, which would allow me to do a complete backup, though not self-consistent as a set, over the course of a week or so.)

So I'm leaning toward partitioning, perhaps selecting partition rules so that no table will be larger than around 5GB, at which point reindex or admin procedures that take exclusive locks become only minor inconveniences rather than showstoppers.

How do people take consistent backups of very large databases on Linux/FreeBSD? I'm aware of PITR, but might not be able to set aside a box with enough drives for it. LVM snapshot? Performance issues with LVM, etc.?

tiger
--
Gerald Timothy Quimpo   gerald.quimpo@qualservcentral.com
Business Systems Development, KFC/Mr Donut/Ramcar

There is nothing more dreadful than imagination without taste.
	-- Johann Wolfgang von Goethe
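For reference on the PITR option mentioned above, the basic setup in the 8.x releases of that time is roughly the following sketch; the archive path is made up, and the disk space needed for the WAL archive is exactly the concern raised here:

    # in postgresql.conf: continuously archive WAL segments
    archive_command = 'cp %p /mnt/wal_archive/%f'

    -- base backup, taken while the server keeps running
    SELECT pg_start_backup('weekly base backup');
    -- ... copy the data directory at the filesystem level (tar/rsync) ...
    SELECT pg_stop_backup();

Recovery then replays the archived WAL on top of the base backup, giving a consistent copy without ever taking the database down.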
On Thu, 2007-03-29 at 22:15 -0700, Benjamin Arai wrote:
> I have one system in which I have used partitioning. For this particular
> case I have tons of data over about 50 years. What I did was write a
> small loader that breaks the data into tables based on date, so I have
> tables like abc_2000, abc_2001, etc.
> [snip]

From reading on the list, I think the current recommendations are: have a parent table and then create the partitions as descendants (INHERITS) of the parent table. Use constraints to ensure that nothing gets inserted into the parent, and that only the correct data gets inserted into the descendants (i.e., the constraints enforce that 2002 data won't insert into the 2001 partition; the insert would fail if you tried that). Turn constraint_exclusion on in postgresql.conf (it helps the planner ignore partitions that don't qualify), so it doesn't need to scan partitions where there will be no matches anyway. Use rules on the parent to redirect inserts/updates/deletes to the right partition (or triggers; I'm having some problems with both rules and triggers, so I may have the loader insert straight into the right partition instead, although I'm still trying to get the rules right). See the sketch after this message.

The advantage of this (parent table plus descendant partitions) is that you can query from the parent, and the descendants will be automatically queried; you'll get all matching rows from any descendants that have them. No need for those unions or pl/pgsql functions to do the unions for you.

That said, I need to make my current code work with smaller test data sets so I can isolate the problems I'm having with rules, etc. E.g., my original plan was to detect automatically (in a before trigger) whether a partition for the to-be-loaded data set already exists, and if it doesn't, create the partitions and the rules. I then load the data into the parent table and expect the newly created rules to redirect the data into the new partitions. Either my rules are wrong, or there's something about transaction semantics that's making that fail, since the rows are inserting into the parent table (no constraint there yet to disallow inserts into the parent) instead of the right partition.

To the list: are schema changes (such as adding rules and creating child partitions) part of the same transaction, or do they happen (magically) outside the current transaction, in a new transaction (e.g., so that the inserts would fail because they are running in an older transaction that can't see the new rule or new partition table yet)? As I said, I might just be doing something wrong there.

I've set that project aside for now since it's not urgent (the current unpartitioned system works well enough, I just avoid reindex, vacuum, vacuum full and pg_dump), but I'll start working on it again when I get some free time in a week or two (Holy Week, holiday where I am :-).

tiger
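A minimal sketch of that layout, reusing the hypothetical abc/abc_<year> names from earlier in the thread (the obs_date column and the yearly ranges are assumptions):

    CREATE TABLE abc (
        id       integer,
        obs_date date,
        val      numeric
    );

    -- one child per year, with a CHECK constraint the planner can use
    CREATE TABLE abc_2001 (
        CHECK (obs_date >= DATE '2001-01-01' AND obs_date < DATE '2002-01-01')
    ) INHERITS (abc);

    CREATE TABLE abc_2002 (
        CHECK (obs_date >= DATE '2002-01-01' AND obs_date < DATE '2003-01-01')
    ) INHERITS (abc);

    -- redirect inserts against the parent into the matching child
    CREATE RULE abc_insert_2002 AS ON INSERT TO abc
        WHERE (NEW.obs_date >= DATE '2002-01-01' AND NEW.obs_date < DATE '2003-01-01')
        DO INSTEAD INSERT INTO abc_2002 VALUES (NEW.*);

    -- set in postgresql.conf, or per session, so non-matching children are skipped
    SET constraint_exclusion = on;

    -- queries go against the parent; matching children are scanned automatically
    SELECT * FROM abc WHERE obs_date BETWEEN '2002-03-01' AND '2002-03-31';

One caveat relevant to the loader discussion: COPY into the parent does not fire INSERT rules, so bulk loads generally have to go straight into the correct child table (or use a trigger instead of a rule).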
The problem I'm trying to solve is pretty standard. I have a table that records measurements of different types at different times.

CREATE TABLE measurements (
  obj_id int4,
  val_type_id int4 references lu_val_type(val_type_id),
  val numeric,
  observation_date date
);

I want a query as simple and fast as possible to return the latest observation of each type for each object. I sent a message to this list a while ago, and the suggestion I found to be the best compromise of clarity and speed was:
a) create an index on (obj_id, val_type_id, observation_date);
b) the "obvious" query becomes fast thanks to the index:

select ms.*
from (
   select obj_id, val_type_id, max(observation_date) as observation_date
   from measurements
   group by obj_id, val_type_id
) ms_last
join measurements ms using (obj_id, val_type_id, observation_date);

It still bugged me a bit that this requires a JOIN, especially since in a procedural language it would have been so easy to return the value associated with the max(observation_date). I think I've found a pretty good alternative. At the moment it works if we keep track of time with an integer rather than a date, but it would be readily extensible.

The idea is to, in fact, associate the value with the max(observation_date), like so:

select obj_id, val_type_id, max(array[observation_date, val])
from measurements
group by obj_id, val_type_id;

There are two caveats:
a) array requires elements to be of the same type, so observation_date must be kept as a number ("time from" some reference point);
b) a row constructor would be ideal here, but there is no max() function for rowtypes.

If I did have a max() function for row types, it would be clean to do this:

select obj_id, val_type_id, max(row(observation_date, val))
from measurements
group by obj_id, val_type_id;

Now, it seems that since rowtype comparison is built in, it should be pretty easy to build a max() aggregate for it. Has anybody done this? I'd have looked at the code for max(anyarray), but I don't know how to access it. Can someone point me in the right direction?

Also, has someone thought about this before? I'm wondering if there will be a speed gain coming from this.

Thank you,
Jaime
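One alternative not raised in the thread for this latest-row-per-group problem is DISTINCT ON, which avoids both the self-join and a custom aggregate and can use the same (obj_id, val_type_id, observation_date) index. A sketch against the table above:

select distinct on (obj_id, val_type_id)
       obj_id, val_type_id, observation_date, val
from measurements
order by obj_id, val_type_id, observation_date desc;

DISTINCT ON keeps the first row of each (obj_id, val_type_id) group in the given sort order, which here is the most recent observation.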
In case anyone is interested, I was able to solve this, more or less. Here's my new "latest value" query:

select obj_id, val_type_id, (max(row(observation_date, val))).val
from measurements
group by obj_id, val_type_id

It was only necessary to define a new (date, numeric) type. Below is the code. The performance is actually slower than using a JOIN between the table and its GROUP-BY version. I guess for performance I should code the functions in C, but at the moment the value for me is that it simplifies a lot of my 12-way join queries!

create type dtval as (
  dt date,
  val numeric
);

create function dtval_smaller(dtval, dtval) returns dtval as $$
  select case when $1.dt < $2.dt then $1 else $2 end
$$ language sql;

create aggregate min (
  sfunc = dtval_smaller,
  basetype = dtval,
  stype = dtval
);

create function dtval_larger(dtval, dtval) returns dtval as $$
  select case when $1.dt > $2.dt then $1 else $2 end
$$ language sql;

create aggregate max (
  sfunc = dtval_larger,
  basetype = dtval,
  stype = dtval
);

Jaime Silvela wrote:
> If I did have a max() function for row types, it would be clean to do this:
>
> select obj_id, val_type_id, max(row(observation_date, val))
> from measurements
> group by obj_id, val_type_id;
>
> Now, it seems that since rowtype comparison is built in, it should be
> pretty easy to build a max() aggregate for it. Has anybody done this?
> [snip]
Would it speed things up significantly if you set the dtval_smaller() function to be immutable? Volatile is the default, so it may be redundantly evaluating things (see the sketch after the quoted code below).

Jaime Silvela wrote:
> In case anyone is interested, I was able to solve this, more or less.
> Here's my new "latest value" query:
>
> select obj_id, val_type_id, (max(row(observation_date, val))).val
> from measurements
> group by obj_id, val_type_id
>
> It was only necessary to define a new (date, numeric) type. Below is the
> code. The performance is actually slower than using a JOIN between the
> table and its GROUP-BY version.
> [snip]
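A sketch of that change, re-declaring the functions from the quoted message with the volatility marker added (CREATE OR REPLACE keeps the existing aggregates pointing at them):

create or replace function dtval_smaller(dtval, dtval) returns dtval as $$
  select case when $1.dt < $2.dt then $1 else $2 end
$$ language sql immutable;

create or replace function dtval_larger(dtval, dtval) returns dtval as $$
  select case when $1.dt > $2.dt then $1 else $2 end
$$ language sql immutable;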
Good idea. I tried it and got a 12% decrease in execution time! Still slower than the usual JOIN, but not by that much.

William Garrison wrote:
> Would it speed things up significantly if you set the dtval_smaller()
> function to be immutable? Volatile is the default, so it may be
> redundantly evaluating things.
On 03/29/07 23:56, Gerald Timothy G Quimpo wrote:
[snip]
> How do people take consistent backups of very large databases on
> Linux/FreeBSD? I'm aware of PITR, but might not be able to set aside a
> box with enough drives for it. LVM snapshot? Performance issues with
> LVM, etc.?

If PG allows partitioned tables to span schemas, then I'd do that, so that only one (relatively small, say quarterly or semi-annual) schema is ever "hot" and needs to be backed up.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!