Thread: COPY command details

COPY command details

From
Benjamin Arai
Date:
Hi,

If I have a PostgreSQL table with records and logical indexes already
created, if I use COPY to load additional data, does the COPY update
the indexes during, after, or not at all?

Benjamin

Re: COPY command details

From
"A. Kretschmer"
Date:
On Thu, 29 Mar 2007 at 00:13:09 -0700, Benjamin Arai wrote the following:
> Hi,
>
> If I have a PostgreSQL table with records and logical indexes already
> created, if I use COPY to load additional data, does the COPY update
> the indexes during, after, or not at all?

After, I think.

test=# create table foo (id int primary key, name text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
test=*# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1    a
>> 2    b
>> 2    c
>> \.
ERROR:  duplicate key violates unique constraint "foo_pkey"
CONTEXT:  COPY foo, line 3: "2  c"
test=*#


I can type the wrong key and the error only occurs later, with the final \.


Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: COPY command details

From
Tom Lane
Date:
"A. Kretschmer" <andreas.kretschmer@schollglas.com> writes:
> On Thu, 29 Mar 2007 at 00:13:09 -0700, Benjamin Arai wrote the following:
>> If I have a PostgreSQL table with records and logical indexes already
>> created, if I use COPY to load additional data, does the COPY update
>> the indexes during, after, or not at all?

> After, I think.

> test=# create table foo (id int primary key, name text);
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
> CREATE TABLE
> test=*# copy foo from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself.
> 1    a
> 2    b
> 2    c
> \.
> ERROR:  duplicate key violates unique constraint "foo_pkey"
> CONTEXT:  COPY foo, line 3: "2  c"
> test=*#

> I can type the wrong key and the error only occurs later, with the final \.

No, "during" is the right answer.  The above only demonstrates that
libpq buffers COPY data in larger-than-one-line units --- once the
data gets to the backend it's inserted and checked a row at a time.
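
A simple way to see the row-at-a-time behavior without psql's client-side
buffering in the way is to COPY from a server-side file (the path and data
below are only an illustration):

    -- /tmp/foo.dat (hypothetical file) holds the same three rows:
    --   1   a
    --   2   b
    --   2   c
    copy foo from '/tmp/foo.dat';
    -- fails with the same duplicate-key error, CONTEXT pointing at line 3;
    -- the first two rows were inserted and index-checked as they were read,
    -- and the error then rolls back the whole COPY.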

            regards, tom lane

Re: COPY command details

From
Benjamin Arai
Date:
So, is there a way to defer the index updating until a later period
of time?  More specifically, I would like to do several COPIES to a
running database, then afterward force an update on the index via a
vacuum or something similar.

Benjamin

On Mar 29, 2007, at 1:03 AM, A. Kretschmer wrote:

> On Thu, 29 Mar 2007 at 00:13:09 -0700, Benjamin Arai wrote the following:
>> Hi,
>>
>> If I have a PostgreSQL table with records and logical indexes already
>> created, if I use COPY to load additional data, does the COPY update
>> the indexes during, after, or not at all?
>
> After, I think.
>
> test=# create table foo (id int primary key, name text);
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
> "foo_pkey" for table "foo"
> CREATE TABLE
> test=*# copy foo from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself.
>>> 1    a
>>> 2    b
>>> 2    c
>>> \.
> ERROR:  duplicate key violates unique constraint "foo_pkey"
> CONTEXT:  COPY foo, line 3: "2  c"
> test=*#
>
>
> I can type the wrong key and the error only occurs later, with the final \.
>
>
> Andreas
> --
> Andreas Kretschmer
> Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
> GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net


Re: COPY command details

From
"A. Kretschmer"
Date:
On Thu, 29 Mar 2007 at 10:02:49 -0700, Benjamin Arai wrote the following:
> So, is there a way to defer the index updating until a later period
> of time?  More specifically, I would like to do several COPIES to a
> running database, then afterward force an update on the index via a
> vacuum or something similar.

Drop the index(es), do the COPIES, recreate the index(es). You can do this
within a transaction.
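
A minimal sketch of that, with made-up index and file names:

    begin;
    drop index foo_name_idx;
    copy foo from '/path/to/new_data';
    create index foo_name_idx on foo (name);
    commit;

Note that DROP INDEX takes an exclusive lock, so other sessions are blocked
from using the table until the COMMIT.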


Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: COPY command details

From
Bruce Momjian
Date:
Benjamin Arai wrote:
> So, is there a way to defer the index updating until a later period
> of time?  More specifically, I would like to do several COPIES to a
> running database, then afterward force an update on the index via a
> vacuum or something similar.

Sure, drop the index, do the COPY, and then recreate the index.  That is
done often.


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: COPY command details

From
Benjamin Arai
Date:
I would prefer not to drop the index because the database is several
hundred gigs.  I would prefer to incrementally add to the index.

Benjamin

Bruce Momjian wrote:
> Benjamin Arai wrote:
>
>> So, is there a way to defer the index updating until a later period
>> of time?  More specifically, I would like to do several COPIES to a
>> running database, then afterward force an update on the index via a
>> vacuum or something similar.
>>
>
> Sure, drop the index, do the COPY, and then recreate the index.  That is
> done often.

Re: COPY command details

From
Ron Johnson
Date:
On 03/29/07 14:41, Bruce Momjian wrote:
> Benjamin Arai wrote:
>> So, is there a way to defer the index updating until a later period
>> of time?  More specifically, I would like to do several COPIES to a
>> running database, then afterward force an update on the index via a
>> vacuum or something similar.
>
> Sure, drop the index, do the COPY, and then recreate the index.  That is
> done often.

Something like ALTER INDEX xxxx MAINTENANCE IS DISABLED would be helpful, so
that you don't have to drag (possibly changing) SQL around
"everywhere".

So an example from an evening load job would be:
  ALTER INDEX foobar_idx MAINTENANCE IS DISABLED;
  COPY foo FROM 'blarg';
  REINDEX INDEX foobar_idx;

So if the DBA decides that foobar_idx needs different fields, you
don't have to edit the evening load job when the index changes.
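
PostgreSQL has no such ALTER INDEX form, but the spirit of it (not having to
hard-code the index DDL in the load job) can be approximated by pulling the
definition out of the catalog before dropping it.  A rough sketch for a
reasonably recent server, treating foobar_idx, foo and 'blarg' as the
placeholders they already are above:

    DO $do$
    DECLARE
        idx_def text;
    BEGIN
        -- remember exactly how the index is currently defined
        idx_def := pg_get_indexdef('foobar_idx'::regclass);
        EXECUTE 'DROP INDEX foobar_idx';
        EXECUTE 'COPY foo FROM ' || quote_literal('blarg');
        -- recreate it from the saved definition, whatever the DBA made it
        EXECUTE idx_def;
    END
    $do$;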

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: COPY command details

From
Bruce Momjian
Date:
Benjamin Arai wrote:
> I would prefer not to drop the index because the database is several
> hundred gigs.  I would prefer to incrementally add to the index.

I know of no way to do that in a batch, unless you go with partitioned
tables.


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: COPY command details

From
Ron Johnson
Date:
On 03/29/07 14:51, Benjamin Arai wrote:
> I would prefer not to drop the index because the database is several
> hundred gigs.  I would prefer to incrementally add to the index.

Some RDBMSs (well, one that I know of) have the ability to defer
index updates during data load, and it actually works very well.

The down side is that if there's a unique value constraint failure,
you don't know which record it failed on.

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: COPY command details

From
Tom Lane
Date:
Benjamin Arai <benjamin@araisoft.com> writes:
> I would prefer not to drop the index because the database is several
> hundred gigs.  I would prefer to incrementally add to the index.

This may well be false economy.  I don't have numbers at hand, but a
full rebuild can be substantially faster than adding a large number
of rows to the index incrementally.  Also, you don't end up with a
physically disordered index, so there is some ongoing performance
benefit.
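
If the rebuild route is taken, the usual knob for making the CREATE INDEX
step fast is maintenance_work_mem; something along these lines (the index
name and the size are only an example):

    set maintenance_work_mem = '512MB';  -- session-local sort budget for the build
    create index foo_name_idx on foo (name);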

            regards, tom lane

Re: COPY command details

From
Ron Johnson
Date:
On 03/29/07 18:35, Tom Lane wrote:
> Benjamin Arai <benjamin@araisoft.com> writes:
>> I would prefer not to drop the index because the database is several
>> hundred gigs.  I would prefer to incrementally add to the index.
>
> This may well be false economy.  I don't have numbers at hand, but a
> full rebuild can be substantially faster than adding a large number
> of rows to the index incrementally.  Also, you don't end up with a
> physically disordered index, so there is some ongoing performance
> benefit.

But deferring the index updates allows you to play games with the
index input data, such as presorting it in order to take
advantage of locality of data.
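
In stock PostgreSQL one way to get that effect is to COPY into an unindexed
staging table first and then insert into the real table in key order; a
sketch, with assumed table and column names:

    create temp table foo_stage (like foo);
    copy foo_stage from '/path/to/batch_data';
    -- feed the indexed table in key order for better index locality
    insert into foo select * from foo_stage order by id;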

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!


Re: COPY command details

From
Benjamin Arai
Date:
I agree, this is true if I cannot defer index updates.  But if it is
possible to defer index updates until the end, then I should be able to
achieve some sort of speedup.  Rebuilding an index can't be the
PostgreSQL solution for all cases. I am dealing with databases in the
hundreds of gigs range and I am adding about 10 gigs of data a week.  At
some point it's going to take longer than a week to rebuild all of the
indexes in the database.

On the other hand, if I am to partition the data into several tables
then it might not be such a big deal since I am only adding and never
deleting... This makes it a little more of a pain in the ass.

Benjamin

Tom Lane wrote:
> Benjamin Arai <benjamin@araisoft.com> writes:
>
>> I would prefer not to drop the index because the database is several
>> hundred gigs.  I would prefer to incrementally add to the index.
>>
>
> This may well be false economy.  I don't have numbers at hand, but a
> full rebuild can be substantially faster than adding a large number
> of rows to the index incrementally.  Also, you don't end up with a
> physically disordered index, so there is some ongoing performance
> benefit.
>
>             regards, tom lane

Re: COPY command details

From
Benjamin Arai
Date:
I have one system in which I have used partitioning.  For this particular
case I have tons of data spanning about 50 years.  What I did was write a
small loader that breaks the data into tables based on date, so I have tables
like abc_2000, abc_2001, etc.  The loading script is only a couple
hundred lines of code.  The only part that was a little bit of work was
allowing easy access to the data for the devs.  I did this by writing a
few PL functions to automatically union the tables and produce results.
So a function like getData(startdate, enddate) would run a union query
for the respective date ranges.
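
A rough sketch of the kind of helper described above; the abc_YYYY layout and
the obs_date column are assumptions, not necessarily the poster's schema:

    create or replace function getData(startdate date, enddate date)
    returns setof abc_2000        -- any one partition supplies the row type
    language plpgsql as $$
    declare
        q text := '';
    begin
        for yr in extract(year from startdate)::int .. extract(year from enddate)::int loop
            if q <> '' then
                q := q || ' union all ';
            end if;
            q := q || 'select * from abc_' || yr
                   || ' where obs_date between ' || quote_literal(startdate::text)
                   || ' and ' || quote_literal(enddate::text);
        end loop;
        return query execute q;
    end;
    $$;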

Benjamin

P.S. If I am doing anything that can be improved for the data access
portion, please let me know and feel free to voice your opinions.  I am
always looking for new ways to make this particular database faster.

Gerald Timothy G Quimpo wrote:
> On Thu, 2007-03-29 at 21:30 -0700, Benjamin Arai wrote:
>
>> Rebuilding an index can't be the PostgreSQL solution for all
>> cases. I am dealing with databases in the hundreds of gigs
>> range and I am adding about 10 gigs of data a week.  At
>> some point it's going to take longer than a week to rebuild
>> all of the indexes in the database.
>>
>> On the other hand, if I am to partition the data into
>> several tables then it might not be such a big deal since
>> I am only adding and never deleting... This makes it a
>> little more of a pain in the ass.
>>
>
> I am leaning toward a policy of always partitioning large
> tables.  I haven't found the time to do it properly yet,
> thinking about it, hoping that someone who's done it will
> chime in with their rules of thumb.
>
> Like Benjamin, I have a database that is close to 600GB
> for 2.25 years of data, and if I were to add the other
> 4 years of data that we have archived away, it will easily go
> into the terabyte range.  There are a few individual tables
> which approach 100GB all by themselves.
>
> As it is, I can't afford to do reindex or even backup
> (pg_dump or any other method) or other administrative tasks
> on those tables since the processes take too long (there are
> workarounds: I could back up single tables at slack times,
> which would allow me to do a complete backup (but not
> self-consistent as a set) over the course of a week or so).
>
> So I'm leaning toward partitioning, perhaps selecting
> partition rules so that no table will be larger than
> around 5GB, at which point, reindex or admin procedures
> that take exclusive locks now become only minor
> inconveniences rather than showstoppers.
>
> How do people take consistent backups of very large
> databases on Linux/FreeBSD?  I'm aware of PITR, but
> might not be able to set aside a box with enough
> drives for it.  LVM Snapshot? performance issues with
> LVM, etc?
>
> tiger
>
>

Re: COPY command details

From
Tiger Quimpo
Date:
On Thu, 2007-03-29 at 21:30 -0700, Benjamin Arai wrote:
> Rebuilding an index can't be the PostgreSQL solution for all
> cases. I am dealing with databases in the hundreds of gigs
> range and I am adding about 10 gigs of data a week.  At
> some point it's going to take longer than a week to rebuild
> all of the indexes in the database.
>
> On the other hand, if I am to partition the data into
> several tables then it might not be such a big deal since
> I am only adding and never deleting... This makes it a
> little more of a pain in the ass.

I am leaning toward a policy of always partitioning large
tables.  I haven't found the time to do it properly yet,
thinking about it, hoping that someone who's done it will
chime in with their rules of thumb.

Like Benjamin, I have a database that is close to 600GB
for 2.25 years of data, and if I were to add the other
4 years of data that we have archived away, it will easily go
into the terabyte range.  There are a few individual tables
which approach 100GB all by themselves.

As it is, I can't afford to do reindex or even backup
(pg_dump or any other method) or other administrative tasks
on those tables since the processes take too long (there are
workarounds: I could back up single tables at slack times,
which would allow me to do a complete backup (but not
self-consistent as a set) over the course of a week or so).

So I'm leaning toward partitioning, perhaps selecting
partition rules so that no table will be larger than
around 5GB, at which point, reindex or admin procedures
that take exclusive locks now become only minor
inconveniences rather than showstoppers.

How do people take consistent backups of very large
databases on Linux/FreeBSD?  I'm aware of PITR, but
might not be able to set aside a box with enough
drives for it.  LVM Snapshot? performance issues with
LVM, etc?

tiger

--
Gerald Timothy Quimpo   gerald.quimpo@qualservcentral.com
Business Systems Development, KFC/Mr Donut/Ramcar

   There is nothing more dreadful than imagination without taste.
        -- Johann Wolfgang von Goethe


Re: COPY command details

From
Tiger Quimpo
Date:
On Thu, 2007-03-29 at 22:15 -0700, Benjamin Arai wrote:
> I have one system in which I have used partitioning.  For this particular
> case I have tons of data spanning about 50 years.  What I did was write a
> small loader that breaks the data into tables based on date, so I have
> tables like abc_2000, abc_2001, etc.  The loading script is only a couple
> hundred lines of code.  The only part that was a little bit of work was
> allowing easy access to the data for the devs.  I did this by writing a
> few PL functions to automatically union the tables and produce results.
> So a function like getData(startdate, enddate) would run a union query
> for the respective date ranges.

From reading on the list, I think the current recommendations
are:

  have a parent table and then create the partitions as
  descendants (INHERITS) of the parent table.

  Use constraints to ensure that nothing gets inserted
  into the parents, and that only the correct data gets
  inserted into the descendants (i.e., the constraints
  enforce that 2002 data won't insert into the 2001 partition,
  the insert would fail if you tried that).

  Turn constraint_exclusion on in postgresql.conf (helps
  the planner ignore partitions that don't qualify), so
  it doesn't need to scan partitions where there will be no
  matches anyway.

  Use rules on the parent to redirect inserts/updates/deletes
  to the right partition (or a trigger; I'm having some problems
  with both rules and triggers, so I may have the loader
  insert straight into the right partition instead, although
  I'm still trying to get the rules right).

The advantage of this (parent table plus descendant partitions)
is that you can query from the parent and the descendants will
be queried automatically; you'll get all matching rows from any
descendants that have them.  No need for those unions or
pl/pgsql functions to do the unions for you.
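
A compressed sketch of that recipe, reusing the abc_YYYY naming from earlier
in the thread (the column names are made up):

    create table abc (obs_date date not null, payload text);

    create table abc_2001 (
        check (obs_date >= '2001-01-01' and obs_date < '2002-01-01')
    ) inherits (abc);

    -- route 2001 rows into the 2001 partition
    create rule abc_insert_2001 as
        on insert to abc
        where new.obs_date >= '2001-01-01' and new.obs_date < '2002-01-01'
        do instead insert into abc_2001 values (new.*);

    -- in postgresql.conf, or per session:
    set constraint_exclusion = on;

    -- queries hit the parent; only children whose CHECKs can match are scanned
    select * from abc where obs_date between '2001-03-01' and '2001-03-31';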

That said, I need to make my current code work with smaller
test data sets so I can isolate the problems I'm having with
rules, etc.  E.g., my original plan was to detect automatically
(in a before trigger) if a partition for the to-be-loaded data
set already exists, and if it doesn't, create the partitions
and the rules.  I then load the data into the parent table and
expect the newly created rules to redirect the data into the
new partitions.  Either my rules are wrong, or there's something
about transaction semantics that's making that fail, since the
rows are being inserted into the parent table (no constraint there
yet to disallow inserts into the parent) instead of the right
partition.

To The List:
Are schema changes (such as adding rules and creating child
partitions) part of the same transaction, or do they happen
(magically) outside the current transaction, in a new
transaction (e.g., so that the inserts would fail because
the inserts are running in an older transaction that can't
see the new rule or new partition table yet)?

As I said, I might just be doing something wrong there.
I've set that project aside for now since it's not
urgent (the current unpartitioned system works well enough,
I just avoid reindex, vacuum, vacuum full and pg_dump), but
I'll start working on it again when I get some free time in
a week or two (Holy Week, holiday where I am :-).

tiger




time series query

From
Jaime Silvela
Date:
The problem I'm trying to solve is pretty standard. I have a table that
records measurements of different types at different times.

CREATE TABLE measurements (
  obj_id int4,
  val_type_id int4 references lu_val_type(val_type_id),
  val numeric,
  observation_date date
);

I want a query as simple and fast as possible to return the latest
observation of each type for each object.
I sent a message to this list a while ago, and the suggestion I found to
be the best compromise of clarity and speed was:
a) create an index on (obj_id, val_type_id, observation_date)
b) the "obvious" query becomes fast thanks to the index.
    select ms.*
    from (
        select obj_id, val_type_id, max(observation_date) as
observation_date
        from measurements
       group by obj_id, val_type_id
    ) ms_last
    join measurements ms using (obj_id, val_type_id, observation_date);
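
Spelled out, the index in (a) is just something like this (the name is
arbitrary):

    create index measurements_latest_idx
        on measurements (obj_id, val_type_id, observation_date);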

It still bugged me a bit that this requires a JOIN, especially since in
a procedural language, it would have been so easy to return the value
associated with the max(observation_date).
I think I've found a pretty good alternative. This at the moment works
if we keep track of time with an integer, rather than a date, but it
would be readily extensible.

The idea is, in fact, to associate the value with the
max(observation_date), like so:
select obj_id, val_type_id, max(array[observation_date, val])
from measurements
group by obj_id, val_type_id;

There are two caveats:
a) arrays require elements to be of the same type, so observation_date
must be kept as a number ("time from" some reference) rather than a date
b) a row constructor would be ideal here, but there is no max function
for rowtypes.

If I did have a max() function for row types, it would be clean to do this:
select obj_id, val_type_id, max(row(observation_date, val))
from measurements
group by obj_id, val_type_id;

Now, it seems that since rowtype comparison is built in, it should be
pretty easy to build a max() aggregate for it. Has anybody done this?
I'd have looked at the code for max(anyarray) but I don't know how to
access it. Can someone point me in the right direction?

Also, has someone thought about this before? I'm wondering if there will
be a speed gain coming from this.

Thank you,
Jaime



Re: time series query

From
Jaime Silvela
Date:
In case anyone is interested, I was able to solve this, more or less.
Here's my new "Latest value" query:

  select obj_id, val_type_id, (max(row(observation_date, val))).val
  from measurements
  group by obj_id, val_type_id

It was only necessary to define a new (date, numeric) type. Below is the
code. The performance is actually slower than using a JOIN between the
table and its  GROUP-BY  version. I guess for performance, I should code
the functions in C, but at the moment, the value for me is that it
simplifies a lot of my 12-way join queries!

create type dtval as (
  dt date,
  val numeric
);

create  function dtval_smaller(dtval, dtval) returns dtval as $$
  select case when $1.dt < $2.dt then $1 else $2 end
$$ language sql;

create aggregate min (
  sfunc = dtval_smaller,
  basetype = dtval,
  stype = dtval
);

create  function dtval_larger(dtval, dtval) returns dtval as $$
  select case when $1.dt > $2.dt then $1 else $2 end
$$ language sql;

create aggregate max (
  sfunc = dtval_larger,
  basetype = dtval,
  stype = dtval
);



Jaime Silvela wrote:
> The problem I'm trying to solve is pretty standard. I have a table
> that records measurements of different types at different times.
>
> CREATE TABLE measurements (
>  obj_id int4,
>  val_type_id int4 references lu_val_type(val_type_id),
>  val numeric,
>  observation_date date
> );
>
> I want a query as simple and fast as possible to return the latest
> observation of each type for each object.
> I sent a message to this list a while ago, and the suggestion I found
> to be the best compromise of clarity and speed was:
> a) create an index on (obj_id, val_type_id, observation_date)
> b) the "obvious" query becomes fast thanks to the index.
>    select ms.*
>    from (
>        select obj_id, val_type_id, max(observation_date) as
> observation_date
>        from measurements
>       group by obj_id, val_type_id
>    ) ms_last
>    join measurements ms using (obj_id, val_type_id, observation_date);
>
> It still bugged me a bit that this requires a JOIN, especially since
> in a procedural language, it would have been so easy to return the
> value associated with the max(observation_date).
> I think I've found a pretty good alternative. This at the moment works
> if we keep track of time with an integer, rather than a date, but it
> would be readily extensible.
>
> The idea is, in fact, to associate the value with the
> max(observation_date), like so:
> select obj_id, val_type_id, max(array[observation_date, val])
> from measurements
> group by obj_id, val_type_id;
>
> There are two caveats:
> a) arrays require elements to be of the same type, so observation_date
> must be kept as a number ("time from" some reference) rather than a date
> b) a row constructor would be ideal here, but there is no max
> function for rowtypes.
>
> If I did have a max() function for row types, it would be clean to do
> this:
> select obj_id, val_type_id, max(row(observation_date, val))
> from measurements
> group by obj_id, val_type_id;
>
> Now, it seems that since rowtype comparison is built in, it should be
> pretty easy to build a max() aggregate for it. Has anybody done this?
> I'd have looked at the code for max(anyarray) but I don't know how to
> access it. Can someone point me in the right direction?
>
> Also, has someone thought about this before? I'm wondering if there
> will be a speed gain coming from this.
>
> Thank you,
> Jaime



Re: time series query

From
William Garrison
Date:
Would it speed things up significantly if you set the dtval_smaller()
function to be immutable?  Volatile is the default, so it may be
redundantly evaluating things.
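
Concretely, that would just mean adding the IMMUTABLE flag when (re)defining
the support functions, e.g.:

    create or replace function dtval_smaller(dtval, dtval) returns dtval as $$
      select case when $1.dt < $2.dt then $1 else $2 end
    $$ language sql immutable;
    -- and likewise for dtval_larger()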

Jaime Silvela wrote:
> In case anyone is interested, I was able to solve this, more or less.
> Here's my new "Latest value" query:
>
>  select obj_id, val_type_id, (max(row(observation_date, val))).val
>  from measurements
>  group by obj_id, val_type_id
>
> It was only necessary to define a new (date, numeric) type. Below is the
> code. The performance is actually slower than using a JOIN between the
> table and its  GROUP-BY  version. I guess for performance, I should code
> the functions in C, but at the moment, the value for me is that it
> simplifies a lot of my 12-way join queries!
>
> create type dtval as (
>  dt date,
>  val numeric
> );
>
> create  function dtval_smaller(dtval, dtval) returns dtval as $$
>  select case when $1.dt < $2.dt then $1 else $2 end
> $$ language sql;
>
> create aggregate min (
>  sfunc = dtval_smaller,
>  basetype = dtval,
>  stype = dtval
> );
>
> create  function dtval_larger(dtval, dtval) returns dtval as $$
>  select case when $1.dt > $2.dt then $1 else $2 end
> $$ language sql;
>
> create aggregate max (
>  sfunc = dtval_larger,
>  basetype = dtval,
>  stype = dtval
> );


Re: time series query

From
Jaime Silvela
Date:
Good idea. I tried it and got a 12% decrease in execution time!
Still slower than the usual JOIN, but not by that much.

William Garrison wrote:
> Would it speed things up significantly if you set the dtval_smaller()
> function to be immutable?  Volatile is the default, so it may be
> redundantly evaluating things.



Re: COPY command details

From
Ron Johnson
Date:
On 03/29/07 23:56, Gerald Timothy G Quimpo wrote:
[snip]
>
> How do people take consistent backups of very large
> databases on Linux/FreeBSD?  I'm aware of PITR, but
> might not be able to set aside a box with enough
> drives for it.  LVM Snapshot? performance issues with
> LVM, etc?

If PG allows partitioned tables to span schemas, then I'd do that,
so that only one (relatively small, say quarterly or semi-annual)
schema is ever "hot" and needs to be backed up.
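
Inheritance children can in fact live in a different schema from their
parent, so something along these lines works (all names made up):

    create schema hot_2007q1;
    create table hot_2007q1.abc_2007q1 (
        check (obs_date >= '2007-01-01' and obs_date < '2007-04-01')
    ) inherits (public.abc);

and the active schema can then be dumped on its own with pg_dump -n hot_2007q1.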

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
