Thread: Large DB
Hello,

I have a single table that just went over 234GB in size with about 290M+ rows. I think that I'm starting to approach some limits, since things have gotten quite a bit slower over the last couple of days. The table is really simple and I'm mostly doing simple data mining queries like the query included below. These have gone from taking under a few minutes to taking tens of minutes over the last week or so (a query like the one below would generally return 1-6 million rows from the base table). The queries use the indexes fairly well, although I suspect that the order of host/starttime is suboptimal (fewer hosts than starttimes, and the table is naturally in starttime order). I'm going to try adding an index on just starttime (and later just host) and see if I can tune the queries on that more. I never delete rows from the table, only do inserts (up to around 11,000/minute, mostly in one big burst every minute; this is anticipated to go up some over time). There are about 32 processes doing the inserts (on the same machine - yeah, I know it'd be happier if they moved); I think it might help if there was only one, but for architectural reasons that won't happen for a while.

This is on a dual 3GHz Xeon with 4GB RAM and an IDE-SCSI RAID array (ACNC). I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd tried FreeBSD 4/5 early on, but the insert speeds were actually better with RH9 by ~10% or so - this was pre FreeBSD 5.2, but it's a bit late to migrate easily now).

I'm trying to figure out ways to squeeze another ounce or two of performance out of this machine; I've included the things I've tuned so far below.

The query processes are mostly stuck in D state, so I expect that I'm hitting some hardware limitations, but I'm only doing sub-15MB/s from the disk array (from iostat) and I know it can do in the 40-60MB/s range when we tested the raw speed, and only 400 or so tps, which is also well under the array's limits. So I suspect that it's thrashing a bit; this is also indicated by the contrast between rrqm/s (read requests merged per second), which is pushing 2000, and the actual r/s (read requests that were issued to the device) at around 400 or so (same as tps). I suspect that a lot of the time is spent reading indexes, so a "better" indexing scheme may be my best bet.

Estimating the table size
-------------------------

stats=> select relfilenode,relname from pg_class where relfilenode=37057796;
 relfilenode | relname
-------------+---------
    37057796 | tp3

du -sc 37057796* | grep total
234002372       total

However, the two indexes are also large (which may be part of the problem, and is why I'm trying just starttime for an index); they are currently in the 140-150GB range.

The query optimizer thinks I have ~290M rows (I'm not actually doing a real select count, since the last time I did that the table was around 10M rows or so and it took a long while; I don't want to wait days :).
------------------------------
stats=> explain select count(*) from tp3;
                                QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=7632998.20..7632998.20 rows=1 width=0)
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)
(2 rows)

Table def
----------------------------
stats=> \d tp3
              Table "public.tp3"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 host        | character(4)                |
 point       | character varying(64)       |
 type        | character(1)                |
 cooked      | character(1)                |
 starttime   | timestamp without time zone |
 intervallen | interval                    |
 arrivetime  | timestamp without time zone |
 pvalue      | numeric                     |
Indexes:
    "tp3_host_starttime" btree (host, starttime, cooked)
    "tp3_point_starttime" btree (point, starttime, cooked)

Sample data mining query:
----------------------------
select point, avg(pvalue) as avg
  from tp3
 where host in ('node', 'node', ....)
   and starttime between 'timestamp' and 'timestamp'
 group by point

Tuning done so far:
----------------------------
$ cat /etc/sysctl.conf

kernel.shmall=805306368
kernel.shmmax=805306368

$ egrep -v "^#|^$" postgresql.conf

shared_buffers = 60800
sort_mem = 1286720          # min 64, size in KB
vacuum_mem = 102400         # min 1024, size in KB
fsync=false                 # Play fast and loose - whee
max_files_per_process = 1000
wal_buffers = 16
checkpoint_segments = 20
checkpoint_timeout = 100
effective_cache_size = 160000
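For illustration, the extra-index experiment described above might look something like this (a sketch only - the index names are invented, and building either index on a table this size will itself take a long time and a fair amount of disk space):

----------------------------
-- starttime-only first, host-only later; compare plans with EXPLAIN
CREATE INDEX tp3_starttime ON tp3 (starttime);
CREATE INDEX tp3_host ON tp3 (host);

EXPLAIN
select point, avg(pvalue) as avg
  from tp3
 where host in ('node1', 'node2')
   and starttime between '2004-03-27 00:00:00' and '2004-03-28 00:00:00'
 group by point;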
The issue here might be just organizing the data differently. Or getting an Opteron server with 16GB RAM :-)

Based on the strength of the developers' recommendations in this newsgroup, we recently upgraded to a dual Opteron 2GHz with 16GB RAM and 15K hard drives. We set shared_buffers to 40,000 (about 320MB of RAM), and the difference is amazing. Just having the OS handle the caching has made all the difference. You can actually see lots of blocks getting cached by the OS (RH Linux Enterprise in our case). In most cases, tables with millions of records would get entirely cached in RAM, and within a few minutes there would be no disk access whatsoever for selects.

Based on the queries you run, is it possible to split up the schema into different tables? Are the differences between timestamps in the sample query usually small?

We had a similar problem, although with a slightly smaller data set -- but one that was going to keep growing. Our questions were: how could we scale? What about vacuuming our tables, running analyze in a decent time? Backing up? And so on. We found that most of the queries we wanted were in the domain of a day. So we actually split up that giant table and made one for each day. We could have done it one for each week as well, but the daily tables worked well for us. Sure, it's a bit more work getting data over a long time period, but those common queries were a cinch.

We've also seen that in cases where we have to dump in thousands of records every few minutes, select queries respond remarkably faster when frequent (one or two every hour) ANALYZEs are done, even on those daily tables which contain just a few million records each.

Tweaking the hardware IMHO would probably take you just a little further, but you gotta think about what your response times will be in another month based on your growth now. Can your schema stand it then?

- Ericson Smith

Mooney, Ryan wrote:
> [original post quoted in full - snipped]
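For illustration of the split-by-day approach Ericson describes (a sketch only, not from the thread - the table and view names are invented; on 7.4 this is just plain tables plus a UNION ALL view, with the loaders pointed at the current day's table):

----------------------------
-- One small table per day, each with its own (much smaller) index
CREATE TABLE tp3_2004_03_29 AS SELECT * FROM tp3 WHERE false;
CREATE TABLE tp3_2004_03_30 AS SELECT * FROM tp3 WHERE false;
CREATE INDEX tp3_2004_03_30_start_host
    ON tp3_2004_03_30 (starttime, host);

-- Day-sized queries hit one small table; longer ranges can go through
-- a view, at the cost of scanning every daily table listed in it
CREATE VIEW tp3_all AS
    SELECT * FROM tp3_2004_03_29
    UNION ALL
    SELECT * FROM tp3_2004_03_30;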
On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:

>I have a single table that just went over 234GB in size with about 290M+
>rows.

That would mean ~800 bytes/row which, given your schema, is hard to believe unless there are lots of dead tuples lying around.

>queries use the indexes fairly well, although I suspect that the order
>of host/starttime is suboptimal (fewer hosts than starttimes, and the
>table is naturally in starttime order). I'm going to try adding an
>index on just starttime (and later just host) and see if I can tune the
>queries on that more.

Yes, if you are ready to switch OS for a 10% performance gain, getting your indices right should be no question.

> I never delete rows from the table, only do
>inserts (up to around 11,000/minute, mostly in one big burst every
>minute; this is anticipated to go up some over time).

How often do you ANALYSE?

Have there been DELETEs or UPDATEs or aborted transactions in the past? Did you VACUUM or VACUUM FULL since then?

> I'm only doing sub-15MB/s from the disk
>array (from iostat) and I know it can do in the 40-60MB/s range when we
>tested the raw speed,

Sounds plausible for nonsequential I/O.

>However, the two indexes are also large (which may be part of the
>problem, and is why I'm trying just starttime for an index); they are
>currently in the 140-150GB range.

This would be extreme index bloat which is only possible after massive DELETEs/UPDATEs.

>stats=> explain select count(*) from tp3;
>   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)

The planner thinks that the table size is 4M pages, 32GB. The average tuple size of ~110 bytes (including tuple header) suits your schema quite nicely.

>              Table "public.tp3"
>   Column    |            Type             | Modifiers
>-------------+-----------------------------+-----------
> host        | character(4)                |
> point       | character varying(64)       |
> type        | character(1)                |
> cooked      | character(1)                |
> starttime   | timestamp without time zone |
> intervallen | interval                    |
> arrivetime  | timestamp without time zone |
> pvalue      | numeric                     |
>Indexes:
>    "tp3_host_starttime" btree (host, starttime, cooked)
>    "tp3_point_starttime" btree (point, starttime, cooked)

In my experience any reduction in average tuple size results directly in a proportional increase of throughput for large tables. So here are some random thoughts:

You said there are only a few hosts. So moving the hosts into a separate table with an integer primary key would save 4 bytes per row.

Datatype "char" (with quotes) needs only 1 byte, char(1) needs 5 bytes, both before padding. Changing type and cooked from char(1) to "char" would save 12 bytes.

And if you want to push it, you change hostid to smallint and rearrange the fields, saving 4 more padding bytes:
	hostid | smallint
	type   | "char"
	cooked | "char"

What about point? If there is a known small number of different values, move it into its own table.

I'm not sure about the storage needs of numeric, might be at least 8 bytes. Consider using bigint. Someone please correct me if I'm wrong.

Did you CREATE TABLE tp3 (...) WITHOUT OIDS?

>Sample data mining query:
>----------------------------
>select point, avg(pvalue) as avg from tp3 where host in ('node', 'node',
>....) and starttime between 'timestamp' and 'timestamp' group by point

Show us EXPLAIN ANALYSE, please.

>shared_buffers = 60800

Looks a bit large to me. But if your tests have shown it to be the best value, it should be ok.

>sort_mem = 1286720 # min 64, size in KB

This is more than 1GB, I think this is too high.
>fsync=false # Play fast and loose - whee

How much did this help?

>effective_cache_size = 160000

Try more, say 320000 or even 400000.

Servus
 Manfred
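A sketch of the slimmed-down layout Manfred is suggesting (the table and column names are invented here, the exact savings depend on alignment and padding, and whether bigint is acceptable for pvalue depends on the data):

----------------------------
-- Hosts factored out into a small lookup table
CREATE TABLE host_lookup (
    hostid   smallint PRIMARY KEY,
    hostname character(4) NOT NULL UNIQUE
);

-- Narrow fact table: small fixed-width columns first, no OIDs
CREATE TABLE tp3_slim (
    hostid      smallint,   -- joins to host_lookup.hostid
    type        "char",
    cooked      "char",
    starttime   timestamp without time zone,
    intervallen interval,
    arrivetime  timestamp without time zone,
    point       character varying(64),
    pvalue      bigint      -- per the suggestion above; only if the values fit
) WITHOUT OIDS;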
Oops,

 PostgreSQL 7.4beta2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
 20030222 (Red Hat Linux 3.2.2-5)

Yeah, I know I need to upgrade to the release version...

> -----Original Message-----
> From: Sean Chittenden [mailto:sean@chittenden.org]
> Sent: Tuesday, March 30, 2004 9:04 PM
> To: Mooney, Ryan
> Subject: Re: [GENERAL] Large DB
>
> > However, the two indexes are also large (which may be part of the
> > problem, and is why I'm trying just starttime for an index); they are
> > currently in the 140-150GB range.
>
> echo 'SELECT version();' | psql template1
>
> What version of PostgreSQL are you using? :) -sc
>
> --
> Sean Chittenden
Thanks for the excellent feedback (all)! Good point on the excess bytes/row, not sure how to explain that. There have never been any deletes or updates on this table and all inserts just simple inserts (no transactions or anything funky) so there shouldn't be many aborted transactions (I could see the daemons that do the inserts dying part way through a few times, but nothing to explain the variance). I haven't run ANALYZE on this table in a while. After about 50-60M rows it didn't seem to change the query plan at all and since there were never any deletes/updates it seemed like it wasn't making much/any difference (should have been no pages to reclaim). That may be an invalid assumption though. I'll try the other suggestions over the next couple of days and see how it goes. Thanks again. Here is an explain on the query: => explain select point, avg(pvalue) as avg from tp3 where host in ('m563', 'm562', 'm561', 'm560', 'm559', 'm558', 'm557', 'm538', 'm755', 'm754', 'm753', 'm752', 'm751', 'm750', 'm749', 'm748') and starttime between '2004-03-27 07:37:43' and '2004-03-30 07:40:08' group by point; HashAggregate (cost=96.90..96.90 rows=1 width=25) -> Index Scan using tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..96.90 rows=1 width=25) Index Cond: (((host = 'm563'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm562'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm561'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm560'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm559'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm558'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm557'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm538'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm755'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm754'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm753'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm752'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 
07:40:08'::timestamp without time zone)) OR ((host = 'm751'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm750'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm749'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm748'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone))) Filter: (((host = 'm563'::bpchar) OR (host = 'm562'::bpchar) OR (host = 'm561'::bpchar) OR (host = 'm560'::bpchar) OR (host = 'm559'::bpchar) OR (host = 'm558'::bpchar) OR (host = 'm557'::bpchar) OR (host = 'm538'::bpchar) OR (host = 'm755'::bpchar) OR (host = 'm754'::bpchar) OR (host = 'm753'::bpchar) OR (host = 'm752'::bpchar) OR (host = 'm751'::bpchar) OR (host = 'm750'::bpchar) OR (host = 'm749'::bpchar) OR (host = 'm748'::bpchar)) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) (4 rows) > -----Original Message----- > From: Manfred Koizar [mailto:mkoi-pg@aon.at] > Sent: Wednesday, March 31, 2004 1:18 AM > To: Mooney, Ryan > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Large DB > > > On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan" > <ryan.mooney@pnl.gov> > wrote: > >I have a single table that just went over 234GB in size with about > >290M+ rows. > > That would mean ~ 800 bytes/row which, given your schema, is > hard to believe unless there are lots of dead tuples lying around. > > >queries use the indexes fairly well, although I suspect that > the order > >of host/starttime is suboptimal (fewer hosts than starttime, and the > >table is naturally in starttime order). I'm going to try adding an > >index on just starttime (and later just host) and see if I > can tune the > >queries on that more. > > Yes, if you are ready to switch OS for a 10% performance > gain, getting your indices right should be no question. > > > I never delete rows from the table, only do > >inserts (up to around 11,000/minute mostly in one big burst every > >minute, this is anticipated to go up some over time). > > How often do you ANALYSE? > > Have there been DELETEs or UPDATEs or aborted transactions in > the past? Did you VACUUM or VACUUM FULL since then? > > > I'm only doing sub 15MB from the disk > >array (from iostat) and I know it can do in the 40-60MB > range when we > >tested the raw speed, > > Sounds plausible for nonsequential I/O. > > >However the two indexes are also - large (which may be part of the > >problem, which is why I'm trying just starttime for an > index; They are > >currently in the 140-150G range). > > This would be extreme index bloat which is only possible > after massive DELETEs/UPDATEs. > > >stats=> explain select count(*) from tp3; > > -> Seq Scan on tp3 (cost=0.00..6906493.16 > rows=290602016 width=0) > > The planner thinks that the table size is 4M pages, 32GB. > The average tuple size of ~110 bytes (including tuple header) > suits your schema quite nicely. 
On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:
>I haven't run ANALYZE on this table in a while. After about 50-60M rows
>it didn't seem to change the query plan at all and since there were
>never any deletes/updates it seemed like it wasn't making much/any
>difference (should have been no pages to reclaim).

Reclaiming pages is not the job of ANALYSE, VACUUM does this.

> That may be an
>invalid assumption though.

Might be a valid assumption as well -- if you're lucky. But do you want to depend on luck? E.g. 75% of today's rows contain timestamps that are greater than what the planner believes to be the maximum.

No VACUUM, no ANALYSE, no REINDEX. This explains why the planner thinks there are only 4M pages, which gives 640 bytes/row if there were 50M rows at that time. OTOH the EXPLAIN shows 290M rows for the seq scan. Something doesn't fit together here.

Hackers, what could update reltuples, but not relpages?

Or, Ryan, is it possible that you already had 290M rows when you ran ANALYSE and you have more than 1G rows today?

BTW, ANALYSE is basically a constant time operation.

>Here is an explain on the query:
>
>=> explain select point, avg(pvalue) as avg from tp3 where host in

This tells us one half of the story. EXPLAIN ANALYSE SELECT ... would tell us the other half, too.

Servus
 Manfred
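One way to check whether the planner's picture of tp3 has gone stale in the way described above (a sketch; pg_class and pg_stats are the standard catalogs, but note that the max() at the end is itself a full scan on 7.4):

----------------------------
-- What the planner currently believes about the table
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tp3';

-- The newest starttime the statistics know about vs. reality;
-- if max(starttime) is far past the last histogram bound, the
-- statistics are stale
SELECT histogram_bounds
  FROM pg_stats
 WHERE tablename = 'tp3' AND attname = 'starttime';

SELECT max(starttime) FROM tp3;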
Manfred Koizar <mkoi-pg@aon.at> writes:
> Hackers, what could update reltuples, but not relpages?

Nothing --- they are always updated together. One possibility is that the 4M pages and 290M rows numbers really do go together (for about 112 bytes/row) and that the table has since grown, or perhaps merely bloated due to lack of vacuuming of updated rows.

A different line of thought is that they were updated together, but the relpages estimate was accurate while reltuples was not. ANALYZE knows the actual table size in pages (because it asks the OS) but reltuples is extrapolated from an average of the number of live tuples on the pages ANALYZE looks at. It is possible for ANALYZE to be fooled badly if, for instance, there are lots and lots of dead rows near the start of the table. (Lack of regular vacuuming would certainly improve the odds of this happening...)

Note that VACUUM is not subject to this error because it has to grovel over every page anyway. So either "VACUUM" or "VACUUM ANALYZE" will give you a known-good reltuples, it's only standalone "ANALYZE" that has a risk of estimation error.

			regards, tom lane
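To make the distinction concrete (a sketch only): a standalone ANALYZE extrapolates reltuples from a small sample, while VACUUM ANALYZE derives it from an actual pass over every page - much slower on a 234GB table, but trustworthy.

----------------------------
ANALYZE tp3;          -- samples a few thousand rows; reltuples is an estimate
VACUUM ANALYZE tp3;   -- visits every page; reltuples is exact, but slow

SELECT relname, relpages, reltuples
  FROM pg_class
 WHERE relname = 'tp3';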
On Thu, 01 Apr 2004 12:22:58 +0200, I wrote:
>BTW, ANALYSE is basically a constant time operation.

On closer inspection, this is not the whole truth. ANALY[SZ]E is a two stage process: First it collects a sample of rows, then these rows are examined to produce various statistics.

The cost of the latter depends on the sample size, which itself depends on the default or column-specific statistics target, and the number (and types) of columns, so it *should* take more or less constant time.

The first step, however, (acquire_sample_rows() in analyze.c) has to read more rows than finally end up in the sample. It visits less than O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which grow like O(ln(n)).

I have an idea how this could be done with O(1) page reads. If I'm able to translate it into C, I'll send a patch ...

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> The first step, however, (acquire_sample_rows() in analyze.c) has to
> read more rows than finally end up in the sample. It visits less than
> O(nblocks) pages but certainly more than O(1).

> A vague feeling tries to tell me that the number of page reads is
> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
> grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from a table of size N is O(n * (1 + log(N/n))).

> I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N in advance. We do however know the table size in blocks, so if you're willing to make assumptions about constant tuple density you could do something different. (But the tuple density assumption is exactly the weak spot of what we've got, so I'm unconvinced that would be a big step forward.)

			regards, tom lane
[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Manfred Koizar <mkoi-pg@aon.at> writes:
>> The first step, however, (acquire_sample_rows() in analyze.c) has to
>> read more rows than finally end up in the sample. It visits less than
>> O(nblocks) pages but certainly more than O(1).
>
>> A vague feeling tries to tell me that the number of page reads is
>> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
>> grow like O(ln(n)).
>
>Good guess. Vitter's paper says the expected time to sample n rows from
>a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong. I don't doubt that Vitter's formula is correct, but it assumes that access to any tuple has the same cost. This does not apply to our problem, however.

With 100 tuples per page, we access the first sample_size tuples at a cost of 0.01 sequential page reads per tuple. Later we use less and less tuples per page which results in higher per-tuple cost. Near the end of a large relation we can expect to access only one tuple per page and more and more pages are skipped, so that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample size of 3000) I got:

 rel size | page reads
----------+-----------
       30 |    30
      300 |   300      expectation is something like 299.9995
      500 |   499
       1K |   990
       3K |  2.6K
      30K |    8K
     100K |   12K
       1M |   19K
      10M |   26K
     100M |   33K

This growth rate is steeper than O(log(nblocks)).

>> I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of "Double Vitter" algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks.

I have to think harder though, what to do about those 400 pages that are not accessed when the sample size is 3000 ...

>The hard part is getting a genuinely random sample when we don't know N
>in advance. We do however know the table size in blocks, so if you're
>willing to make assumptions about constant tuple density you could do
>something different. (But the tuple density assumption is exactly the
>weak spot of what we've got, so I'm unconvinced that would be a big step
>forward.)

Starting the scan at some random blocks should help against the common case of unusual distribution of dead tuples near the start of the relation. And I plan to factor information about dead tuple hits into an increasingly better estimation of dead/live tuple ratio.

Servus
 Manfred
> On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan"
> <ryan.mooney@pnl.gov> wrote:
> >I haven't run ANALYZE on this table in a while. After about 50-60M
> >rows it didn't seem to change the query plan at all and since there
> >were never any deletes/updates it seemed like it wasn't making much/any
> >difference (should have been no pages to reclaim).
>
> Reclaiming pages is not the job of ANALYSE, VACUUM does this.

Sorry, yes, that's what I meant - lack of sleep :>

> Might be a valid assumption as well -- if you're lucky. But
> do you want to depend on luck? E.g. 75% of today's rows
> contain timestamps that are greater than what the planner
> believes to be the maximum.

Point taken. See vacuum info below...

> No VACUUM, no ANALYSE, no REINDEX. This explains why the
> planner thinks there are only 4M pages, which gives 640
> bytes/row if there were 50M rows at that time. OTOH the
> EXPLAIN shows 290M rows for the seq scan. Something doesn't
> fit together here.

Ok, so I ran a vacuum analyse. It took ~1.7 days to finish. I'm running another one now to see how it does. I had turned off data updates during the vacuum below; I've turned them back on for the current run to get some idea of how it's going to work ongoing. This means the machine was basically idle during the vacuum, so that's a pretty good picture of how long just the vacuum took.

The invalid page block was caused when I tried the 2.6 kernel (for other reasons than DB performance); it's been there for a while, and I can deal w/ the data loss, so I just turned on zero_damaged_pages = true (a nice feature would be to be able to set this per table - I didn't see any way to do that). I did try doing the "select ctid from tp3 offset 10257032 limit 20;" and deleting rows with those ctid's, but even after deleting several hundred rows it was still complaining, so I gave up and moved on...

INFO:  vacuuming "public.tp3"
INFO:  index "tp3_host_starttime" now contains 1401476187 tuples in 19778677 pages
DETAIL:  11657230 index tuples were removed.
108729 index pages have been deleted, 1 are currently reusable.
CPU 892.96s/602.59u sec elapsed 149710.57 sec.
ERROR:  invalid page header in block 10257032 of "tp3_point_starttime"

I had thought that there had never been any deletes/updates, but my co-worker pointed out that we did do some updates on a couple million rows a couple months back, so that explains the removed index tuples.

Here is the explain analyse; you can see why I think that an index on just host might be better - hosts are a small set, starttime is a large set, so the index should be more efficient. At the very least it should be (starttime, host), not (host, starttime), unless the indexing engine is smart enough to make that not matter (I suspect it's not???).
Db => explain analyse select point, avg(pvalue) as avg from tp3 where host in ('m480', 'm456', 'm455', 'm453', 'm452', 'm451', 'm450', 'm449', 'm368', 'm367', 'm366', 'm365', 'm364', 'm363', 'm362', 'm361', 'm783', 'm782', 'm781', 'm779', 'm778', 'm776', 'm774', 'm739', 'm738', 'm737', 'm736', 'm735', 'm732', 'm731', 'm730', 'm727') and starttime between '2004-03-30 07:28:25' and '2004-03-30 08:37:33' group by point; HashAggregate (cost=195.08..195.08 rows=1 width=25) (actual time=98667.32..98667.37 rows=12 loops=1) -> Index Scan using tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43 rows=206238 loops=1) Index Cond: (((host = 'm480'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm456'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm455'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm453'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm452'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm451'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm450'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm449'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm368'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm367'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm366'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm365'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm364'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm363'::bpchar) AND (starttime >= '2004-03-30 
07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm362'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm361'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm783'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm782'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm781'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm779'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm778'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm776'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm774'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm739'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm738'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm737'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm736'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm735'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm732'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm731'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm730'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm727'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone))) Filter: (((host = 'm480'::bpchar) OR (host = 'm456'::bpchar) OR (host = 'm455'::bpchar) OR (host = 'm453'::bpchar) OR (host = 'm452'::bpchar) OR (host = 'm451'::bpchar) OR (host = 'm450'::bpchar) OR (host = 'm449'::bpchar) OR (host = 'm368'::bpchar) OR (host = 'm367'::bpchar) OR (host = 'm366'::bpchar) OR (host = 'm365'::bpchar) OR (host = 'm364'::bpchar) OR (host = 'm363'::bpchar) OR (host = 
'm362'::bpchar) OR (host = 'm361'::bpchar) OR (host = 'm783'::bpchar) OR (host = 'm782'::bpchar) OR (host = 'm781'::bpchar) OR (host = 'm779'::bpchar) OR (host = 'm778'::bpchar) OR (host = 'm776'::bpchar) OR (host = 'm774'::bpchar) OR (host = 'm739'::bpchar) OR (host = 'm738'::bpchar) OR (host = 'm737'::bpchar) OR (host = 'm736'::bpchar) OR (host = 'm735'::bpchar) OR (host = 'm732'::bpchar) OR (host = 'm731'::bpchar) OR (host = 'm730'::bpchar) OR (host = 'm727'::bpchar)) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone))
 Total runtime: 98760.26 msec
(5 rows)

> Hackers, what could update reltuples, but not relpages?
>
> Or, Ryan, is it possible that you already had 290M rows when
> you ran ANALYSE and you have more than 1G rows today?

Yes, appears to be the case...

> BTW, ANALYSE is basically a constant time operation.

Having followed the ongoing discussion about this I can concur that it is definitely NOT O(1). Unfortunately I didn't track the "time to vacuum" as the table grew. However, I believe that I'm going to follow the suggestions about reducing the table size, so I'll have a brand new DB to play with in a couple weeks; knowing what I know now, I can track that if anyone's interested in the data besides me :)

> Servus
>  Manfred
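Since the error reported above is in an index page rather than in the heap, the simplest recovery (also suggested later in the thread) is to rebuild that index - a sketch, keeping in mind that an index build on a table this size runs for hours and blocks writes while it runs:

----------------------------
DROP INDEX tp3_point_starttime;
CREATE INDEX tp3_point_starttime ON tp3 (point, starttime, cooked);
-- (REINDEX INDEX tp3_point_starttime is the other option)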
On Sat, 03 Apr 2004 22:39:31 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:
>Ok, so I ran a vacuum analyse. It took ~1.7 days to finish.

Just to make it clear: VACUUM and ANALYSE are two different commands.

VACUUM is for cleaning up. It has to visit every tuple in every page, and if there are dead row versions, it also has to scan all indices belonging to the table. If there are lots of deleted rows and vacuum_mem is too small, VACUUM has to scan each index several times to delete all index entries pointing to dead rows. This might raise the cost to even more than O(tuplecount).

ANALYSE collects a random sample of the rows in the table; the sample size depends on default_statistics_target and the maximum value you have set any column's statistics target to (ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ...). If you didn't touch either, the sample size is 3000 rows. Then these 3000 rows are sorted and counted in different ways to generate statistics.

The number of pages that have to be touched to collect the sample depends on the table size, but it does by far not grow proportionally to the number of pages, nblocks. The cost growth rate is greater than O(ln(nblocks)) and significantly less than O(nblocks). I have no simple formula for it, but I estimate that analysing your tp3 table would need between 28000 and 30000 page reads, which should be doable in a few minutes.

VACUUM ANALYSE is just VACUUM followed by ANALYSE, with the added benefit that the number of rows does not have to be estimated by ANALYSE, because VACUUM knows the exact value.

>The invalid page block was caused when I tried the 2.6 kernel
>(for other reasons than DB performance); it's been there for a while,
>and I can deal w/ the data loss

>ERROR: invalid page header in block 10257032 of "tp3_point_starttime"

AFAICS the invalid page is in an index, so there is no data loss. You could simply drop and re-create that index. That might take some time, though :-(

>Here is the explain analyse; you can see why I think that an index on
>just host might be better - hosts are a small set, starttime is a large
>set, so the index should be more efficient.

I think you got that backwards. If there are not many hosts, then an index on host is not very selective, IOW you get a lot of hits when you look for a particular host. OTOH if you select a sufficiently small starttime interval, you get only a few rows, so using an index is most efficient.

>At the very least it should be (starttime, host), not (host,
>starttime), unless the indexing engine is smart enough to make that not
>matter (I suspect it's not???).

Yes, it should be (starttime, host). And no, PG is generally not smart enough to use an index if there is no condition on the first index column.

>  ->  Index Scan using tp3_host_starttime, tp3_host_starttime,
>      [...], tp3_host_starttime on tp3
>      (cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43
>      rows=206238 loops=1)

Estimated number of rows: 1
Actual number of rows: 206238
The planner is way off here. Fortunately your huge number of rows makes it rule out every other (most probably slower) plan.

How many hosts are there? Even if there are a few hundred, an index scan with that condition has to access and skip millions of index tuples. An index on (starttime, host) would visit fewer index tuples, and would more likely access the heap tuples in physical order.

>Having followed the ongoing discussion about this I can concur that it is
>definitely NOT O(1).
>Unfortunately I didn't track the "time to vacuum"

The discussion on -hackers and the patch I posted earlier today are about ANALYSE, not VACUUM.

>However, I believe that I'm going to follow the suggestions
>about reducing the table size, so I'll have a brand new DB to play with
>in a couple weeks;

Hopefully we'll see a success story here.

>knowing what I know now, I can track that if anyone's
>interested in the data besides me :)

VACUUM and ANALYSE times? Certainly.

Servus
 Manfred
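A sketch of the reordered index Manfred recommends (the index name is invented; as with any index build on this table it will run for a long time and block inserts while it builds):

----------------------------
CREATE INDEX tp3_starttime_host ON tp3 (starttime, host, cooked);

-- With starttime leading, a narrow time window prunes most of the
-- index before the host list is even considered
EXPLAIN ANALYZE
SELECT point, avg(pvalue) AS avg
  FROM tp3
 WHERE starttime BETWEEN '2004-03-30 07:28:25' AND '2004-03-30 08:37:33'
   AND host IN ('m480', 'm456', 'm455')
 GROUP BY point;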
I've been following this thread with interest since it started, and it really seems that there is just too much data in that single table. When it comes down to it, making smaller separate tables seems to be the way to go. Querying will be a little harder, but much faster.

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+---------------------------------+
| http://www.did-it.com | "When you have to shoot, shoot, |
| eric@did-it.com       | don't talk! - Tuco              |
| 516-255-0500          |                                 |
+-----------------------+---------------------------------+

Manfred Koizar wrote:
> [previous message quoted in full - snipped]
I'm a fairly new PostgreSQL user. And a long-time Oracle user, so keep that in mind.

So you are telling me that for 206238 rows on a table the best route is to break the table into separate tables? To me that is horrible to have to do unless you really wanted to do that for some reason. And unless the count is way higher than I read, how is 206238 a large amount of rows on a table?

Ericson Smith <eric@did-it.com> wrote:
> I've been following this thread with interest since it started, and it
> really seems that there is just too much data in that single table. When
> it comes down to it, making smaller separate tables seems to be the way
> to go. Querying will be a little harder, but much faster.
>
> [rest of quoted thread snipped]
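Likewise, the rows=1 vs. rows=206238 mismatch in the plan suggests the statistics are stale or too coarse. A minimal, illustrative follow-up (the statistics target of 100 and the host/timestamp values are arbitrary examples, not from the thread):

-- Sample these columns harder, then re-analyse just this table
-- (far cheaper than the 1.7-day VACUUM ANALYSE mentioned above):
ALTER TABLE tp3 ALTER COLUMN host      SET STATISTICS 100;
ALTER TABLE tp3 ALTER COLUMN starttime SET STATISTICS 100;
ANALYZE tp3;

-- Then compare estimated vs. actual rows again:
EXPLAIN ANALYZE
SELECT count(*) FROM tp3
WHERE host = 'n001'
  AND starttime BETWEEN '2004-04-01' AND '2004-04-02';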
Actually, in the case referenced I have over 1.4 billion rows; the 206238-row example should be no problem. I had no performance issues whatsoever (relative to realistic disk I/O bandwidth, etc.) in the hundreds of millions of rows, even with sub-optimal indexes. My average query has to reference 2-30 million rows, so yes, that can take up to a few minutes, but even dd'ing that amount of data to/from the disk array took almost as long (that's what I used as my base benchmark; if my query approaches that, I don't see any room to complain :>).

I could only really see 206K rows being a major issue IF they were really large rows AND you were doing really complex queries AND the indexes were really poorly defined, or you were on REALLY slow hardware. Unfortunately I didn't follow that thread as closely, so I don't remember exactly what the issues expressed there were.

> -----Original Message-----
> From: Bob.Henkel@hartfordlife.com
> Sent: Tuesday, April 06, 2004 2:17 PM
> To: Ericson Smith
> Cc: Manfred Koizar; pgsql-general@postgresql.org; Mooney, Ryan
> Subject: Re: [GENERAL] Large DB
>
> So you are telling me that for 206238 rows in a table the best route is
> to break the table into separate tables? To me that is a horrible thing
> to have to do unless you really wanted to do it for some other reason.
> [...]
I don't think that's what's being said; I run tables with tens of millions of rows without issue.

Gavin

Bob.Henkel@hartfordlife.com wrote:

>I'm a fairly new PostgreSQL user, and a long-time Oracle user, so keep
>that in mind.
>
>So you are telling me that for 206238 rows in a table the best route is
>to break the table into separate tables? To me that is a horrible thing
>to have to do unless you really wanted to do it for some other reason.
>[...]
My point was that if multi-minute query response times are OK, then you can continue as you are. But if the row sizes are really large, or you have lots of text data, or there are tons of updates, and you want sub-second response times, then I see no option but to beef up hardware dramatically or, indeed, to split the data into smaller tables that you can still logically query. That's what we were forced to do, and trust me, we looked at our schemas for over a year before deciding to do it. It paid off handsomely. It's not such a horrible alternative as you might think.

- Ericson Smith

Gavin M. Roy wrote:

> I don't think that's what's being said; I run tables with tens of
> millions of rows without issue.
>
> Gavin
>
> Bob.Henkel@hartfordlife.com wrote:
>
>> So you are telling me that for 206238 rows in a table the best route
>> is to break the table into separate tables?
>> [...]
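For what "smaller separate tables that you can still logically query" could look like in practice, here is a minimal sketch using plain table inheritance, the only mechanism available at the time (constraint exclusion came later, so queries against the parent still visit every child). The child table names, the one-table-per-month split, and the sample values are invented for illustration:

-- One child table per month, inheriting the tp3 schema:
CREATE TABLE tp3_2004_03 () INHERITS (tp3);
CREATE TABLE tp3_2004_04 () INHERITS (tp3);

-- Each child carries its own, much smaller index:
CREATE INDEX tp3_2004_04_starttime_host ON tp3_2004_04 (starttime, host);

-- Loaders insert into the child for the current month
-- (real rows would of course populate the remaining columns too):
INSERT INTO tp3_2004_04 (host, starttime) VALUES ('n001', '2004-04-06 08:00:00');

-- Queries that know their time range hit one small child directly, while
-- SELECT ... FROM tp3 still sees all children when the full history is needed:
SELECT count(*)
FROM tp3_2004_04
WHERE host IN ('n001', 'n002')
  AND starttime BETWEEN '2004-04-01' AND '2004-04-07';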