Thread: Large DB
Hello,

I have a single table that just went over 234GB in size with about 290M+ rows. I think that I'm starting to approach some limits, since things have gotten quite a bit slower over the last couple of days. The table is really simple and I'm mostly doing simple data mining queries like the query included below. These have gone from taking under a few minutes to taking tens of minutes over the last week or so (a query like the one below would generally return 1-6 million rows from the base table). The queries use the indexes fairly well, although I suspect that the order of host/starttime is suboptimal (fewer hosts than starttimes, and the table is naturally in starttime order). I'm going to try adding an index on just starttime (and later just host) and see if I can tune the queries on that more. I never delete rows from the table, only do inserts (up to around 11,000/minute, mostly in one big burst every minute; this is anticipated to go up some over time). There are about 32 processes doing the inserts (on the same machine - yeah, I know it'd be happier if they moved); I think it might help if there was only one, but for architectural reasons that won't happen for a while.

This is on a dual 3GHz Xeon with 4GB RAM and an IDE-SCSI RAID array (ACNC). I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd tried FreeBSD 4/5 early on, but the insert speeds were actually better with RH9 by ~10% or so - this was pre FreeBSD 5.2, but it's a bit late to migrate easily now).

I'm trying to figure out ways to squeeze another ounce or two of performance out of this machine; I've included the things I've tuned so far below.

The query processes are mostly stuck in D state, so I expect that I'm hitting some hardware limitations, but I'm only doing sub-15MB/s from the disk array (from iostat) and I know it can do in the 40-60MB/s range when we tested the raw speed, and only 400 or so tps, which is also well under the array's limits. So I suspect that it's thrashing a bit; this is also indicated by the contrast between rrqm/s (read requests merged per second), which is pushing 2000, and the actual r/s (read requests that were issued to the device) at around 400 or so (same as tps). I suspect that a lot of the time is spent reading indexes, so a "better" indexing scheme may be my best bet.

Estimating the table size
-------------------------

stats=> select relfilenode,relname from pg_class where relfilenode=37057796;
 relfilenode | relname
-------------+---------
    37057796 | tp3

du -sc 37057796* | grep total
234002372       total

However, the two indexes are also large (which may be part of the problem, and is why I'm trying just starttime for an index); they are currently in the 140-150GB range.

The query optimizer thinks I have ~290M rows (I'm not actually doing a real select count, since the last time I did that the table was around 10M rows or so and it took a long while; I don't want to wait days :).
------------------------------
stats=> explain select count(*) from tp3;
                                QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=7632998.20..7632998.20 rows=1 width=0)
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)
(2 rows)

Table def
----------------------------
stats=> \d tp3
              Table "public.tp3"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 host        | character(4)                |
 point       | character varying(64)       |
 type        | character(1)                |
 cooked      | character(1)                |
 starttime   | timestamp without time zone |
 intervallen | interval                    |
 arrivetime  | timestamp without time zone |
 pvalue      | numeric                     |
Indexes:
    "tp3_host_starttime" btree (host, starttime, cooked)
    "tp3_point_starttime" btree (point, starttime, cooked)

Sample data mining query:
----------------------------
select point, avg(pvalue) as avg
  from tp3
 where host in ('node', 'node', ....)
   and starttime between 'timestamp' and 'timestamp'
 group by point

Tuning done so far:
----------------------------
$ cat /etc/sysctl.conf

kernel.shmall=805306368
kernel.shmmax=805306368

$ egrep -v "^#|^$" postgresql.conf

shared_buffers = 60800
sort_mem = 1286720          # min 64, size in KB
vacuum_mem = 102400         # min 1024, size in KB
fsync=false                 # Play fast and loose - whee
max_files_per_process = 1000
wal_buffers = 16
checkpoint_segments = 20
checkpoint_timeout = 100
effective_cache_size = 160000
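For illustration, the extra-index experiment described above might look something like this (a sketch only - the index names are invented, and building either index on a table this size will itself take a long time and a fair amount of disk space):

----------------------------
-- starttime-only first, host-only later; compare plans with EXPLAIN
CREATE INDEX tp3_starttime ON tp3 (starttime);
CREATE INDEX tp3_host ON tp3 (host);

EXPLAIN
select point, avg(pvalue) as avg
  from tp3
 where host in ('node1', 'node2')
   and starttime between '2004-03-27 00:00:00' and '2004-03-28 00:00:00'
 group by point;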
The issue here might be just organizing the data differently. Or getting an Opteron server with 16GB RAM :-)

Based on the strength of the developers' recommendations in this newsgroup, we recently upgraded to a dual Opteron 2GHz with 16GB RAM and 15K hard drives. We set shared_buffers to 40,000 (about 320MB of RAM), and the difference is amazing. Just having the OS handle the caching has made all the difference. You can actually see lots of blocks getting cached by the OS (RH Linux Enterprise in our case). In most cases, tables with millions of records would get entirely cached in RAM, and within a few minutes there would be no disk access whatsoever for selects.

Based on the queries you run, is it possible to split up the schema into different tables? Are the differences between timestamps in the sample query usually small?

We had a similar problem, although with a slightly smaller data set -- but one that was going to keep growing. Our questions were: how could we scale? What about vacuuming our tables, running analyze in a decent time? Backing up? And so on. We found that most of the queries we wanted were in the domain of a day. So we actually split up that giant table and made one for each day. We could have done it one for each week as well, but the daily tables worked well for us. Sure, it's a bit more work getting data over a long time period, but those common queries were a cinch.

We've also seen that in cases where we have to dump in thousands of records every few minutes, select queries respond remarkably faster when frequent (one or two every hour) ANALYZEs are done, even on those daily tables which contain just a few million records each.

Tweaking the hardware IMHO would probably take you just a little further, but you gotta think about what your response times will be in another month based on your growth now. Can your schema stand it then?

- Ericson Smith

Mooney, Ryan wrote:
> [original post quoted in full - snipped]
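For illustration of the split-by-day approach Ericson describes (a sketch only, not from the thread - the table and view names are invented; on 7.4 this is just plain tables plus a UNION ALL view, with the loaders pointed at the current day's table):

----------------------------
-- One small table per day, each with its own (much smaller) index
CREATE TABLE tp3_2004_03_29 AS SELECT * FROM tp3 WHERE false;
CREATE TABLE tp3_2004_03_30 AS SELECT * FROM tp3 WHERE false;
CREATE INDEX tp3_2004_03_30_start_host
    ON tp3_2004_03_30 (starttime, host);

-- Day-sized queries hit one small table; longer ranges can go through
-- a view, at the cost of scanning every daily table listed in it
CREATE VIEW tp3_all AS
    SELECT * FROM tp3_2004_03_29
    UNION ALL
    SELECT * FROM tp3_2004_03_30;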
On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:

>I have a single table that just went over 234GB in size with about 290M+
>rows.

That would mean ~800 bytes/row which, given your schema, is hard to believe unless there are lots of dead tuples lying around.

>queries use the indexes fairly well, although I suspect that the order
>of host/starttime is suboptimal (fewer hosts than starttimes, and the
>table is naturally in starttime order). I'm going to try adding an
>index on just starttime (and later just host) and see if I can tune the
>queries on that more.

Yes, if you are ready to switch OS for a 10% performance gain, getting your indices right should be no question.

> I never delete rows from the table, only do
>inserts (up to around 11,000/minute, mostly in one big burst every
>minute; this is anticipated to go up some over time).

How often do you ANALYSE?

Have there been DELETEs or UPDATEs or aborted transactions in the past? Did you VACUUM or VACUUM FULL since then?

> I'm only doing sub-15MB/s from the disk
>array (from iostat) and I know it can do in the 40-60MB/s range when we
>tested the raw speed,

Sounds plausible for nonsequential I/O.

>However, the two indexes are also large (which may be part of the
>problem, and is why I'm trying just starttime for an index); they are
>currently in the 140-150GB range.

This would be extreme index bloat which is only possible after massive DELETEs/UPDATEs.

>stats=> explain select count(*) from tp3;
>   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)

The planner thinks that the table size is 4M pages, 32GB. The average tuple size of ~110 bytes (including tuple header) suits your schema quite nicely.

>              Table "public.tp3"
>   Column    |            Type             | Modifiers
>-------------+-----------------------------+-----------
> host        | character(4)                |
> point       | character varying(64)       |
> type        | character(1)                |
> cooked      | character(1)                |
> starttime   | timestamp without time zone |
> intervallen | interval                    |
> arrivetime  | timestamp without time zone |
> pvalue      | numeric                     |
>Indexes:
>    "tp3_host_starttime" btree (host, starttime, cooked)
>    "tp3_point_starttime" btree (point, starttime, cooked)

In my experience any reduction in average tuple size results directly in a proportional increase of throughput for large tables. So here are some random thoughts:

You said there are only a few hosts. So moving the hosts into a separate table with an integer primary key would save 4 bytes per row.

Datatype "char" (with quotes) needs only 1 byte, char(1) needs 5 bytes, both before padding. Changing type and cooked from char(1) to "char" would save 12 bytes.

And if you want to push it, you change hostid to smallint and rearrange the fields, saving 4 more padding bytes:
	hostid | smallint
	type   | "char"
	cooked | "char"

What about point? If there is a known small number of different values, move it into its own table.

I'm not sure about the storage needs of numeric, might be at least 8 bytes. Consider using bigint. Someone please correct me if I'm wrong.

Did you CREATE TABLE tp3 (...) WITHOUT OIDS?

>Sample data mining query:
>----------------------------
>select point, avg(pvalue) as avg from tp3 where host in ('node', 'node',
>....) and starttime between 'timestamp' and 'timestamp' group by point

Show us EXPLAIN ANALYSE, please.

>shared_buffers = 60800

Looks a bit large to me. But if your tests have shown it to be the best value, it should be ok.

>sort_mem = 1286720 # min 64, size in KB

This is more than 1GB, I think this is too high.
>fsync=false # Play fast and loose - whee

How much did this help?

>effective_cache_size = 160000

Try more, say 320000 or even 400000.

Servus
 Manfred
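A sketch of the slimmed-down layout Manfred is suggesting (the table and column names are invented here, the exact savings depend on alignment and padding, and whether bigint is acceptable for pvalue depends on the data):

----------------------------
-- Hosts factored out into a small lookup table
CREATE TABLE host_lookup (
    hostid   smallint PRIMARY KEY,
    hostname character(4) NOT NULL UNIQUE
);

-- Narrow fact table: small fixed-width columns first, no OIDs
CREATE TABLE tp3_slim (
    hostid      smallint,   -- joins to host_lookup.hostid
    type        "char",
    cooked      "char",
    starttime   timestamp without time zone,
    intervallen interval,
    arrivetime  timestamp without time zone,
    point       character varying(64),
    pvalue      bigint      -- per the suggestion above; only if the values fit
) WITHOUT OIDS;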
Oops,

 PostgreSQL 7.4beta2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
 20030222 (Red Hat Linux 3.2.2-5)

Yeah, I know I need to upgrade to the release version...

> -----Original Message-----
> From: Sean Chittenden [mailto:sean@chittenden.org]
> Sent: Tuesday, March 30, 2004 9:04 PM
> To: Mooney, Ryan
> Subject: Re: [GENERAL] Large DB
>
> > However, the two indexes are also large (which may be part of the
> > problem, and is why I'm trying just starttime for an index); they are
> > currently in the 140-150GB range.
>
> echo 'SELECT version();' | psql template1
>
> What version of PostgreSQL are you using? :) -sc
>
> --
> Sean Chittenden
Thanks for the excellent feedback (all)! Good point on the excess bytes/row, not sure how to explain that. There have never been any deletes or updates on this table and all inserts just simple inserts (no transactions or anything funky) so there shouldn't be many aborted transactions (I could see the daemons that do the inserts dying part way through a few times, but nothing to explain the variance). I haven't run ANALYZE on this table in a while. After about 50-60M rows it didn't seem to change the query plan at all and since there were never any deletes/updates it seemed like it wasn't making much/any difference (should have been no pages to reclaim). That may be an invalid assumption though. I'll try the other suggestions over the next couple of days and see how it goes. Thanks again. Here is an explain on the query: => explain select point, avg(pvalue) as avg from tp3 where host in ('m563', 'm562', 'm561', 'm560', 'm559', 'm558', 'm557', 'm538', 'm755', 'm754', 'm753', 'm752', 'm751', 'm750', 'm749', 'm748') and starttime between '2004-03-27 07:37:43' and '2004-03-30 07:40:08' group by point; HashAggregate (cost=96.90..96.90 rows=1 width=25) -> Index Scan using tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..96.90 rows=1 width=25) Index Cond: (((host = 'm563'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm562'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm561'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm560'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm559'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm558'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm557'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm538'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm755'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm754'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm753'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm752'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 
07:40:08'::timestamp without time zone)) OR ((host = 'm751'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm750'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm749'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) OR ((host = 'm748'::bpchar) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone))) Filter: (((host = 'm563'::bpchar) OR (host = 'm562'::bpchar) OR (host = 'm561'::bpchar) OR (host = 'm560'::bpchar) OR (host = 'm559'::bpchar) OR (host = 'm558'::bpchar) OR (host = 'm557'::bpchar) OR (host = 'm538'::bpchar) OR (host = 'm755'::bpchar) OR (host = 'm754'::bpchar) OR (host = 'm753'::bpchar) OR (host = 'm752'::bpchar) OR (host = 'm751'::bpchar) OR (host = 'm750'::bpchar) OR (host = 'm749'::bpchar) OR (host = 'm748'::bpchar)) AND (starttime >= '2004-03-27 07:37:43'::timestamp without time zone) AND (starttime <= '2004-03-30 07:40:08'::timestamp without time zone)) (4 rows) > -----Original Message----- > From: Manfred Koizar [mailto:mkoi-pg@aon.at] > Sent: Wednesday, March 31, 2004 1:18 AM > To: Mooney, Ryan > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Large DB > > > On Tue, 30 Mar 2004 17:48:14 -0800, "Mooney, Ryan" > <ryan.mooney@pnl.gov> > wrote: > >I have a single table that just went over 234GB in size with about > >290M+ rows. > > That would mean ~ 800 bytes/row which, given your schema, is > hard to believe unless there are lots of dead tuples lying around. > > >queries use the indexes fairly well, although I suspect that > the order > >of host/starttime is suboptimal (fewer hosts than starttime, and the > >table is naturally in starttime order). I'm going to try adding an > >index on just starttime (and later just host) and see if I > can tune the > >queries on that more. > > Yes, if you are ready to switch OS for a 10% performance > gain, getting your indices right should be no question. > > > I never delete rows from the table, only do > >inserts (up to around 11,000/minute mostly in one big burst every > >minute, this is anticipated to go up some over time). > > How often do you ANALYSE? > > Have there been DELETEs or UPDATEs or aborted transactions in > the past? Did you VACUUM or VACUUM FULL since then? > > > I'm only doing sub 15MB from the disk > >array (from iostat) and I know it can do in the 40-60MB > range when we > >tested the raw speed, > > Sounds plausible for nonsequential I/O. > > >However the two indexes are also - large (which may be part of the > >problem, which is why I'm trying just starttime for an > index; They are > >currently in the 140-150G range). > > This would be extreme index bloat which is only possible > after massive DELETEs/UPDATEs. > > >stats=> explain select count(*) from tp3; > > -> Seq Scan on tp3 (cost=0.00..6906493.16 > rows=290602016 width=0) > > The planner thinks that the table size is 4M pages, 32GB. > The average tuple size of ~110 bytes (including tuple header) > suits your schema quite nicely. 
On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:
>I haven't run ANALYZE on this table in a while. After about 50-60M rows
>it didn't seem to change the query plan at all and since there were
>never any deletes/updates it seemed like it wasn't making much/any
>difference (should have been no pages to reclaim).

Reclaiming pages is not the job of ANALYSE, VACUUM does this.

> That may be an
>invalid assumption though.

Might be a valid assumption as well -- if you're lucky. But do you want to depend on luck? E.g. 75% of today's rows contain timestamps that are greater than what the planner believes to be the maximum.

No VACUUM, no ANALYSE, no REINDEX. This explains why the planner thinks there are only 4M pages, which gives 640 bytes/row if there were 50M rows at that time. OTOH the EXPLAIN shows 290M rows for the seq scan. Something doesn't fit together here.

Hackers, what could update reltuples, but not relpages?

Or, Ryan, is it possible that you already had 290M rows when you ran ANALYSE and you have more than 1G rows today?

BTW, ANALYSE is basically a constant time operation.

>Here is an explain on the query:
>
>=> explain select point, avg(pvalue) as avg from tp3 where host in

This tells us one half of the story. EXPLAIN ANALYSE SELECT ... would tell us the other half, too.

Servus
 Manfred
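One way to check whether the planner's picture of tp3 has gone stale in the way described above (a sketch; pg_class and pg_stats are the standard catalogs, but note that the max() at the end is itself a full scan on 7.4):

----------------------------
-- What the planner currently believes about the table
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tp3';

-- The newest starttime the statistics know about vs. reality;
-- if max(starttime) is far past the last histogram bound, the
-- statistics are stale
SELECT histogram_bounds
  FROM pg_stats
 WHERE tablename = 'tp3' AND attname = 'starttime';

SELECT max(starttime) FROM tp3;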
Manfred Koizar <mkoi-pg@aon.at> writes:
> Hackers, what could update reltuples, but not relpages?

Nothing --- they are always updated together. One possibility is that the 4M pages and 290M rows numbers really do go together (for about 112 bytes/row) and that the table has since grown, or perhaps merely bloated due to lack of vacuuming of updated rows.

A different line of thought is that they were updated together, but the relpages estimate was accurate while reltuples was not. ANALYZE knows the actual table size in pages (because it asks the OS) but reltuples is extrapolated from an average of the number of live tuples on the pages ANALYZE looks at. It is possible for ANALYZE to be fooled badly if, for instance, there are lots and lots of dead rows near the start of the table. (Lack of regular vacuuming would certainly improve the odds of this happening...)

Note that VACUUM is not subject to this error because it has to grovel over every page anyway. So either "VACUUM" or "VACUUM ANALYZE" will give you a known-good reltuples, it's only standalone "ANALYZE" that has a risk of estimation error.

			regards, tom lane
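To make the distinction concrete (a sketch only): a standalone ANALYZE extrapolates reltuples from a small sample, while VACUUM ANALYZE derives it from an actual pass over every page - much slower on a 234GB table, but trustworthy.

----------------------------
ANALYZE tp3;          -- samples a few thousand rows; reltuples is an estimate
VACUUM ANALYZE tp3;   -- visits every page; reltuples is exact, but slow

SELECT relname, relpages, reltuples
  FROM pg_class
 WHERE relname = 'tp3';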
On Thu, 01 Apr 2004 12:22:58 +0200, I wrote:
>BTW, ANALYSE is basically a constant time operation.

On closer inspection, this is not the whole truth. ANALY[SZ]E is a two stage process: First it collects a sample of rows, then these rows are examined to produce various statistics.

The cost of the latter depends on the sample size, which itself depends on the default or column-specific statistics target, and the number (and types) of columns, so it *should* take more or less constant time.

The first step, however, (acquire_sample_rows() in analyze.c) has to read more rows than finally end up in the sample. It visits less than O(nblocks) pages but certainly more than O(1).

A vague feeling tries to tell me that the number of page reads is somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which grow like O(ln(n)).

I have an idea how this could be done with O(1) page reads. If I'm able to translate it into C, I'll send a patch ...

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> The first step, however, (acquire_sample_rows() in analyze.c) has to
> read more rows than finally end up in the sample. It visits less than
> O(nblocks) pages but certainly more than O(1).

> A vague feeling tries to tell me that the number of page reads is
> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
> grow like O(ln(n)).

Good guess. Vitter's paper says the expected time to sample n rows from a table of size N is O(n * (1 + log(N/n))).

> I have an idea how this could be done with O(1) page reads.

The hard part is getting a genuinely random sample when we don't know N in advance. We do however know the table size in blocks, so if you're willing to make assumptions about constant tuple density you could do something different. (But the tuple density assumption is exactly the weak spot of what we've got, so I'm unconvinced that would be a big step forward.)

			regards, tom lane
[time to move this to -hackers]

On Fri, 02 Apr 2004 11:16:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Manfred Koizar <mkoi-pg@aon.at> writes:
>> The first step, however, (acquire_sample_rows() in analyze.c) has to
>> read more rows than finally end up in the sample. It visits less than
>> O(nblocks) pages but certainly more than O(1).
>
>> A vague feeling tries to tell me that the number of page reads is
>> somehow related to the harmonic numbers 1 + 1/2 + 1/3 + ... + 1/n, which
>> grow like O(ln(n)).
>
>Good guess. Vitter's paper says the expected time to sample n rows from
>a table of size N is O(n * (1 + log(N/n))).

Well, for what I tried to find out my wild guess seems to be wrong. I don't doubt that Vitter's formula is correct, but it assumes that access to any tuple has the same cost. This does not apply to our problem, however.

With 100 tuples per page, we access the first sample_size tuples at a cost of 0.01 sequential page reads per tuple. Later we use less and less tuples per page which results in higher per-tuple cost. Near the end of a large relation we can expect to access only one tuple per page and more and more pages are skipped, so that prefetching doesn't help any more.

Playing around with some real numbers (for 100 tuples/page and a sample size of 3000) I got:

 rel size | page reads
----------+-----------
       30 |    30
      300 |   300      expectation is something like 299.9995
      500 |   499
       1K |   990
       3K |  2.6K
      30K |    8K
     100K |   12K
       1M |   19K
      10M |   26K
     100M |   33K

This growth rate is steeper than O(log(nblocks)).

>> I have an idea how this could be done with O(1) page reads.

What I have in mind is a kind of "Double Vitter" algorithm. Whatever we do to get our sample of rows, in the end the sampled rows come from no more than sample_size different blocks. So my idea is to first create a random sample of sample_size block numbers, and then to sample the rows out of this pool of blocks.

I have to think harder though, what to do about those 400 pages that are not accessed when the sample size is 3000 ...

>The hard part is getting a genuinely random sample when we don't know N
>in advance. We do however know the table size in blocks, so if you're
>willing to make assumptions about constant tuple density you could do
>something different. (But the tuple density assumption is exactly the
>weak spot of what we've got, so I'm unconvinced that would be a big step
>forward.)

Starting the scan at some random blocks should help against the common case of unusual distribution of dead tuples near the start of the relation. And I plan to factor information about dead tuple hits into an increasingly better estimation of dead/live tuple ratio.

Servus
 Manfred
> On Wed, 31 Mar 2004 10:08:26 -0800, "Mooney, Ryan"
> <ryan.mooney@pnl.gov> wrote:
> >I haven't run ANALYZE on this table in a while. After about 50-60M
> >rows it didn't seem to change the query plan at all and since there
> >were never any deletes/updates it seemed like it wasn't making much/any
> >difference (should have been no pages to reclaim).
>
> Reclaiming pages is not the job of ANALYSE, VACUUM does this.

Sorry, yes, that's what I meant - lack of sleep :>

> Might be a valid assumption as well -- if you're lucky. But
> do you want to depend on luck? E.g. 75% of today's rows
> contain timestamps that are greater than what the planner
> believes to be the maximum.

Point taken. See vacuum info below...

> No VACUUM, no ANALYSE, no REINDEX. This explains why the
> planner thinks there are only 4M pages, which gives 640
> bytes/row if there were 50M rows at that time. OTOH the
> EXPLAIN shows 290M rows for the seq scan. Something doesn't
> fit together here.

Ok, so I ran a vacuum analyse. It took ~1.7 days to finish. I'm running another one now to see how it does. I had turned off data updates during the vacuum below; I've turned them back on for the current run to get some idea of how it's going to work ongoing. This means the machine was basically idle during the vacuum, so that's a pretty good picture of how long just the vacuum took.

The invalid page block was caused when I tried the 2.6 kernel (for other reasons than DB performance); it's been there for a while, and I can deal w/ the data loss, so I just turned on zero_damaged_pages = true (a nice feature would be to be able to set this per table - I didn't see any way to do that). I did try doing the "select ctid from tp3 offset 10257032 limit 20;" and deleting rows with those ctid's, but even after deleting several hundred rows it was still complaining, so I gave up and moved on...

INFO:  vacuuming "public.tp3"
INFO:  index "tp3_host_starttime" now contains 1401476187 tuples in 19778677 pages
DETAIL:  11657230 index tuples were removed.
108729 index pages have been deleted, 1 are currently reusable.
CPU 892.96s/602.59u sec elapsed 149710.57 sec.
ERROR:  invalid page header in block 10257032 of "tp3_point_starttime"

I had thought that there had never been any deletes/updates, but my co-worker pointed out that we did do some updates on a couple million rows a couple months back, so that explains the removed index tuples.

Here is the explain analyse; you can see why I think that an index on just host might be better - hosts are a small set, starttime is a large set, so the index should be more efficient. At the very least it should be (starttime, host), not (host, starttime), unless the indexing engine is smart enough to make that not matter (I suspect it's not???).
Db => explain analyse select point, avg(pvalue) as avg from tp3 where host in ('m480', 'm456', 'm455', 'm453', 'm452', 'm451', 'm450', 'm449', 'm368', 'm367', 'm366', 'm365', 'm364', 'm363', 'm362', 'm361', 'm783', 'm782', 'm781', 'm779', 'm778', 'm776', 'm774', 'm739', 'm738', 'm737', 'm736', 'm735', 'm732', 'm731', 'm730', 'm727') and starttime between '2004-03-30 07:28:25' and '2004-03-30 08:37:33' group by point; HashAggregate (cost=195.08..195.08 rows=1 width=25) (actual time=98667.32..98667.37 rows=12 loops=1) -> Index Scan using tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime, tp3_host_starttime on tp3 (cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43 rows=206238 loops=1) Index Cond: (((host = 'm480'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm456'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm455'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm453'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm452'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm451'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm450'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm449'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm368'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm367'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm366'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm365'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm364'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm363'::bpchar) AND (starttime >= '2004-03-30 
07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm362'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm361'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm783'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm782'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm781'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm779'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm778'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm776'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm774'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm739'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm738'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm737'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm736'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm735'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm732'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm731'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm730'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone)) OR ((host = 'm727'::bpchar) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone))) Filter: (((host = 'm480'::bpchar) OR (host = 'm456'::bpchar) OR (host = 'm455'::bpchar) OR (host = 'm453'::bpchar) OR (host = 'm452'::bpchar) OR (host = 'm451'::bpchar) OR (host = 'm450'::bpchar) OR (host = 'm449'::bpchar) OR (host = 'm368'::bpchar) OR (host = 'm367'::bpchar) OR (host = 'm366'::bpchar) OR (host = 'm365'::bpchar) OR (host = 'm364'::bpchar) OR (host = 'm363'::bpchar) OR (host = 
'm362'::bpchar) OR (host = 'm361'::bpchar) OR (host = 'm783'::bpchar) OR (host = 'm782'::bpchar) OR (host = 'm781'::bpchar) OR (host = 'm779'::bpchar) OR (host = 'm778'::bpchar) OR (host = 'm776'::bpchar) OR (host = 'm774'::bpchar) OR (host = 'm739'::bpchar) OR (host = 'm738'::bpchar) OR (host = 'm737'::bpchar) OR (host = 'm736'::bpchar) OR (host = 'm735'::bpchar) OR (host = 'm732'::bpchar) OR (host = 'm731'::bpchar) OR (host = 'm730'::bpchar) OR (host = 'm727'::bpchar)) AND (starttime >= '2004-03-30 07:28:25'::timestamp without time zone) AND (starttime <= '2004-03-30 08:37:33'::timestamp without time zone))
 Total runtime: 98760.26 msec
(5 rows)

> Hackers, what could update reltuples, but not relpages?
>
> Or, Ryan, is it possible that you already had 290M rows when
> you ran ANALYSE and you have more than 1G rows today?

Yes, appears to be the case...

> BTW, ANALYSE is basically a constant time operation.

Having followed the ongoing discussion about this I can concur that it is definitely NOT O(1). Unfortunately I didn't track the "time to vacuum" as the table grew. However, I believe that I'm going to follow the suggestions about reducing the table size, so I'll have a brand new DB to play with in a couple weeks; knowing what I know now, I can track that if anyone's interested in the data besides me :)

> Servus
>  Manfred
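Since the error reported above is in an index page rather than in the heap, the simplest recovery (also suggested later in the thread) is to rebuild that index - a sketch, keeping in mind that an index build on a table this size runs for hours and blocks writes while it runs:

----------------------------
DROP INDEX tp3_point_starttime;
CREATE INDEX tp3_point_starttime ON tp3 (point, starttime, cooked);
-- (REINDEX INDEX tp3_point_starttime is the other option)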
On Sat, 03 Apr 2004 22:39:31 -0800, "Mooney, Ryan" <ryan.mooney@pnl.gov> wrote:
>Ok, so I ran a vacuum analyse. It took ~1.7 days to finish.

Just to make it clear: VACUUM and ANALYSE are two different commands.

VACUUM is for cleaning up. It has to visit every tuple in every page, and if there are dead row versions, it also has to scan all indices belonging to the table. If there are lots of deleted rows and vacuum_mem is too small, VACUUM has to scan each index several times to delete all index entries pointing to dead rows. This might raise the cost to even more than O(tuplecount).

ANALYSE collects a random sample of the rows in the table; the sample size depends on default_statistics_target and the maximum value you have set any column's statistics target to (ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ...). If you didn't touch either, the sample size is 3000 rows. Then these 3000 rows are sorted and counted in different ways to generate statistics.

The number of pages that have to be touched to collect the sample depends on the table size, but it does by far not grow proportionally to the number of pages, nblocks. The cost growth rate is greater than O(ln(nblocks)) and significantly less than O(nblocks). I have no simple formula for it, but I estimate that analysing your tp3 table would need between 28000 and 30000 page reads, which should be doable in a few minutes.

VACUUM ANALYSE is just VACUUM followed by ANALYSE, with the added benefit that the number of rows does not have to be estimated by ANALYSE, because VACUUM knows the exact value.

>The invalid page block was caused when I tried the 2.6 kernel
>(for other reasons than DB performance); it's been there for a while,
>and I can deal w/ the data loss

>ERROR: invalid page header in block 10257032 of "tp3_point_starttime"

AFAICS the invalid page is in an index, so there is no data loss. You could simply drop and re-create that index. That might take some time, though :-(

>Here is the explain analyse; you can see why I think that an index on
>just host might be better - hosts are a small set, starttime is a large
>set, so the index should be more efficient.

I think you got that backwards. If there are not many hosts, then an index on host is not very selective, IOW you get a lot of hits when you look for a particular host. OTOH if you select a sufficiently small starttime interval, you get only a few rows, so using an index is most efficient.

>At the very least it should be (starttime, host), not (host,
>starttime), unless the indexing engine is smart enough to make that not
>matter (I suspect it's not???).

Yes, it should be (starttime, host). And no, PG is generally not smart enough to use an index if there is no condition on the first index column.

>  ->  Index Scan using tp3_host_starttime, tp3_host_starttime,
>      [...], tp3_host_starttime on tp3
>      (cost=0.00..195.07 rows=1 width=25) (actual time=59.42..96366.43
>      rows=206238 loops=1)

Estimated number of rows: 1
Actual number of rows: 206238
The planner is way off here. Fortunately your huge number of rows makes it rule out every other (most probably slower) plan.

How many hosts are there? Even if there are a few hundred, an index scan with that condition has to access and skip millions of index tuples. An index on (starttime, host) would visit fewer index tuples, and would more likely access the heap tuples in physical order.

>Having followed the ongoing discussion about this I can concur that it is
>definitely NOT O(1).
>Unfortunately I didn't track the "time to vacuum"

The discussion on -hackers and the patch I posted earlier today are about ANALYSE, not VACUUM.

>However, I believe that I'm going to follow the suggestions
>about reducing the table size, so I'll have a brand new DB to play with
>in a couple weeks;

Hopefully we'll see a success story here.

>knowing what I know now, I can track that if anyone's
>interested in the data besides me :)

VACUUM and ANALYSE times? Certainly.

Servus
 Manfred
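A sketch of the reordered index Manfred recommends (the index name is invented; as with any index build on this table it will run for a long time and block inserts while it builds):

----------------------------
CREATE INDEX tp3_starttime_host ON tp3 (starttime, host, cooked);

-- With starttime leading, a narrow time window prunes most of the
-- index before the host list is even considered
EXPLAIN ANALYZE
SELECT point, avg(pvalue) AS avg
  FROM tp3
 WHERE starttime BETWEEN '2004-03-30 07:28:25' AND '2004-03-30 08:37:33'
   AND host IN ('m480', 'm456', 'm455')
 GROUP BY point;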
I've been following this thread with interest since it started, and it really seems that there is just too much data in that single table. When it comes down to it, making smaller separate tables seems to be the way to go. Querying will be a little harder, but much faster.

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+---------------------------------+
| http://www.did-it.com | "When you have to shoot, shoot, |
| eric@did-it.com       | don't talk! - Tuco              |
| 516-255-0500          |                                 |
+-----------------------+---------------------------------+

Manfred Koizar wrote:
> [previous message quoted in full - snipped]
I'm a fairly new PostgreSQL user. And a long-time Oracle user, so keep that in mind.

So you are telling me that for 206238 rows on a table the best route is to break the table into separate tables? To me that is horrible to have to do unless you really wanted to do that for some reason. And unless the count is way higher than I read, how is 206238 a large amount of rows on a table?

Ericson Smith <eric@did-it.com> wrote:
> I've been following this thread with interest since it started, and it
> really seems that there is just too much data in that single table. When
> it comes down to it, making smaller separate tables seems to be the way
> to go. Querying will be a little harder, but much faster.
>
> [rest of quoted thread snipped]
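Likewise, the rows=1 vs. rows=206238 mismatch in the plan suggests the statistics are stale or too coarse. A minimal, illustrative follow-up (the statistics target of 100 and the host/timestamp values are arbitrary examples, not from the thread):

-- Sample these columns harder, then re-analyse just this table
-- (far cheaper than the 1.7-day VACUUM ANALYSE mentioned above):
ALTER TABLE tp3 ALTER COLUMN host      SET STATISTICS 100;
ALTER TABLE tp3 ALTER COLUMN starttime SET STATISTICS 100;
ANALYZE tp3;

-- Then compare estimated vs. actual rows again:
EXPLAIN ANALYZE
SELECT count(*) FROM tp3
WHERE host = 'n001'
  AND starttime BETWEEN '2004-04-01' AND '2004-04-02';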
Actually, in the case referenced I have over 1.4 billion rows; the 206238-row example should be no problem. I had no performance issues whatsoever (relative to realistic disk I/O bandwidth, etc.) in the hundreds of millions of rows, even with sub-optimal indexes. My average query has to reference 2-30 million rows, so yes, that can take up to a few minutes, but even dd'ing that amount of data to/from the disk array took almost as long (that's what I used as my base benchmark; if my query approaches that, I don't see any room to complain :>).

I could only really see 206K rows being a major issue IF they were really large rows AND you were doing really complex queries AND the indexes were really poorly defined, or you were on REALLY slow hardware. Unfortunately I didn't follow that thread as closely, so I don't remember exactly what the issues expressed there were.

> -----Original Message-----
> From: Bob.Henkel@hartfordlife.com
> Sent: Tuesday, April 06, 2004 2:17 PM
> To: Ericson Smith
> Cc: Manfred Koizar; pgsql-general@postgresql.org; Mooney, Ryan
> Subject: Re: [GENERAL] Large DB
>
> So you are telling me that for 206238 rows in a table the best route is
> to break the table into separate tables? To me that is a horrible thing
> to have to do unless you really wanted to do it for some other reason.
> [...]
I don't think that's what's being said; I run tables with tens of millions of rows without issue.

Gavin

Bob.Henkel@hartfordlife.com wrote:

>I'm a fairly new PostgreSQL user, and a long-time Oracle user, so keep
>that in mind.
>
>So you are telling me that for 206238 rows in a table the best route is
>to break the table into separate tables? To me that is a horrible thing
>to have to do unless you really wanted to do it for some other reason.
>[...]
My point was that if multi-minute query response times are OK, then you can continue as you are. But if the row sizes are really large, or you have lots of text data, or there are tons of updates, and you want sub-second response times, then I see no option but to beef up hardware dramatically or, indeed, to split the data into smaller tables that you can still logically query. That's what we were forced to do, and trust me, we looked at our schemas for over a year before deciding to do it. It paid off handsomely. It's not such a horrible alternative as you might think.

- Ericson Smith

Gavin M. Roy wrote:

> I don't think that's what's being said; I run tables with tens of
> millions of rows without issue.
>
> Gavin
>
> Bob.Henkel@hartfordlife.com wrote:
>
>> So you are telling me that for 206238 rows in a table the best route
>> is to break the table into separate tables?
>> [...]
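For what "smaller separate tables that you can still logically query" could look like in practice, here is a minimal sketch using plain table inheritance, the only mechanism available at the time (constraint exclusion came later, so queries against the parent still visit every child). The child table names, the one-table-per-month split, and the sample values are invented for illustration:

-- One child table per month, inheriting the tp3 schema:
CREATE TABLE tp3_2004_03 () INHERITS (tp3);
CREATE TABLE tp3_2004_04 () INHERITS (tp3);

-- Each child carries its own, much smaller index:
CREATE INDEX tp3_2004_04_starttime_host ON tp3_2004_04 (starttime, host);

-- Loaders insert into the child for the current month
-- (real rows would of course populate the remaining columns too):
INSERT INTO tp3_2004_04 (host, starttime) VALUES ('n001', '2004-04-06 08:00:00');

-- Queries that know their time range hit one small child directly, while
-- SELECT ... FROM tp3 still sees all children when the full history is needed:
SELECT count(*)
FROM tp3_2004_04
WHERE host IN ('n001', 'n002')
  AND starttime BETWEEN '2004-04-01' AND '2004-04-07';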