Thread: Optimizer internals
I'm not a programmer, so understanding the optimizer code is WAY beyond my limits.
My question, which I haven't seen answered elsewhere, is: WHAT things can affect the choice of an index scan over a sequence scan? I understand that sometimes a sequence scan is faster and that you still have to get the data from disk, but my question relates to an issue we had pop up today.
We have 2 tables, which we'll refer to as laaf and laaf_new. The first table has 220M rows and the second table has 4M rows. What we're basically doing is aggregating the records from the first table into the second one, at which point we're going to drop the first one. This is the same table I mentioned previously in my post about pg_dump.
laaf_new has one fewer column than laaf, and both were freshly vacuum analyzed after having an index added on a single column (other than the primary key).
The query we were doing was as follows:
select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact_new
group by main_account_status_dim_id
order by main_account_status_dim_id;
One of our problems is that we don't have any PostgreSQL DBAs here. All of our guys are DB2 people (we're still looking, though).
Now I've been told by our DBA that we should have been able to wholly satisfy that query via the indexes.
We did regular EXPLAINs on the query with seqscan enabled and disabled, and even in our own tests actually running the queries, the results WERE faster with a seq scan than an index scan. But the questions we were discussing are: WHY did it choose the index scan, and why is the index scan slower than the sequence scan? He's telling me that DB2 would have been able to do the whole thing with indexes.
EXPLAINs:
(The reason for setting random_page_cost is that we had the default of 4 in the .conf file and were planning to change it to 2 anyway, to match our other server.)
set random_page_cost=2;
set enable_seqscan=on;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;
"Sort (cost=8774054.54..8774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=8774052.60..8774053.20 rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact (cost=0.00..7609745.40 rows=232861440 width=4)"
set random_page_cost=2;
set enable_seqscan=off;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;
"Sort (cost=108774054.54..108774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=108774052.60..108774053.20 rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact (cost=100000000.00..107609745.40 rows=232861440 width=4)"
Here's the DDL for the table laaf:
When the system is less busy, I'll run a verbose version. The query was run against each of the tables to compare the aggregation results between the old table and the new one.
CREATE TABLE cla_dw.loan_account_agg_fact
(
loan_account_agg_fact_id int8 NOT NULL DEFAULT nextval('loan_account_agg_fact_loan_account_agg_fact_id_seq'::regclass),
dw_load_date_id int4 NOT NULL DEFAULT 0,
servicer_branch_dim_id int4 NOT NULL DEFAULT 0,
main_account_status_dim_id int4 NOT NULL DEFAULT 0,
product_dim_id int4 NOT NULL DEFAULT 0,
next_due_date_id int4 NOT NULL DEFAULT 0,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) NOT NULL DEFAULT 0,
unaccruable_principal numeric(15,6) NOT NULL DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) NOT NULL DEFAULT 0,
past_due_interest numeric(16,5) NOT NULL DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_collected_date_id int4 DEFAULT 0,
dw_agg_load_dt timestamp(0) DEFAULT ('now'::text)::timestamp(6) with time zone,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey PRIMARY KEY (loan_account_agg_fact_id)
)
WITH OIDS;
CREATE INDEX loan_account_agg_fact_main_account_status_dim_id
ON cla_dw.loan_account_agg_fact
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part1_ts;
Here's the DDL for the table laaf_new:
CREATE TABLE cla_dw.loan_account_agg_fact_new
(
loan_account_agg_fact_id bigserial NOT NULL,
dw_load_date_id int4 NOT NULL,
servicer_branch_dim_id int4 NOT NULL,
main_account_status_dim_id int4 NOT NULL,
product_dim_id int4 NOT NULL,
dw_agg_load_dt timestamp,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) DEFAULT 0,
unaccruable_principal numeric(15,6) DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) DEFAULT 0,
past_due_interest numeric(15,6) DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey_new PRIMARY KEY (loan_account_agg_fact_id) USING INDEX TABLESPACE default_ts
)
WITH OIDS TABLESPACE fact_data_part1_ts;
CREATE INDEX laafn_main_account_status_dim
ON cla_dw.loan_account_agg_fact_new
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part2_ts;
On Thu, 2006-06-15 at 14:05 -0400, John Vincent wrote:
> Now I've been told by our DBA that we should have been able to wholly
> satisfy that query via the indexes.

DB2 can satisfy the query using only indexes because DB2 doesn't do MVCC.

Although MVCC is generally a win in terms of making the database easier to use and applications less brittle, it also means that the database must inspect the visibility information for each row before it can answer a query. For most types of queries this isn't a big deal, but for count(*) type queries, it slows things down.

Since adding the visibility information to indexes would make them significantly more expensive to use and maintain, it isn't done. Therefore, each row has to be fetched from the main table anyway. Since in this particular query you are counting all rows of the database, PG must fetch each row from the main table regardless, so the sequential scan is much faster because it avoids traversing the index and performing random read operations.

-- Mark Lewis
On 6/15/06, Mark Lewis <mark.lewis@mir3.com> wrote:
> DB2 can satisfy the query using only indexes because DB2 doesn't do
> MVCC.
>
> Although MVCC is generally a win in terms of making the database easier
> to use and applications less brittle, it also means that the database
> must inspect the visibility information for each row before it can
> answer a query. For most types of queries this isn't a big deal, but
> for count(*) type queries, it slows things down.
>
> Since adding the visibility information to indexes would make them
> significantly more expensive to use and maintain, it isn't done.
> Therefore, each row has to be fetched from the main table anyway.
> Since in this particular query you are counting all rows of the
> database, PG must fetch each row from the main table regardless, so the
> sequential scan is much faster because it avoids traversing the index
> and performing random read operations.
>
> -- Mark Lewis

Mark,

Thanks for the answer. My DBAs just got this look on their faces when I showed them. It's not like they couldn't have investigated this information themselves, but I think the light finally came on.

One question that we came up with is: how does this affect other aggregate functions like MAX, MIN, SUM and whatnot? Being that this is our data warehouse, we use these all the time. As I've said previously, I didn't know a human could generate some of the queries we've passed through this system.
On Thu, 2006-06-15 at 14:46 -0400, John Vincent wrote:
> One question that we came up with is: how does this affect other
> aggregate functions like MAX, MIN, SUM and whatnot? Being that this is
> our data warehouse, we use these all the time. As I've said previously,
> I didn't know a human could generate some of the queries we've passed
> through this system.

Previously, MIN and MAX would also run slowly, for the same reason as COUNT(*). But there really isn't a need for that, since you can still get a big speedup by scanning the index in order, looking up each row and stopping as soon as you find a visible one. This has been fixed, so newer versions of PG will run quickly and use the index for MIN and MAX. I don't remember which version had that change; it might not be until 8.2. You can dig through the archives to find out for sure.

For older versions of PG before the fix, you can make MIN and MAX run quickly by rewriting them in the following form:

SELECT column FROM table ORDER BY column LIMIT 1;

Unfortunately SUM is in the same boat as COUNT; in order for it to return a meaningful result it must inspect visibility information for all of the rows.

-- Mark
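To make Mark's rewrite concrete against the table from this thread (a sketch only; the column is the one from the original query, and the DDL above declares it NOT NULL, so the descending case needs no NULL handling):

-- MIN() rewritten as an ascending scan that stops at the first visible row:
select main_account_status_dim_id
from cla_dw.loan_account_agg_fact
order by main_account_status_dim_id asc
limit 1;

-- MAX() is the same idea in reverse:
select main_account_status_dim_id
from cla_dw.loan_account_agg_fact
order by main_account_status_dim_id desc
limit 1;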
On 6/15/06, Mark Lewis <mark.lewis@mir3.com> wrote:
> Unfortunately SUM is in the same boat as COUNT; in order for it to
> return a meaningful result it must inspect visibility information for
> all of the rows.
>
> -- Mark

Well, this is interesting news to say the least. We went with PostgreSQL for our warehouse because we needed the advanced features that MySQL didn't have at the time (views/sprocs).

It sounds like we almost need another fact table for the places where we do SUM (which is not a problem, just an additional map). If I'm interpreting this all correctly, we can't force PG to bypass a sequence scan even if we know our data is stable, because of the MVCC aspect. In our case, as with most warehouses (except those that do rolling loads during the day), we only write data to it for about 5 hours at night in batch.

Any suggestions? FYI, the original question wasn't meant as a poke at comparing PG to MySQL to DB2. I'm not making any value judgements either way. I'm just trying to understand how we can use it the best way possible.

If anyone from the bizgres team is watching, have they done any work in this area?

Thanks.
John
On Thu, 2006-06-15 at 14:21, John Vincent wrote:
> On 6/15/06, Mark Lewis <mark.lewis@mir3.com> wrote:
> > Unfortunately SUM is in the same boat as COUNT; in order for it to
> > return a meaningful result it must inspect visibility information for
> > all of the rows.
> >
> > -- Mark
>
> Well, this is interesting news to say the least. We went with
> PostgreSQL for our warehouse because we needed the advanced features
> that MySQL didn't have at the time (views/sprocs).
>
> It sounds like we almost need another fact table for the places where
> we do SUM (which is not a problem, just an additional map). If I'm
> interpreting this all correctly, we can't force PG to bypass a
> sequence scan even if we know our data is stable, because of the MVCC
> aspect. In our case, as with most warehouses (except those that do
> rolling loads during the day), we only write data to it for about 5
> hours at night in batch.
>
> Any suggestions? FYI, the original question wasn't meant as a poke at
> comparing PG to MySQL to DB2. I'm not making any value judgements
> either way. I'm just trying to understand how we can use it the best
> way possible.
>
> If anyone from the bizgres team is watching, have they done any work
> in this area?

This might help:
http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html

Since you're doing a data warehouse, I would think materialized views would be a natural addition anyway.
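For the nightly-batch pattern John describes, the simplest form of materialized view is a summary table rebuilt at the end of the load, along the lines of the "snapshot" approach in that article. A sketch; the table name laaf_status_summary is illustrative, not from the thread:

-- Build the summary once:
create table cla_dw.laaf_status_summary as
select main_account_status_dim_id, count(*) as row_count
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id;

-- Refresh at the end of the nightly batch window:
truncate table cla_dw.laaf_status_summary;
insert into cla_dw.laaf_status_summary
select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id;

-- Reports then read ~48 small rows instead of scanning 220M:
select * from cla_dw.laaf_status_summary order by main_account_status_dim_id;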
Actually, we just thought about something. With PG, we can create an index that is a SUM of the column we're indexing, no? We're going to test this in a few hours. Would that be able to be satisfied by an index scan?
Also, we're looking at the link provided for the materialized views in PG.
Thanks.
On Thu, Jun 15, 2006 at 03:43:09PM -0400, John Vincent wrote:
> > Any suggestions? FYI, the original question wasn't meant as a poke at
> > comparing PG to MySQL to DB2. I'm not making any value judgements
> > either way. I'm just trying to understand how we can use it the best
> > way possible.
> >
> > If anyone from the bizgres team is watching, have they done any work
> > in this area?
> >
> > Thanks.
> > John
>
> Actually, we just thought about something. With PG, we can create an
> index that is a SUM of the column we're indexing, no? We're going to
> test this in a few hours. Would that be able to be satisfied by an
> index scan?
>
> Also, we're looking at the link provided for the materialized views in PG.
>
> Thanks.

decibel=# create index test on i ( sum(i) );
ERROR:  cannot use aggregate function in index expression
decibel=#

BTW, there have been a number of proposals to negate the effect of not having visibility info in indexes. Unfortunately, none of them have come to fruition yet, mostly because it's a very difficult problem to solve. But it is something that the community would like to see happen.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
Yeah, we got the same thing when we tried it.

I thought about the whole thing on the way home, and the downside is that we might have to ditch pgsql.

As far as implementing it goes, it might make sense to translate READ UNCOMMITTED to that new functionality. If the default isolation level stays at the current level, the people who need it can use it via WITH UR or some such.

I know it's not that easy, but it's an idea. I'm also thinking that the table inheritance we're going to be taking advantage of in 8.1 on the new server might make the sequence scan less of an issue. The only reason the sequence scan really blows is that we have a single table with 220M rows and growing.
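For reference, the 8.1 mechanism John is alluding to is constraint-exclusion partitioning built on table inheritance. A minimal sketch; the child-table names and the assumption that dw_load_date_id is a YYYYMMDD-style integer are illustrative only:

-- Children inherit the parent's columns; the CHECK constraints define the partitions.
create table cla_dw.loan_account_agg_fact_2005 (
    check (dw_load_date_id >= 20050101 and dw_load_date_id < 20060101)
) inherits (cla_dw.loan_account_agg_fact);

create table cla_dw.loan_account_agg_fact_2006 (
    check (dw_load_date_id >= 20060101 and dw_load_date_id < 20070101)
) inherits (cla_dw.loan_account_agg_fact);

-- With constraint exclusion enabled (new in 8.1), a query that filters on the
-- partition key scans only the children whose CHECK constraints can match:
set constraint_exclusion = on;
select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
where dw_load_date_id >= 20060101
group by main_account_status_dim_id;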
Mark Lewis wrote:
> On Thu, 2006-06-15 at 14:05 -0400, John Vincent wrote:
>> Now I've been told by our DBA that we should have been able to wholly
>> satisfy that query via the indexes.
> DB2 can satisfy the query using only indexes because DB2 doesn't do
> MVCC.

You can get pretty much the same effect with materialized views. Create a table that LOOKS like the index (just those columns), with a foreign key relationship to the original table (cascade delete), and have the after-insert trigger on the main table write a row to the derived table. Now (index and) query the skinny table.

Advantage of these tables: you can cluster them regularly, because doing so doesn't hard-lock the main table.

--
Engineers think that equations approximate reality.
Physicists think that reality approximates the equations.
Mathematicians never make the connection.
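A sketch of that skinny-table idea applied to the fact table from this thread. Every name here (table, index, function, trigger) is illustrative, and it assumes the plpgsql language is installed in the database:

-- Just the key plus the one column the reports aggregate on:
create table cla_dw.laaf_skinny (
    loan_account_agg_fact_id int8 not null
        references cla_dw.loan_account_agg_fact (loan_account_agg_fact_id)
        on delete cascade,
    main_account_status_dim_id int4 not null
);

create index laaf_skinny_status on cla_dw.laaf_skinny (main_account_status_dim_id);

-- After-insert trigger keeps the skinny copy in sync with the main table:
create function cla_dw.laaf_skinny_sync() returns trigger as $$
begin
    insert into cla_dw.laaf_skinny
        values (new.loan_account_agg_fact_id, new.main_account_status_dim_id);
    return new;  -- return value is ignored for AFTER triggers
end;
$$ language plpgsql;

create trigger laaf_skinny_ai after insert on cla_dw.loan_account_agg_fact
    for each row execute procedure cla_dw.laaf_skinny_sync();

Aggregates (and CLUSTER) then run against the two-column skinny table instead of the 37-column fact table.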
Mark Lewis <mark.lewis@mir3.com> writes:
> On Thu, 2006-06-15 at 14:05 -0400, John Vincent wrote:
> > Now I've been told by our DBA that we should have been able to wholly
> > satisfy that query via the indexes.
>
> DB2 can satisfy the query using only indexes because DB2 doesn't do
> MVCC.

Well, it's more subtle than that. DB2 most certainly does provide MVCC semantics, as do Oracle and MSSQL and any other serious SQL implementation. But there are different ways to implement MVCC, and every database makes decisions that have pros and cons. Postgres's implementation has some big benefits over others (no rollback segments, no expensive recovery operations, fast inserts and updates), but it also has disadvantages (periodic vacuums, and indexes don't cover the data).

The distinction you're looking for here is sometimes called "optimistic" versus "pessimistic" space management. (Not locking, that's something else.) Postgres is "pessimistic" -- it treats every transaction as if it might be rolled back. Oracle and most others are "optimistic" -- they assume every transaction will be committed, and store information elsewhere to implement MVCC and recover in case it's rolled back.

The flip side is that Oracle and others like it have a lot of extra footwork to do if you query data that hasn't been committed yet. That footwork has performance implications.

-- greg
On 16 Jun 2006 07:23:26 -0400, Greg Stark <gsstark@mit.edu> wrote:
> The flip side is that Oracle and others like it have a lot of extra
> footwork to do if you query data that hasn't been committed yet. That
> footwork has performance implications.

Not disagreeing here at all, but considering that Oracle, DB2, and SQL Server, et al. have proven themselves to perform extremely well under heavy load (in multiple benchmarks), the overhead of an UNDO implementation has a calculable break-even point. Feel free to debate it, but the optimistic approach adopted by nearly every commercial database vendor is *generally* a better approach for OLTP.

Consider Weikum & Vossen (p. 442):

  We also need to consider the extra work that the recovery algorithm
  incurs during normal operation. This is exactly the catch with the
  class of no-undo/no-redo algorithms. By and large, they come at the
  expense of a substantial overhead during normal operations that may
  increase the execution cost per transaction by a factor of two or even
  higher. In other words, it reduces the achievable transaction
  throughput of a given server configuration by a factor of two or more.

Now, if we're considering UPDATEs (the worst case for PostgreSQL's current MVCC architecture), then this is (IMHO) a true statement. There aren't many *successful* commercial databases that incur the additional overhead of creating another version of the record, marking the old one as having been updated, inserting N-number of new index entries to point to said record, and having to WAL-log all aforementioned changes. I have yet to see any successful commercial RDBMS using some sort of no-undo algorithm that doesn't follow the "factor of two or more" performance reduction.

However, if you consider an INSERT or DELETE in PostgreSQL, those are implemented much better than in most commercial database systems due to PostgreSQL's MVCC design.

I've done a good amount of research on enhancing PostgreSQL's MVCC in UPDATE conditions and believe there is a nice happy medium for us.

/me waits for the obligatory and predictable "the benchmarks are flawed" response.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > Now, if we're considering UPDATES (the worst case for PostgreSQL's > current MVCC architecture), then this is (IMHO) a true statement. > There aren't many *successful* commercial databases that incur the > additional overhead of creating another version of the record, marking > the old one as having been updated, inserting N-number of new index > entries to point to said record, and having to WAL-log all > aforementioned changes. Well Oracle has to do almost all that same work, it's just doing it in a separate place called a rollback segment. There are pros and cons especially where it comes to indexes, but also where it comes to what happens when the new record is larger than the old one. > I've done a good amount of research on enhancing PostgreSQL's MVCC in UPDATE > conditions and believe there is a nice happy medium for us. IMHO the biggest problem Postgres has is when you're updating a lot of records in a table with little free space. Postgres has to keep jumping back and forth between the old records it's reading in and the new records it's writing out. That can in theory turn a simple linear update scan into a O(n^2) operation. In practice read-ahead and caching should help but I'm not clear to what extent. That and of course the visibility bitmap that has been much-discussed that might make vacuum not have to visit every page and allow index scans to skip checking visibility info for some pages would be major wins. > /me waits for the obligatory and predictable, "the benchmarks are > flawed" response. I wouldnt' say the benchmarks are flawed but I also don't think you can point to any specific design feature and say it's essential just on the basis of bottom-line results. You have to look at the actual benefit the specific wins. Oracle and the others all implement tons of features intended to optimize applications like the benchmarks (and the benchmarks specifically of course:) that have huge effects on the results. Partitioned tables, materialized views, etc allow algorithmic improvements that do much more than any low level optimizations can do. -- greg
On 16 Jun 2006 09:21:01 -0400, Greg Stark <gsstark@mit.edu> wrote:
> Well, Oracle has to do almost all that same work; it's just doing it in
> a separate place called a rollback segment.

Well, it's not really the same work. The process by which Oracle manages UNDO is actually pretty simple and efficient, but complex in its implementation. There have also been some significant performance improvements in this area in both 9i and 10g.

> There are pros and cons, especially where it comes to indexes, but also
> where it comes to what happens when the new record is larger than the
> old one.

Certainly, you want to avoid row chaining at all costs, which is why PCTFREE is there. I have researched update-in-place for PostgreSQL and can avoid row chaining... so I think we can get the same benefit without the management and administration cost.

> IMHO the biggest problem Postgres has is when you're updating a lot of
> records in a table with little free space.

Yes, this is certainly the most noticeable case. This is one reason I'm behind the freespace patch. Unfortunately, a lot of inexperienced people use VACUUM FULL and don't understand why VACUUM is *generally* better (to free up block-level freespace and update the FSM), assuming they have enough hard disk space for the database.

> That, and of course the much-discussed visibility bitmap

I'd certainly like to see it.

> I wouldn't say the benchmarks are flawed, but I also don't think you
> can point to any specific design feature and say it's essential just on
> the basis of bottom-line results. You have to look at the actual
> benefit of the specific wins.

True.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
On Jun 16, 2006, at 8:43 AM, Jonah H. Harris wrote:
> Yes, this is certainly the most noticeable case. This is one reason
> I'm behind the freespace patch. Unfortunately, a lot of inexperienced
> people use VACUUM FULL and don't understand why VACUUM is *generally*
> better (to free up block-level freespace and update the FSM), assuming
> they have enough hard disk space for the database.

Another reason to turn autovac on by default in 8.2...

>> That, and of course the much-discussed visibility bitmap
> I'd certainly like to see it.

What's the hold-up on this? I thought there were some technical issues that had yet to be resolved?

BTW, I'll point out that DB2 and MSSQL didn't switch to MVCC until their most recent versions.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461
On Thu, Jun 15, 2006 at 15:38:32 -0400, John Vincent <lusis.org@gmail.com> wrote:
> > Any suggestions? FYI, the original question wasn't meant as a poke at
> > comparing PG to MySQL to DB2. I'm not making any value judgements
> > either way. I'm just trying to understand how we can use it the best
> > way possible.
>
> Actually, we just thought about something. With PG, we can create an
> index that is a SUM of the column we're indexing, no? We're going to
> test this in a few hours. Would that be able to be satisfied by an
> index scan?

No, that won't work. While you can make indexes on functions of a row, you can't make indexes on aggregate functions.

You might find making a materialized view of the information you want can help with performance. The issues with "sum" are pretty much the same ones as with "count". You can find a couple of different ways of doing materialized views for "count" in the archives. There is a simple way of doing it that doesn't work well with lots of concurrent updates, and a more complicated method that does work well with lots of concurrent updates.
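For reference, the "simple way" Bruno mentions is usually a counter table maintained by triggers. A sketch with illustrative names; it assumes every status id is already seeded in the counter table (a real version needs an insert fallback), and every writer serializes on its group's counter row, which is why this degrades under heavy concurrent updates:

-- One row per grouping value, seeded once from the fact table:
create table cla_dw.laaf_counts as
select main_account_status_dim_id, count(*) as row_count
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id;

create function cla_dw.laaf_count_maint() returns trigger as $$
begin
    if tg_op = 'INSERT' then
        update cla_dw.laaf_counts set row_count = row_count + 1
         where main_account_status_dim_id = new.main_account_status_dim_id;
    elsif tg_op = 'DELETE' then
        update cla_dw.laaf_counts set row_count = row_count - 1
         where main_account_status_dim_id = old.main_account_status_dim_id;
    end if;
    return null;  -- AFTER trigger; return value is ignored
end;
$$ language plpgsql;

create trigger laaf_count_maint after insert or delete
    on cla_dw.loan_account_agg_fact
    for each row execute procedure cla_dw.laaf_count_maint();

-- count(*) by status then becomes a read of a handful of small rows:
select main_account_status_dim_id, row_count
from cla_dw.laaf_counts
order by main_account_status_dim_id;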