Thread: TB-sized databases
Hi all,

I have a user who is looking to store 500+ GB of data in a database (and when all the indexes and metadata are factored in, it's going to be more like 3-4 TB). He is wondering how well PostgreSQL scales with TB-sized databases and what can be done to help optimize them (mostly hardware and config parameters, maybe a little advocacy). I can't speak on that since I don't have any DBs approaching that size.

The other part of this puzzle is that he's torn between MS SQL Server (running on Windows and unsupported by us) and PostgreSQL (running on Linux... which we would fully support). If any of you have ideas of how well PostgreSQL compares to SQL Server, especially in TB-sized databases, that would be much appreciated.

We're running PG 8.2.5, by the way.

Peter
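(For context on the "config parameters" part of the question: on 8.2 the tuning usually starts with a handful of postgresql.conf settings like the following. The values are illustrative starting points for a dedicated box with plenty of RAM, not recommendations from anyone on this thread.)

# Illustrative postgresql.conf starting points for a multi-TB 8.2 database
# on a dedicated machine with, say, 16 GB of RAM. Assumed values, not advice
# from the list; every one of these depends on workload.
shared_buffers = 2GB                # often ~10-25% of RAM
effective_cache_size = 12GB         # what the OS page cache will realistically hold
work_mem = 64MB                     # per sort/hash, per backend: keep it modest
maintenance_work_mem = 1GB          # speeds up index builds and VACUUM
checkpoint_segments = 64            # spread out checkpoint I/O during bulk loads
wal_buffers = 1MB
random_page_cost = 3.0              # lower it if data is mostly cached or on fast arrays
default_statistics_target = 100     # better estimates on big, skewed tables
constraint_exclusion = on           # needed for partition pruning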
Peter Koczan wrote:
> He is wondering how well PostgreSQL scales with TB-sized databases and
> what can be done to help optimize them.

Well, I can't speak to MS SQL Server because all of our clients run PostgreSQL ;) but I can tell you we have many that are in the 500 GB - 1.5 TB range. All perform admirably as long as you have the hardware behind them and are doing correct table structuring (such as table partitioning).

Sincerely,

Joshua D. Drake
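(A minimal sketch of the kind of table partitioning Joshua is referring to, using 8.2-style inheritance plus CHECK constraints. Table and column names are invented for illustration.)

-- 8.2-style range partitioning: a parent table, child tables with CHECK
-- constraints, and constraint_exclusion = on so the planner skips partitions.
CREATE TABLE events (
    id      bigint,
    logged  timestamp,
    payload text
);

CREATE TABLE events_2007_11 (
    CHECK (logged >= '2007-11-01' AND logged < '2007-12-01')
) INHERITS (events);

CREATE TABLE events_2007_12 (
    CHECK (logged >= '2007-12-01' AND logged < '2008-01-01')
) INHERITS (events);

-- Index each child; the parent itself stays empty.
CREATE INDEX events_2007_11_logged_idx ON events_2007_11 (logged);
CREATE INDEX events_2007_12_logged_idx ON events_2007_12 (logged);

-- With constraint_exclusion = on, a query such as
--   SELECT count(*) FROM events
--   WHERE logged >= '2007-12-01' AND logged < '2007-12-08';
-- only touches events_2007_12.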
We have a several-TB database in production and it works well on an HP rx1620 (dual Itanium2, MSA 20), running Linux. It's read-only storage for astronomical catalogs with about 4 billion objects. We have a custom index for spherical coordinates which provides great performance.

Regards,

Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
I had a client that tried to use MS SQL Server to run a 500 GB+ database. The database simply collapsed. They switched to Teradata and it is running well. That database is now 1.5 TB+.

Currently I have clients running huge PostgreSQL databases and they are happy. In one client's database the biggest table is 237 GB+ (just one table!) and PostgreSQL runs the database without problems using partitioning, triggers and rules (on PostgreSQL 8.2.5).

Pablo
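(The "triggers and rules" Pablo mentions are typically used to route INSERTs on the parent table into the correct partition. A rough sketch against the invented "events" partitions above might look like this; it is not Pablo's actual schema.)

-- Rule-based INSERT routing for the invented "events" partitions above,
-- so applications can keep writing to the parent table.
CREATE RULE events_insert_2007_11 AS
    ON INSERT TO events
    WHERE (NEW.logged >= '2007-11-01' AND NEW.logged < '2007-12-01')
    DO INSTEAD
    INSERT INTO events_2007_11 VALUES (NEW.id, NEW.logged, NEW.payload);

CREATE RULE events_insert_2007_12 AS
    ON INSERT TO events
    WHERE (NEW.logged >= '2007-12-01' AND NEW.logged < '2008-01-01')
    DO INSTEAD
    INSERT INTO events_2007_12 VALUES (NEW.id, NEW.logged, NEW.payload);

-- INSERT INTO events VALUES (1, '2007-12-03 10:00', 'example row');
-- ends up in events_2007_12. A trigger can do the same job and tends to
-- scale better when there are hundreds of partitions.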
I think either would work; both PostgreSQL and MS SQL Server have success stories out there running VLDBs. It really depends on what you know and what you have. If you have a lot of experience with Postgres running on Linux, and not much with SQL Server on Windows, of course the former would be a better choice for you. You stand a much better chance working with tools you know.
Thanks all. This is just what I needed.

Peter
On Tue, 2007-11-27 at 14:18 -0600, Peter Koczan wrote:
> Thanks all. This is just what I needed.

All of these responses have cooked quite a few topics into one. Large databases might mean text warehouses, XML message stores, relational archives or fact-based business data warehouses.

The main thing is that TB-sized databases are performance critical, so how well PostgreSQL, or any other RDBMS, can handle them really depends on your workload.

Anyway, my reason for replying to this thread is that I'm planning changes for PostgreSQL 8.4+ that will allow us to get bigger and faster databases. If anybody has specific concerns then I'd like to hear them, so I can consider those things in the planning stages.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
Simon Riggs wrote:
> Anyway, my reason for replying to this thread is that I'm planning
> changes for PostgreSQL 8.4+ that will allow us to get bigger and
> faster databases. If anybody has specific concerns then I'd like to
> hear them, so I can consider those things in the planning stages.

It would be nice to do something about SELECTs, so that we can retrieve a row set from a huge table using an indexable criterion without falling back to a full scan.

In my opinion, by definition, a huge database will sooner or later have tables far bigger than the available RAM (and the same goes for their indexes). I think such queries need to be answered through indexes smart enough to be fast on disk.

Pablo
On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
> It would be nice to do something about SELECTs, so that we can retrieve
> a row set from a huge table using an indexable criterion without falling
> back to a full scan.

You mean: Be able to tell Postgres "Don't ever do a sequential scan of this table. It's silly. I would rather the query failed than have to wait for a sequential scan of the entire table."

Yes, that would be really useful, if you have huge tables in your database.

Matthew

--
Trying to write a program that can't be written is... well, it can be an
enormous amount of fun!                 -- Computer Science Lecturer
In response to Matthew <matthew@flymine.org>:
> You mean: Be able to tell Postgres "Don't ever do a sequential scan of
> this table. It's silly. I would rather the query failed than have to wait
> for a sequential scan of the entire table."
>
> Yes, that would be really useful, if you have huge tables in your
> database.

Is there something wrong with:

set enable_seqscan = off

?

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote:
> Is there something wrong with:
> set enable_seqscan = off
> ?

Nothing wrong with enable_seqscan = off, except that it is an all-or-nothing kind of thing... if you want the big table to never use a seqscan, but a medium table which is joined in should use one, then what do you do? And setting enable_seqscan = off doesn't actually mean the planner can't use a sequential scan for the query if no other alternative exists. In any case it doesn't mean "please throw an error if you can't do this without a sequential scan".

In fact an even more useful option would be to ask the planner to throw an error if the expected cost exceeds a certain threshold...

Cheers,
Csaba.
In response to Csaba Nagy <nagy@ecircle-ag.com>:
> Nothing wrong with enable_seqscan = off, except that it is an
> all-or-nothing kind of thing...

If that's true, then I have a bug report to file:

test=# set enable_seqscan=off;
SET
test=# show enable_seqscan;
 enable_seqscan
----------------
 off
(1 row)

test=# set enable_seqscan=on;
SET
test=# show enable_seqscan;
 enable_seqscan
----------------
 on
(1 row)

It looks to me to be session-alterable.

> if you want the big table to never use a seqscan, but a medium table
> which is joined in should use one, then what do you do? And setting
> enable_seqscan = off doesn't actually mean the planner can't use a
> sequential scan for the query if no other alternative exists. In any
> case it doesn't mean "please throw an error if you can't do this
> without a sequential scan".

True. It would still choose some other plan.

> In fact an even more useful option would be to ask the planner to throw
> an error if the expected cost exceeds a certain threshold...

Interesting concept.

--
Bill Moran
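(For what it's worth, the usual way to scope this setting to a single statement rather than a whole session is SET LOCAL inside a transaction, roughly as below. Table and column names are hypothetical, and this is still per-query rather than per-table, so Csaba's objection stands.)

-- Scope the setting to one statement only; it reverts automatically at
-- COMMIT/ROLLBACK. Table and column names are hypothetical.
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT *
  FROM big_table
 WHERE indexed_col = 42;
COMMIT;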
"Bill Moran" <wmoran@collaborativefusion.com> writes: > In response to Matthew <matthew@flymine.org>: > >> On Tue, 27 Nov 2007, Pablo Alcaraz wrote: >> > it would be nice to do something with selects so we can recover a rowset >> > on huge tables using a criteria with indexes without fall running a full >> > scan. >> >> You mean: Be able to tell Postgres "Don't ever do a sequential scan of >> this table. It's silly. I would rather the query failed than have to wait >> for a sequential scan of the entire table." >> >> Yes, that would be really useful, if you have huge tables in your >> database. > > Is there something wrong with: > set enable_seqscan = off > ? This does kind of the opposite of what you would actually want here. What you want is that if you give it a query which would be best satisfied by a sequential scan it should throw an error since you've obviously made an error in the query. What this does is it forces such a query to use an even *slower* method such as a large index scan. In cases where there isn't any other method it goes ahead and does the sequential scan anyways. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
On Wed, 28 Nov 2007, Csaba Nagy wrote:
> In fact an even more useful option would be to ask the planner to throw
> an error if the expected cost exceeds a certain threshold...

And even better if the option can be overridden for a specific transaction or connection. That way it can be set relatively low for normal operations, but when you need to run an expensive query you can change it for that query.

David Lang
In response to Gregory Stark <stark@enterprisedb.com>:
> This does kind of the opposite of what you would actually want here. What
> you want is that if you give it a query which would be best satisfied by
> a sequential scan, it should throw an error, since you've obviously made
> an error in the query.

Ah. I misunderstood the intent of the comment.

--
Bill Moran
On Wed, 2007-11-28 at 08:54 -0500, Bill Moran wrote:
> If that's true, then I have a bug report to file:
[snip]
> It looks to me to be session-alterable.

I didn't mean that it can't be set per session; I meant that it is not fine-grained enough to target the affected table: it affects _all_ tables in a query... and big tables are rarely alone in a query.

Cheers,
Csaba.
Matthew wrote:
> You mean: Be able to tell Postgres "Don't ever do a sequential scan of
> this table. It's silly. I would rather the query failed than have to wait
> for a sequential scan of the entire table."
>
> Yes, that would be really useful, if you have huge tables in your
> database.

Thanks. That would be nice too. I don't want Postgres to fall back so easily to a sequential scan when a field is indexed. Even if it concludes that the index is *huge* and does not fit in RAM, I want PostgreSQL to use the index anyway, because the table is *more than huge* and a sequential scan will take hours. I'll put some examples in a following mail.

Regards

Pablo
On Wed, 28 Nov 2007, Gregory Stark wrote:
> What this does instead is force such a query to use an even *slower*
> method, such as a large index scan. And in cases where there isn't any
> other method, it goes ahead and does the sequential scan anyway.

The query planner is not always right. I would like an option like "set enable_seqscan = off" but with the added effect of making Postgres return an error if there is no alternative to scanning the whole table, because I have obviously made a mistake setting up my indexes. I would effectively be telling Postgres "For this table, I *know* that a full table scan is dumb for all of my queries, even if the statistics say otherwise."

Of course, it would have to be slightly intelligent, because there are circumstances where a sequential scan doesn't necessarily mean a full table scan (for instance if there is a LIMIT), and where an index scan *does* mean a full table scan (for instance, selecting the whole table and ordering by an indexed field).

Matthew

--
Existence is a convenient concept to designate all of the files that an
executable program can potentially process.   -- Fortran77 standard
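(To make Matthew's two caveats concrete, against a hypothetical big_table with an index on indexed_col:)

-- A sequential scan that is NOT a full table scan: it can stop after 10 rows.
SELECT * FROM big_table LIMIT 10;

-- An index scan that IS effectively a full table scan: every heap page gets
-- visited, just in index order.
SELECT * FROM big_table ORDER BY indexed_col;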
Simon Riggs wrote:
> Anyway, my reason for replying to this thread is that I'm planning
> changes for PostgreSQL 8.4+ that will allow us to get bigger and
> faster databases. If anybody has specific concerns then I'd like to
> hear them, so I can consider those things in the planning stages.

I am dealing with a very huge database. I am not sure whether all of these things can be solved with the current Postgres version using some configuration parameters; I'll be happy to read your suggestions and ideas about these queries.

In my opinion there are queries that will need to be tuned for "huge databases" (huge database = a database whose relevant tables and indexes are, or will be, far bigger than all the RAM available):

-- example table
CREATE TABLE homes (
 id bigserial,
 name text,
 location text,
 bigint money_win,
 int zipcode;
);
CREATE INDEX money_win_idx ON homes(money_win);
CREATE INDEX zipcode_idx ON homes(zipcode);

SELECT max( id) from homes;
I think the information needed to get the max row quickly could be found using the pk index. Idem min( id).

SELECT max( id) from homes WHERE id > 8000000000;
Same, but useful to find out the same thing in partitioned tables (using id as the partition criterion). It would be nice if Postgres did not need the WHERE clause to realize it does not have to scan every single partition, but only the last one. Idem min( id).

SELECT * from homes WHERE money_win = 1300000000;
Postgres concludes too easily that it must do a sequential scan to solve this kind of query when the table (or the index) does not fit in memory and the number of rows returned is not near 1 (example: if the query returns 5000 rows). Same with filters like 'WHERE money_win >= xx' or 'WHERE money_win BETWEEN xx AND yy'. But I do not know if this behavior is because I got the postgresql configuration wrong or I missed something.

SELECT count( *) from homes;
It would be *cute* if Postgres stored this value and only recalculated it if it thinks the stored value is wrong (example: after an abnormal shutdown).

SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
It would be *very cute* if Postgres could store these values (or is that already there?) in the index or wherever, and only recalculate them if it thinks the stored values are wrong (example: after an abnormal shutdown).

In my opinion, partitioned tables will be the usual case in "huge databases", not the exception. It would be important (for me at least) that these queries can be solved quickly when they run against partitioned tables. Maybe one or more of these queries could be solved using some kind of optimization, but I have not discovered which ones (I'll be happy to read suggestions :D).
I am sure a lot of (or all of) these queries could be solved using some kind of triggers or sequences to store the needed information. But in general the information is already there right now (isn't it?) and the queries just need the server to look in the right place. A trigger or function written in one of the pgsql-supported languages will probably consume far more CPU to work out the same information that already exists, and we would need to do it inside transactions (more performance cost) just to be safe if the server has an abnormal shutdown.

Currently I have several 250 GB+ tables with billions of rows (small rows, like the homes table example). I partitioned them and distributed the partitions and indexes across different tablespaces, etc. I think "I did not need" as many partitions as I have right now (300+ for some tables and growing); I would only need enough partitions to spread the tables across different tablespaces. I created so many partitions because the performance with really big tables is not good enough for me when the programs run these kinds of queries, and the insert/update speed gets worse and worse over time. I expect a couple of tables to be 1 TB+ in a few months... buying more and more RAM is an option but not a solution, because eventually the database will be far bigger than the RAM available.

Last but not least, it would be *excellent* if this kind of optimization were possible without weird non-standard SQL statements. I think PostgreSQL would be better with huge databases if it could solve these kinds of queries in the fastest way by itself, or at least let us tell it to choose a different strategy. I could help it via postgresql.conf to activate/deactivate some behavior, or use some system table to specify the strategy I want for certain tables (like autovacuum does right now with per-table vacuum exceptions), or use non-standard DDL to define that strategy. But the important thing is that the programmers must be able to use standard SQL for selects/inserts/updates/deletes with 'where' and 'group by' clauses. In my case the programs are built with Java + JPA, so standard SQL (though not DDL) is important to keep things as they are. :)

Well, that's my 2 cents of feedback.

Regards

Pablo

PD: Sorry for my broken English.
Pablo Alcaraz wrote:
> In my opinion there are queries that will need to be tuned for "huge
> databases" (huge database = a database whose relevant tables and indexes
> are, or will be, far bigger than all the RAM available):
>
> -- example table
> CREATE TABLE homes (
>  id bigserial,
>  name text,
>  location text,
>  bigint money_win,
>  int zipcode;
> );
> CREATE INDEX money_win_idx ON homes(money_win);
> CREATE INDEX zipcode_idx ON homes(zipcode);

Your example does not work, so I created my own for your first item.

alvherre=# create table test (a int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
alvherre=# insert into test select * from generate_series(1, 100000);
INSERT 0 100000
alvherre=# analyze test;
ANALYZE

> SELECT max( id) from homes;
> I think the information needed to get the max row quickly could be found
> using the pk index. Idem min( id).

alvherre=# explain analyze select max(a) from test;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.054..0.057 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=1)
           ->  Index Scan Backward using test_pkey on test  (cost=0.00..3148.26 rows=100000 width=4) (actual time=0.034..0.034 rows=1 loops=1)
                 Filter: (a IS NOT NULL)
 Total runtime: 0.143 ms
(6 rows)

> SELECT max( id) from homes WHERE id > 8000000000;
> Same, but useful to find out the same thing in partitioned tables (using
> id as the partition criterion). It would be nice if Postgres did not need
> the WHERE clause to realize it does not have to scan every single
> partition, but only the last one. Idem min( id).

Yeah, this could be improved.

> SELECT * from homes WHERE money_win = 1300000000;
> Postgres concludes too easily that it must do a sequential scan to solve
> this kind of query when the table (or the index) does not fit in memory
> and the number of rows returned is not near 1 (example: if the query
> returns 5000 rows). Same with filters like 'WHERE money_win >= xx' or
> 'WHERE money_win BETWEEN xx AND yy'.

There are thresholds for switching from an index scan to a seqscan. It depends on the selectivity of the clauses.

> SELECT count( *) from homes;
> It would be *cute* if Postgres stored this value and only recalculated it
> if it thinks the stored value is wrong (example: after an abnormal
> shutdown).

This is not as easy as you put it, for reasons that have been discussed at length. I'll only say that there are workarounds to make counting quick.

> SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
> It would be *very cute* if Postgres could store these values (or is that
> already there?) in the index or wherever, and only recalculate them if it
> thinks the stored values are wrong (example: after an abnormal shutdown).

Same as above.

> Last but not least, it would be *excellent* if this kind of optimization
> were possible without weird non-standard SQL statements.

Right. If you can afford to sponsor development, it could make them a reality sooner.
-- Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4" "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
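(One of the workarounds Alvaro alludes to for slow count(*) is a trigger-maintained counter table. A rough, untested sketch against Pablo's "homes" example might look like the following; the auxiliary table, function and trigger names are invented, and it assumes a working version of the homes table.)

-- Untested sketch: keep a running row count for "homes" in a side table.
CREATE TABLE homes_rowcount (n bigint NOT NULL);
INSERT INTO homes_rowcount SELECT count(*) FROM homes;   -- seed it once

CREATE OR REPLACE FUNCTION homes_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE homes_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE homes_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;   -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER homes_count
    AFTER INSERT OR DELETE ON homes
    FOR EACH ROW EXECUTE PROCEDURE homes_count_trig();

-- "SELECT n FROM homes_rowcount" now replaces SELECT count(*) FROM homes.
-- Caveat: the single counter row serializes concurrent writers, which can
-- hurt on a heavily-updated table.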
On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:
> In fact an even more useful option would be to ask the planner to throw
> an error if the expected cost exceeds a certain threshold...

Well, I've suggested it before:

statement_cost_limit on pgsql-hackers, 1 March 2006

Would people like me to re-write and resubmit this patch for 8.4?

Tom's previous concerns were along the lines of "How would you know what to set it to?", given that the planner costs are mostly arbitrary numbers.

Any bright ideas, or is it that we want it and we don't care about the possible difficulties?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Tue, 2007-11-27 at 18:06 -0500, Pablo Alcaraz wrote:
> It would be nice to do something about SELECTs, so that we can retrieve
> a row set from a huge table using an indexable criterion without falling
> back to a full scan.
>
> In my opinion, by definition, a huge database will sooner or later have
> tables far bigger than the available RAM (and the same goes for their
> indexes). I think such queries need to be answered through indexes smart
> enough to be fast on disk.

OK, I agree with this one.

I'd thought that index-only plans were only for OLTP, but now I see they can also make a big difference with DW queries. So I'm very interested in this area now.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Wed, 28 Nov 2007, Simon Riggs wrote:
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?

Yes please. The more options, the better.

> Tom's previous concerns were along the lines of "How would you know what
> to set it to?", given that the planner costs are mostly arbitrary numbers.
>
> Any bright ideas, or is it that we want it and we don't care about the
> possible difficulties?

I think this is something that the average person should just knuckle down and work out. At the moment, on my work's system, we call EXPLAIN before queries to find out if they will take too long. This feature would improve performance by saving us from passing every query through the planner twice.

Matthew

--
An ant doesn't have a lot of processing power available to it. I'm not
trying to be speciesist - I wouldn't want to detract you from such a
wonderful creature, but, well, there isn't a lot there, is there?
                                       -- Computer Science Lecturer
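(Matthew's EXPLAIN-before-running approach can be wrapped in a small plpgsql helper that refuses to execute anything whose estimated cost exceeds a threshold, which is more or less a hand-rolled statement_cost_limit. The sketch below is illustrative and untested; the function name, regexp and limit are invented, not anything proposed on the list.)

-- Illustrative sketch: refuse to run a query whose estimated total cost
-- exceeds a limit, by parsing the first line of EXPLAIN output.
CREATE OR REPLACE FUNCTION run_if_cheap(q text, cost_limit numeric)
RETURNS void AS $$
DECLARE
    line     record;
    est_cost numeric;
BEGIN
    FOR line IN EXECUTE 'EXPLAIN ' || q LOOP
        -- Top plan line looks like: Seq Scan on t  (cost=0.00..123.45 rows=... width=...)
        est_cost := substring(line."QUERY PLAN" from E'cost=[0-9.]+\\.\\.([0-9.]+)');
        EXIT WHEN est_cost IS NOT NULL;
    END LOOP;
    IF est_cost > cost_limit THEN
        RAISE EXCEPTION 'estimated cost % exceeds limit %', est_cost, cost_limit;
    END IF;
    EXECUTE q;
END;
$$ LANGUAGE plpgsql;

-- SELECT run_if_cheap('DELETE FROM big_table WHERE indexed_col = 42', 1000000);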
On Wed, 28 Nov 2007, Simon Riggs wrote:
> Tom's previous concerns were along the lines of "How would you know what
> to set it to?", given that the planner costs are mostly arbitrary numbers.

Arbitrary numbers are fine if they are relatively consistent with each other.

Will a plan with an estimated cost of 1,000,000 take approximately 100 times as long as one with a cost of 10,000? Or, more importantly, will a plan with an estimated cost of 2,000 reliably take longer than one with an estimated cost of 1,000?

David Lang
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote: > >> In fact an even more useful option would be to ask the planner to throw >> error if the expected cost exceeds a certain threshold... > > Well, I've suggested it before: > > statement_cost_limit on pgsql-hackers, 1 March 2006 > > Would people like me to re-write and resubmit this patch for 8.4? > > Tom's previous concerns were along the lines of "How would know what to > set it to?", given that the planner costs are mostly arbitrary numbers. Hm, that's only kind of true. Since 8.mumble seq_page_cost is itself configurable meaning you can adjust the base unit and calibrate all the parameters to be time in whatever unit you choose. But even assuming you haven't so adjusted seq_page_cost and all the other parameters to match the numbers aren't entirely arbitrary. They represent time in units of "however long a single sequential page read takes". Obviously few people know how long such a page read takes but surely you would just run a few sequential reads of large tables and set the limit to some multiple of whatever you find. This isn't going to precise to the level of being able to avoid executing any query which will take over 1000ms. But it is going to be able to catch unconstrained cross joins or large sequential scans or such. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Tom's previous concerns were along the lines of "How would you know what
>> to set it to?", given that the planner costs are mostly arbitrary numbers.

> Hm, that's only kind of true.

The units are not the problem. The problem is that you are staking non-failure of your application on the planner's estimates being pretty well in line with reality. Not merely in line enough that it picks a reasonably cheap plan, but in line enough that if it thinks plan A is 10x more expensive than plan B, then the actual ratio is indeed somewhere near 10.

Given that this list spends all day every day discussing cases where the planner is wrong, I'd have to think that that's a bet I wouldn't take.

You could probably avoid this risk by setting the cutoff at something like 100 or 1000 times what you really want to tolerate, but how useful is it then?

			regards, tom lane
On Thu, 2007-11-29 at 10:45 -0500, Tom Lane wrote:
> The units are not the problem. The problem is that you are staking
> non-failure of your application on the planner's estimates being
> pretty well in line with reality.
>
> Given that this list spends all day every day discussing cases where the
> planner is wrong, I'd have to think that that's a bet I wouldn't take.

I think you have a point, but the alternative is often much worse.

If an SQL statement fails because of too high a cost, we can investigate the problem and re-submit. If a website slows down because somebody allowed a very large query to execute, then everybody is affected, not just the person who ran the bad query. Either way the guy that ran the query loses, but without constraints in place one guy can also kill everybody else.

> You could probably avoid this risk by setting the cutoff at something
> like 100 or 1000 times what you really want to tolerate, but how
> useful is it then?

Still fairly useful, as long as we understand it's a blunt instrument.

If the whole performance of your system depends upon indexed access, then rogue queries can have disastrous, unpredictable consequences. Many sites construct their SQL dynamically, so a mistake in a seldom-used code path can let killer queries through. Even the best DBAs have been known to make mistakes.

e.g. An 80GB table has 8 million blocks in it.
- So putting statement_cost_limit = 1 million would allow some fairly large queries but prevent anything that did a SeqScan (or worse).
- Setting it to 10 million is going to prevent things like sorting the whole table without a LIMIT.
- Setting it to 100 million is going to prevent unconstrained product joins etc.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
* Simon Riggs (simon@2ndquadrant.com) wrote:
> I think you have a point, but the alternative is often much worse.

I'm not convinced you've outlined the consequences of implementing a plan cost limit sufficiently.

> If an SQL statement fails because of too high a cost, we can investigate
> the problem and re-submit. If a website slows down because somebody
> allowed a very large query to execute, then everybody is affected, not
> just the person who ran the bad query. Either way the guy that ran the
> query loses, but without constraints in place one guy can also kill
> everybody else.

It's entirely possible (likely, even) that most of the users accessing a webpage are using the same queries and the same tables. If the estimates for those tables end up changing enough that PG pushes the plan cost above the plan cost limit, then *all* of the users would be affected. The plan cost isn't going to change for just one user if it's the same query that a bunch of users are running. I'm not sure that handling the true 'rogue query' case with this limit would actually be a net improvement overall in a website-based situation.

I could see it being useful to have a 'notice_on_high_cost_query' variable, where someone working in a data warehouse situation would get a notice if the query he's hand-crafting has a very high cost (in which case he could ctrl-c it if he thinks something is wrong, rather than waiting 5 hours before realizing he forgot a join clause), but the website with the one rogue query run by one user seems a stretch.

Thanks,

Stephen
On Thu, 2007-11-29 at 10:45 -0500, Tom Lane wrote:
> Given that this list spends all day every day discussing cases where the
> planner is wrong, I'd have to think that that's a bet I wouldn't take.
>
> You could probably avoid this risk by setting the cutoff at something
> like 100 or 1000 times what you really want to tolerate, but how
> useful is it then?

It would still be useful in the sense that if the planner is making wrong estimates, you must correct it somehow... raise the statistics target, rewrite the query or do other tweaking; you should do something. An error is sometimes better than gradually decreasing performance caused by, for example, a statistics target that is too low. So if the error is thrown because of a wrong estimate, it is still a valid error, raising a signal that the DBA has to do something about it.

It's still true that if the planner estimates too low, it will raise no error and will take the resources. But that's just what we have now, so it wouldn't be a regression of any kind...

Cheers,
Csaba.
Simon Riggs wrote:
> Well, I've suggested it before:
>
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?
>
> Any bright ideas, or is it that we want it and we don't care about the
> possible difficulties?

Knowing how to set it is a problem - but a possibly bigger one is a monster query crippling your DW system - so I'd say let's have it.

Cheers

Mark
Simon Riggs wrote:
> I'd thought that index-only plans were only for OLTP, but now I see they
> can also make a big difference with DW queries. So I'm very interested
> in this area now.

If that's true, then you want to get behind the work Gokulakannan Somasundaram (http://archives.postgresql.org/pgsql-hackers/2007-10/msg00220.php) has done in relation to thick indexes. I would have thought that concept particularly useful in DW. Only having to scan indexes on a number of join tables would be a huge win for some of these types of queries.

My tiny point of view would say that is a much better investment than setting up the proposed parameter. I can see the use of the parameter though. Most of the complaints about indexes carrying visibility information are about update/delete contention. I would expect that in a DW those things aren't in the critical path like they are in many other applications. Especially with partitioning, where previous partitions don't get many updates, I would think there could be great benefit. I would think that many of Pablo's requests up-thread would get a significant performance benefit from this type of index.

But as I mentioned at the start, that's my tiny point of view and I certainly don't have the resources to direct what gets looked at for PostgreSQL.

Regards

Russell Smith
On Fri, 2007-11-30 at 17:41 +1100, Russell Smith wrote:
> If that's true, then you want to get behind the work Gokulakannan
> Somasundaram
> (http://archives.postgresql.org/pgsql-hackers/2007-10/msg00220.php) has
> done in relation to thick indexes. I would have thought that concept
> particularly useful in DW. Only having to scan indexes on a number of
> join tables would be a huge win for some of these types of queries.

Hmm, well I proposed that in Jan/Feb, but I'm sure others have also. I don't think it's practical to add visibility information to *all* indexes, but I like Heikki's Visibility Map proposal much better.

> My tiny point of view would say that is a much better investment than
> setting up the proposed parameter.

They are different things entirely, with dissimilar dev costs also. We can have both.

> I can see the use of the parameter though.

Good

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On 11/29/07, Gregory Stark <stark@enterprisedb.com> wrote:
> Obviously few people know how long such a page read takes, but surely you
> would just run a few sequential reads of large tables and set the limit
> to some multiple of whatever you find.
>
> This isn't going to be precise to the level of being able to avoid
> executing any query which will take over 1000 ms. But it is going to be
> able to catch unconstrained cross joins, large sequential scans and the
> like.

Isn't that what statement_timeout is for? Since this is entirely based on estimates, using arbitrary fuzzy numbers for this seems fine to me; precision isn't really the goal.
> Isn't that what statement_timeout is for? Since this is entirely based
> on estimates, using arbitrary fuzzy numbers for this seems fine to me;
> precision isn't really the goal.

There's an important difference from statement_timeout: this proposal would avoid consuming any resources at all if the planner estimates the query can't be executed in a reasonable time, whereas statement_timeout will let a bad query run for at least statement_timeout before killing it...

Cheers,
Csaba.
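(For reference, the existing facility being compared against is set like this; the values and role name are only examples.)

-- statement_timeout is what exists today: it kills the query only after it
-- has already been running (and consuming resources) for the given time.
-- The value is in milliseconds; 60000 and "report_user" are just examples.
SET statement_timeout = 60000;                          -- this session only
ALTER USER report_user SET statement_timeout = 60000;   -- default for one role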
Hi Peter,

If you run into a scaling issue with PG (and at those scales, 1 TB+, you will), you can deploy Greenplum DB, which is PG 8.2.5 compatible. A large internet company (look for press soon) is in production with a 150 TB database on a system capable of doing 400 TB, and we have others in production at 60 TB, 40 TB, etc. We can provide references when needed - note that we had 20 successful customer references supporting Gartner's magic quadrant report on data warehouses, which put Greenplum in the "upper visionary" area of the magic quadrant - which only happens if your customers can scale (see this: http://www.esj.com/business_intelligence/article.aspx?EditorialsID=8712).

In other words, no matter what happens you'll be able to scale up with your Postgres strategy.

- Luke

On 11/26/07 10:44 AM, "Pablo Alcaraz" <pabloa@laotraesquina.com.ar> wrote:
> Currently I have clients running huge PostgreSQL databases and they are
> happy. In one client's database the biggest table is 237 GB+ (just one
> table!) and PostgreSQL runs the database without problems using
> partitioning, triggers and rules (on PostgreSQL 8.2.5).
On Thursday 29 November 2007 11:14, Simon Riggs wrote:
> If the whole performance of your system depends upon indexed access, then
> rogue queries can have disastrous, unpredictable consequences. Many
> sites construct their SQL dynamically, so a mistake in a seldom-used
> code path can let killer queries through. Even the best DBAs have been
> known to make mistakes.

If the whole performance of your system depends upon indexed access, then maybe you need a database that gives you a way to force index access at the query level?

> e.g. An 80GB table has 8 million blocks in it.
> - So putting statement_cost_limit = 1 million would allow some fairly
> large queries but prevent anything that did a SeqScan (or worse).
> - Setting it to 10 million is going to prevent things like sorting the
> whole table without a LIMIT.
> - Setting it to 100 million is going to prevent unconstrained product
> joins etc.

I think you're completely overlooking the effect disk latency has on query times. We run queries all the time that can vary from 4 hours to 12 hours based solely on the amount of concurrent load on the system, even though they always plan with the same cost.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Nov 28, 2007, at 7:27 AM, Bill Moran wrote:
> Is there something wrong with:
> set enable_seqscan = off

Note that in cases of very heavy skew, that won't work. It only adds 10M to the cost estimate for a seqscan, and it's definitely possible to have an index scan that looks even more expensive.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Robert,

On Wed, 2007-12-05 at 15:07 -0500, Robert Treat wrote:
> If the whole performance of your system depends upon indexed access, then
> maybe you need a database that gives you a way to force index access at
> the query level?

That sounds like a request for hints, which is OT here, ISTM.

The issue is that if somebody issues a "large query" then it will be a problem whichever plan the query takes. Forcing index scans can make a plan more expensive than a seq scan in many cases.

> I think you're completely overlooking the effect disk latency has on query
> times. We run queries all the time that can vary from 4 hours to 12 hours
> based solely on the amount of concurrent load on the system, even though
> they always plan with the same cost.

Not at all. If we had statement_cost_limit then it would be applied after planning and before execution begins. The limit would be based upon the planner's estimate, not the likely actual execution time.

So yes, a query may vary in execution time by a large factor as you suggest, and it would be difficult to set the proposed parameter accurately. However, the same is also true of statement_timeout, which we currently support, so I don't see this point as a blocker.

Which leaves us at the burning question: would you use such a facility, or would the difficulty in setting it exactly prevent you from using it for real?

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
On Thu, Dec 06, 2007 at 09:38:16AM +0000, Simon Riggs wrote:
> The issue is that if somebody issues a "large query" then it will be a
> problem whichever plan the query takes. Forcing index scans can make a
> plan more expensive than a seq scan in many cases.

OTOH, the planner can really screw up queries on really large databases. IIRC, the planner can use things like unique constraints to get some idea, e.g., of how many rows will result from a join. Unfortunately, the planner can't apply those techniques to certain constructs common in really large DBs (e.g., partitioned tables--how do you do a unique constraint on a partitioned table?). I've got some queries that the planner thinks will return on the order of 10^30 rows for that sort of reason. In practice, the query may return 10^3 rows, and the difference between the seq scan and the index scan is the difference between a query that takes a few seconds and a query that I will never run to completion.

I know the goal would be to make the planner understand those queries better, but for now the answer seems to be to craft the queries very carefully and run EXPLAIN first, making sure you see index scans in the right places.

Mike Stone
Michael Stone <mstone+postgres@mathom.us> writes: > OTOH, the planner can really screw up queries on really large databases. > IIRC, the planner can use things like unique constraints to get some > idea, e.g., of how many rows will result from a join. Unfortunately, > the planner can't apply those techniques to certain constructs common in > really large db's (e.g., partitioned tables--how do you do a unique > constraint on a partitioned table?) I've got some queries that the > planner thinks will return on the order of 10^30 rows for that sort of > reason. In practice, the query may return 10^3 rows, and the difference > between the seq scan and the index scan is the difference between a > query that takes a few seconds and a query that I will never run to > completion. I know the goal would be to make the planner understand > those queries better, Indeed, and if you've got examples where it's that far off, you should report them. regards, tom lane
On Thu, 6 Dec 2007, Tom Lane wrote: > Indeed, and if you've got examples where it's that far off, you should > report them. Oo, oo, I have one! So, this query bit us a while back. We had two tables being joined together in a query by a key column. The key column was an integer, and for the first table it had a range from zero to a bazillion. For the second table, it had a range from half a bazillion to half a bazillion plus a hundred. The first table had a bazillion rows, and the second table had only about a hundred. Both tables were well ordered on the key column. Both tables had an index on the key column. So, our query was like this: SELECT * FROM table1, table2 WHERE table1.key = table2.key LIMIT 1 ... because we wanted to find out if there were *any* hits between the two tables. The query took hours. To understand why, let's look at the query without the LIMIT. For this query, Postgres would perform a nested loop, iterating over all rows in the small table, and doing a hundred index lookups in the big table. This completed very quickly. However, adding the LIMIT meant that suddenly a merge join was very attractive to the planner, as it estimated the first row to be returned within milliseconds, without needing to sort either table. The problem is that Postgres didn't know that the first hit in the big table would be about half-way through, after doing an index sequential scan for half a bazillion rows. We fixed this query by changing it to: SELECT * FROM table1, table2 WHERE table1.key = table2.key AND table1.key >= (SELECT MIN(key) FROM table2) AND table1.key <= (SELECT MAX(key) FROM table2) ... which artificially limited the index sequential scan of table1 (the big table) to start from the earliest possible hit and only continue to the last possible hit. This query completed quickly, as the min and max could be answered quickly by the indexes. Still, it's a pity Postgres couldn't work that out for itself, having all the information present in its statistics and indexes. AIUI the planner doesn't peek into indexes - maybe it should. Matthew -- In the beginning was the word, and the word was unsigned, and the main() {} was without form and void...
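To make the shape of the problem concrete, here is a self-contained sketch of the scenario described above (table and index names are illustrative, and the data sizes are obviously scaled down):

  CREATE TABLE table1 (key integer, payload text);   -- the "bazillion"-row table
  CREATE TABLE table2 (key integer, payload text);   -- the ~100-row table
  CREATE INDEX table1_key_idx ON table1 (key);
  CREATE INDEX table2_key_idx ON table2 (key);

  -- the problematic form: with LIMIT 1 the planner may pick a merge join,
  -- without accounting for how far into table1's key range it must scan
  -- before reaching table2's first key
  SELECT * FROM table1, table2 WHERE table1.key = table2.key LIMIT 1;

  -- the workaround from above: clamp table1's scan to table2's key range
  SELECT * FROM table1, table2
  WHERE table1.key = table2.key
    AND table1.key >= (SELECT min(key) FROM table2)
    AND table1.key <= (SELECT max(key) FROM table2)
  LIMIT 1;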
Matthew <matthew@flymine.org> writes: > ... For this query, Postgres would perform a nested loop, > iterating over all rows in the small table, and doing a hundred index > lookups in the big table. This completed very quickly. However, adding the > LIMIT meant that suddenly a merge join was very attractive to the planner, > as it estimated the first row to be returned within milliseconds, without > needing to sort either table. > The problem is that Postgres didn't know that the first hit in the big > table would be about half-way through, after doing a index sequential scan > for half a bazillion rows. Hmm. IIRC, there are smarts in there about whether a mergejoin can terminate early because of disparate ranges of the two join variables. Seems like it should be straightforward to fix it to also consider whether the time-to-return-first-row will be bloated because of disparate ranges. I'll take a look --- but it's probably too late to consider this for 8.3. regards, tom lane
On Thu, 6 Dec 2007, Tom Lane wrote: > Matthew <matthew@flymine.org> writes: > > ... For this query, Postgres would perform a nested loop, > > iterating over all rows in the small table, and doing a hundred index > > lookups in the big table. This completed very quickly. However, adding the > > LIMIT meant that suddenly a merge join was very attractive to the planner, > > as it estimated the first row to be returned within milliseconds, without > > needing to sort either table. > > > The problem is that Postgres didn't know that the first hit in the big > > table would be about half-way through, after doing a index sequential scan > > for half a bazillion rows. > > Hmm. IIRC, there are smarts in there about whether a mergejoin can > terminate early because of disparate ranges of the two join variables. > Seems like it should be straightforward to fix it to also consider > whether the time-to-return-first-row will be bloated because of > disparate ranges. I'll take a look --- but it's probably too late > to consider this for 8.3. Very cool. Would that be a planner cost estimate fix (so it avoids the merge join), or a query execution fix (so it does the merge join on the table subset)? Matthew -- I've run DOOM more in the last few days than I have the last few months. I just love debugging ;-) -- Linus Torvalds
Matthew <matthew@flymine.org> writes: > On Thu, 6 Dec 2007, Tom Lane wrote: >> Hmm. IIRC, there are smarts in there about whether a mergejoin can >> terminate early because of disparate ranges of the two join variables. > Very cool. Would that be a planner cost estimate fix (so it avoids the > merge join), or a query execution fix (so it does the merge join on the > table subset)? Cost estimate fix. Basically what I'm thinking is that the startup cost attributed to a mergejoin ought to account for any rows that have to be skipped over before we reach the first join pair. In general this is hard to estimate, but for mergejoin it can be estimated using the same type of logic we already use at the other end. After looking at the code a bit, I'm realizing that there's actually a bug in there as of 8.3: mergejoinscansel() is expected to be able to derive numbers for either direction of scan, but if it's asked to compute numbers for a DESC-order scan, it looks for a pg_stats entry sorted with '>', which isn't gonna be there. It needs to know to look for an '<' histogram and switch the min/max. So the lack of symmetry here is causing an actual bug in logic that already exists. That makes the case for fixing this now a bit stronger ... regards, tom lane
On Thu, Dec 06, 2007 at 11:13:18AM -0500, Tom Lane wrote: >Indeed, and if you've got examples where it's that far off, you should >report them. Yeah, the trick is to get it to a digestible test case. The basic scenario (there are more tables & columns in the actual case) is a set of tables partitioned by date with a number of columns in one table referencing rows in the others:

Table A (~5bn rows / 100's of partitions)
  time
  Bkey1
  Ckey1
  Bkey2
  Ckey2

Table B (~1bn rows / 100's of partitions)
  Bkey
  Bval

Table C (~.5bn rows / 100's of partitions)
  Ckey
  Cval

Bkey and Ckey are unique, but the planner doesn't know that. Mike Stone
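A minimal sketch of why the planner can't be told about that uniqueness with 8.x-style inheritance partitioning (all names and ranges are illustrative, and the real tables are partitioned by date rather than by key):

  CREATE TABLE b (bkey bigint, bval text);
  CREATE TABLE b_part1 (CHECK (bkey >=       0 AND bkey < 1000000)) INHERITS (b);
  CREATE TABLE b_part2 (CHECK (bkey >= 1000000 AND bkey < 2000000)) INHERITS (b);

  -- a unique constraint can only be declared per child table, so it only
  -- enforces (and advertises) uniqueness within one partition:
  ALTER TABLE b_part1 ADD UNIQUE (bkey);
  ALTER TABLE b_part2 ADD UNIQUE (bkey);
  -- there is no way to declare "bkey is unique across all of b", which is
  -- the fact the join selectivity estimate would need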
Tom Lane wrote: > Michael Stone <mstone+postgres@mathom.us> writes: >> OTOH, the planner can really screw up queries on really large databases. >> ... I've got some queries that the >> planner thinks will return on the order of 10^30 rows for that sort of >> reason. In practice, the query may return 10^3 rows.... > > Indeed, and if you've got examples where it's that far off, you should > report them. If I read this right, I've got quite a few cases where the planner expects 1 row but gets over 2000. And within the plan, it looks like there's a step where it expects 511 rows and gets 2390779 which seems to be off by a factor of 4600x. Also shown below it seems that if I use "OFFSET 0" as a "hint" I can force a much (10x) better plan. I wonder if there's room for a pgfoundry project for a patch set that lets us use more hints than OFFSET 0. Ron

logs=# analyze;
ANALYZE
logs=# explain analyze select * from fact natural join d_ref natural join d_uag
       where ref_host = 'download.com.com'
         and ref_path = '/[path_removed].html'
         and useragent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)';
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=2827.72..398919.05 rows=1 width=242) (actual time=69175.963..141550.628 rows=2474 loops=1)
   Hash Cond: (fact.ref_id = d_ref.ref_id)
   ->  Nested Loop  (cost=2819.88..398908.65 rows=511 width=119) (actual time=3094.740..139361.235 rows=2390779 loops=1)
         ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=45.937..45.948 rows=1 loops=1)
               Index Cond: ((useragent)::text = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'::text)
         ->  Bitmap Heap Scan on fact  (cost=2819.88..396449.49 rows=196223 width=32) (actual time=3048.770..135653.875 rows=2390779 loops=1)
               Recheck Cond: (fact.uag_id = d_uag.uag_id)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=1713.148..1713.148 rows=2390779 loops=1)
                     Index Cond: (fact.uag_id = d_uag.uag_id)
   ->  Hash  (cost=7.83..7.83 rows=1 width=127) (actual time=62.841..62.841 rows=2 loops=1)
         ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=62.813..62.823 rows=2 loops=1)
               Index Cond: (((ref_path)::text = '[path_removed].html'::text) AND ((ref_host)::text = 'download.com.com'::text))
 Total runtime: 141563.733 ms
(13 rows)

############ using "offset 0" to force a better plan.

logs=# explain analyze select * from fact natural join
       (select * from d_ref natural join d_uag
        where ref_host = 'download.com.com'
          and ref_path = '/[path_removed].html'
          and useragent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'
        offset 0) as a;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=6465.12..7575.91 rows=367 width=2096) (actual time=2659.251..14703.343 rows=2474 loops=1)
   ->  Limit  (cost=0.00..14.22 rows=1 width=218) (actual time=114.968..115.140 rows=2 loops=1)
         ->  Nested Loop  (cost=0.00..14.22 rows=1 width=218) (actual time=114.964..115.127 rows=2 loops=1)
               ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=75.891..75.900 rows=2 loops=1)
                     Index Cond: (((ref_path)::text = '[path_removed].html'::text) AND ((ref_host)::text = 'download.com.com'::text))
               ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=19.582..19.597 rows=1 loops=2)
                     Index Cond: ((useragent)::text = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'::text)
   ->  Bitmap Heap Scan on fact  (cost=6465.12..7556.18 rows=367 width=32) (actual time=2240.090..7288.145 rows=1237 loops=2)
         Recheck Cond: ((fact.uag_id = a.uag_id) AND (fact.ref_id = a.ref_id))
         ->  BitmapAnd  (cost=6465.12..6465.12 rows=367 width=0) (actual time=2221.539..2221.539 rows=0 loops=2)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=1633.032..1633.032 rows=2390779 loops=2)
                     Index Cond: (fact.uag_id = a.uag_id)
               ->  Bitmap Index Scan on i__fact__ref_id  (cost=0.00..3581.50 rows=253913 width=0) (actual time=150.614..150.614 rows=77306 loops=2)
                     Index Cond: (fact.ref_id = a.ref_id)
 Total runtime: 14710.870 ms
(15 rows)
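The general shape of the trick, for anyone wanting to apply it elsewhere: a LIMIT/OFFSET clause keeps the subquery from being flattened into the outer query, so the dimension lookups are planned and executed on their own before the fact table is touched. A sketch with illustrative names and filter values:

  SELECT *
  FROM fact
  NATURAL JOIN (SELECT *
                FROM d_dim1 NATURAL JOIN d_dim2
                WHERE d_dim1.val = 'filter one'
                  AND d_dim2.val = 'filter two'
                OFFSET 0) AS dims;   -- OFFSET 0 acts as an optimization fence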
I wrote: > Hmm. IIRC, there are smarts in there about whether a mergejoin can > terminate early because of disparate ranges of the two join variables. > Seems like it should be straightforward to fix it to also consider > whether the time-to-return-first-row will be bloated because of > disparate ranges. I've posted a proposed patch for this: http://archives.postgresql.org/pgsql-patches/2007-12/msg00025.php regards, tom lane
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > Tom Lane wrote: >> There's something fishy about this --- given that that plan has a lower >> cost estimate, it should've picked it without any artificial >> constraints. > I think the reason it's not picking it was discussed back in this thread > too. > http://archives.postgresql.org/pgsql-performance/2005-03/msg00675.php > http://archives.postgresql.org/pgsql-performance/2005-03/msg00684.php > My offset 0 is forcing the outer join. > [Edit: Ugh - meant cartesian join - which helps this kind of query.] Ah; I missed the fact that the two relations you want to join first don't have any connecting WHERE clause. The concern I mentioned in the above thread was basically that I don't want the planner to go off chasing Cartesian join paths in general --- they're usually useless and would result in an exponential explosion in the number of join paths considered in many-table queries. However, in this case the reason that the Cartesian join might be relevant is that both of them are needed in order to form an inner indexscan on the big table. I wonder if we could drive consideration of the Cartesian join off of noticing that. It'd take some rejiggering around best_inner_indexscan(), or somewhere in that general vicinity. Way too late for 8.3, but something to think about for next time. regards, tom lane
Tom Lane wrote: > Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > >> Also shown below it seems that if I use "OFFSET 0" as a "hint" >> I can force a much (10x) better plan. I wonder if there's room for >> a pgfoundry project for a patch set that lets us use more hints >> than OFFSET 0. >> > There's something fishy about this --- given that that plan has a lower > cost estimate, it should've picked it without any artificial > constraints. I think the reason it's not picking it was discussed back in this thread too. http://archives.postgresql.org/pgsql-performance/2005-03/msg00675.php http://archives.postgresql.org/pgsql-performance/2005-03/msg00684.php My offset 0 is forcing the outer join. [Edit: Ugh - meant cartesian join - which helps this kind of query.] > What PG version are you using?

logs=# select version();
                                                    version
----------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.2.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
(1 row)

> Do you perhaps have a low setting for join_collapse_limit?

logs=# show join_collapse_limit;
 join_collapse_limit
---------------------
 8
(1 row)

Actually, IIRC back in that other thread, "set join_collapse_limit =1;" helped http://archives.postgresql.org/pgsql-performance/2005-03/msg00663.php
Tom Lane wrote: > Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: >> Tom Lane wrote: >>> ...given that that plan has a lower cost estimate, it >>> should've picked it without any artificial constraints. > >> I think the reason it's not picking it was discussed back... >> http://archives.postgresql.org/pgsql-performance/2005-03/msg00675.php > ... > > The concern I mentioned in the above thread was basically that I don't > want the planner to go off chasing Cartesian join paths in general... > However, in this case the reason that the Cartesian join might be > relevant is that both of them are needed in order to form an inner > indexscan on the big table.... Interesting.... I think Simon mentioned last time that this type of query is quite common for standard star schema data warehouses. And it seems to me the Cartesian join on the dimension tables is often pretty harmless since each dimension table would often return just 1 row; and the size of the fact table is such that it's useful to touch it as little as possible. > Way too late for 8.3, but something to think about for next time. No problem.. we've been working around it since that last thread in early '05 with early 8.0, IIRC. :-) Thanks to the excellent postgres hints system ("offset 0" and "set join_collapse_limit=1") we can get the plans we want pretty easily. :-)
On Thursday 06 December 2007 04:38, Simon Riggs wrote: > Robert, > > On Wed, 2007-12-05 at 15:07 -0500, Robert Treat wrote: > > If the whole performance of your system depends upon indexed access, then > > maybe you need a database that gives you a way to force index access at > > the query level? > > That sounds like a request for hints, which is OT here, ISTM. > If you want to eat peas, and someone suggests you use a knife, can I only argue the validity of using a knife? I'd rather just recommend a spoon. > > I think you're completely overlooking the effect disk latency has on > > query times. We run queries all the time that can vary from 4 hours to > > 12 hours based solely on the amount of concurrent load on the > > system, even though they always plan with the same cost. > > Not at all. If we had statement_cost_limit then it would be applied > after planning and before execution begins. The limit would be based > upon the planner's estimate, not the likely actual execution time. > This is nice, but it doesn't prevent "slow queries" reliably (which seemed to be in the original complaints), since query time cannot be directly traced back to statement cost. > So yes, a query may vary in execution time by a large factor as you > suggest, and it would be difficult to set the proposed parameter > accurately. However, the same is also true of statement_timeout, which > we currently support, so I don't see this point as a blocker. > > Which leaves us at the burning question: Would you use such a facility, > or would the difficulty in setting it exactly prevent you from using it > for real? I'm not sure. My personal instincts are that the solution is too fuzzy for me to rely on, and if it isn't reliable, it's not a good solution. If you look at all of the things people seem to think this will solve, I think I can raise an alternative option that would be a more definitive solution:

"prevent queries from taking longer than x" -> statement_timeout
"prevent planner from switching to bad plan" -> hint system
"prevent query from consuming too many resources" -> true resource restrictions at the database level

I'm not so much against the idea of a statement cost limit, but I think we need to realize that it does not really solve as many problems as people think, that in cases where it will help it will often do so poorly, and that there are probably better solutions available to those problems. Of course if you back me into a corner I'll agree a poor solution is better than no solution, so... -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
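Of the three, only the timeout exists today; it can at least be scoped so that ad-hoc users are constrained without touching batch jobs. A minimal sketch (role names and values are illustrative):

  ALTER ROLE adhoc_reporting SET statement_timeout = 300000;  -- 5 minutes, in ms
  ALTER ROLE batch_etl SET statement_timeout = 0;             -- 0 disables the timeout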
Tom Lane wrote: > Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: >> Tom Lane wrote: >>> There's something fishy about this --- given that that plan has a lower >>> cost estimate, it should've picked it without any artificial >>> constraints. One final thing I find curious about this is that the estimated number of rows is much closer in the "offset 0" form of the query. Since the logic itself is identical, I would have expected the estimated total number of rows for both forms of this query to be identical. Any reason the two plans estimate a different total number of rows? (explain statements for the two forms of the same query from earlier in the thread here: http://archives.postgresql.org/pgsql-performance/2007-12/msg00088.php )
On Fri, 2007-12-07 at 12:45 -0500, Robert Treat wrote: > On Thursday 06 December 2007 04:38, Simon Riggs wrote: > > > I think you're completely overlooking the effect disk latency has on > > > query times. We run queries all the time that can vary from 4 hours to > > > 12 hours based solely on the amount of concurrent load on the > > > system, even though they always plan with the same cost. > > > > Not at all. If we had statement_cost_limit then it would be applied > > after planning and before execution begins. The limit would be based > > upon the planner's estimate, not the likely actual execution time. > > > > This is nice, but it doesn't prevent "slow queries" reliably (which seemed to > be in the original complaints), since query time cannot be directly traced > back to statement cost. Hmm, well it can be directly traced, just not with the accuracy you desire. We can improve the accuracy, but then we would need to run the query first in order to find out it was killing us. > > So yes, a query may vary in execution time by a large factor as you > > suggest, and it would be difficult to set the proposed parameter > > accurately. However, the same is also true of statement_timeout, which > > we currently support, so I don't see this point as a blocker. > > > > Which leaves us at the burning question: Would you use such a facility, > > or would the difficulty in setting it exactly prevent you from using it > > for real? > > I'm not sure. My personal instincts are that the solution is too fuzzy for me > to rely on, and if it isn't reliable, it's not a good solution. If you look at > all of the things people seem to think this will solve, I think I can raise > an alternative option that would be a more definitive solution: > > "prevent queries from taking longer than x" -> statement_timeout. > > "prevent planner from switching to bad plan" -> hint system > > "prevent query from consuming too many resources" -> true resource > restrictions at the database level I like and agree with your list, as an overview. I differ slightly on specifics. > I'm not so much against the idea of a statement cost limit, but I think we > need to realize that it does not really solve as many problems as people > think, in cases where it will help it often will do so poorly, and that there > are probably better solutions available to those problems. Of course if you > back me into a corner I'll agree a poor solution is better than no solution, > so... statement_cost_limit isn't a panacea for all performance ills, it's just one weapon in the armoury. I'm caught somewhat in that whatever I propose as a concrete next step, somebody says I should have picked another. Oh well. On specific points: With hints I prefer a declarative approach; I'll discuss that later in the release cycle. The true resource restrictions sound good, but it's still magic numbers. How many I/Os are allowed before you kill the query? How much CPU? Those are still going to be guessed at. How do we tell the difference between a random I/O and a sequential I/O - there's no difference as far as Postgres is concerned in the buffer manager, but it can cause a huge performance difference. Whether you use real resource limits or statement cost limits you still need to work out the size of your table and then guess at appropriate limits. Every other system I've seen uses resource limits, but the big problem is that they are applied after something has been running for a long time. It's kinda like saying I'll treat the gangrene when it reaches my knee.
I prefer to avoid the problem before it starts to hurt at all, so I advocate learning the lessons from other systems, not simply following them. But having said that, I'm not against having them; it's up to the administrator how they want to manage their database, not me. What resource limit parameters would you choose? (temp disk space etc..) -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
I have _not_ added a TODO for this item. Let me know if one is needed. --------------------------------------------------------------------------- Tom Lane wrote: > Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > > Tom Lane wrote: > >> There's something fishy about this --- given that that plan has a lower > >> cost estimate, it should've picked it without any artificial > >> constraints. > > > I think the reason it's not picking it was discussed back in this thread > > too. > > http://archives.postgresql.org/pgsql-performance/2005-03/msg00675.php > > http://archives.postgresql.org/pgsql-performance/2005-03/msg00684.php > > My offset 0 is forcing the outer join. > > [Edit: Ugh - meant cartesian join - which helps this kind of query.] > > Ah; I missed the fact that the two relations you want to join first > don't have any connecting WHERE clause. > > The concern I mentioned in the above thread was basically that I don't > want the planner to go off chasing Cartesian join paths in general --- > they're usually useless and would result in an exponential explosion > in the number of join paths considered in many-table queries. > > However, in this case the reason that the Cartesian join might be > relevant is that both of them are needed in order to form an inner > indexscan on the big table. I wonder if we could drive consideration > of the Cartesian join off of noticing that. It'd take some rejiggering > around best_inner_indexscan(), or somewhere in that general vicinity. > > Way too late for 8.3, but something to think about for next time. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > I have _not_ added a TODO for this item. Let me know if one is needed. Please do, I think it's an open issue. * Consider Cartesian joins when both relations are needed to form an indexscan qualification for a third relation regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I have _not_ added a TODO for this item. Let me know if one is needed. > > Please do, I think it's an open issue. > > * Consider Cartesian joins when both relations are needed to form an > indexscan qualification for a third relation Done, with URL added. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Anything that helps pgsql perform well in a TB-sized database environment is a Good Thing (r) :D Pablo Bruce Momjian wrote: > I have _not_ added a TODO for this item. Let me know if one is needed.
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: >> I have _not_ added a TODO for this item. Let me know if one is needed. > > Please do, I think it's an open issue. > > * Consider Cartesian joins when both relations are needed to form an > indexscan qualification for a third relation > Would another possible condition for considering Cartesian joins be: * Consider Cartesian joins when a unique constraint can prove that at most one row will be pulled from one of the tables that would be part of this join? In the couple of cases where this happened to me, it was in queries on a textbook star schema like this:

select * from fact
  join dim1 using (dim1_id)
  join dim2 using (dim2_id)
 where dim1.value = 'something'
   and dim2.valuex = 'somethingelse'
   and dim2.valuey = 'more';

and looking up all the IDs before hitting the huge fact table. Often in these cases the where clauses on the dimension tables are on values with a unique constraint. If I understand right - if the constraint can prove it'll return at most 1 row - that means the cartesian join is provably safe from blowing up. Not sure if that's redundant with the condition you mentioned, or if it's yet a separate condition where we might also want to consider cartesian joins. Ron M
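A sketch of the schema shape being described, with the unique constraints that would let the planner prove the "at most one row" property (all names are illustrative):

  CREATE TABLE dim1 (dim1_id serial PRIMARY KEY, value text UNIQUE);
  CREATE TABLE dim2 (dim2_id serial PRIMARY KEY,
                     valuex text, valuey text,
                     UNIQUE (valuex, valuey));
  CREATE TABLE fact (dim1_id integer REFERENCES dim1,
                     dim2_id integer REFERENCES dim2,
                     measure numeric);

  -- each dimension lookup below can match at most one row, so a Cartesian
  -- join of the two dimension results is provably tiny before fact is touched
  SELECT *
  FROM fact
  JOIN dim1 USING (dim1_id)
  JOIN dim2 USING (dim2_id)
  WHERE dim1.value = 'something'
    AND dim2.valuex = 'somethingelse'
    AND dim2.valuey = 'more';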
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > Would another possible condition for considering > Cartesian joins be be: > * Consider Cartesian joins when a unique constraint can prove > that at most one row will be pulled from one of the tables > that would be part of this join? What for? That would still lead us to consider large numbers of totally useless joins. regards, tom lane
Tom Lane wrote: > Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: >> Would another possible condition for considering >> Cartesian joins be: > >> * Consider Cartesian joins when a unique constraint can prove >> that at most one row will be pulled from one of the tables >> that would be part of this join? > > What for? That would still lead us to consider large numbers of totally > useless joins. > > regards, tom lane Often I get order-of-magnitude better queries by forcing the cartesian join even without multi-column indexes. Explain analyze results below. Here's an example with your typical star schema: fact is the central fact table; d_ref is a dimension table for the referrer; d_uag is a dimension table for the useragent. Forcing the cartesian join using "offset 0" makes the query take 14 ms (estimated cost 7575). If I don't force the cartesian join the query takes over 100 ms (estimated cost 398919). Indexes are on each dimension; but no multi-column indexes (since the ad-hoc queries can hit any permutation of dimensions).

logs=# explain analyze select * from fact natural join
       (select * from d_ref natural join d_uag
        where ref_host = 'www.real.com'
          and ref_path = '/products/player/more_info/moreinfo.html'
          and ref_query = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='
          and useragent = 'Mozilla/4.08 [en] (WinNT; U ;Nav)'
        offset 0) as a;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=6465.12..7575.91 rows=367 width=2096) (actual time=14.152..14.192 rows=4 loops=1)
   ->  Limit  (cost=0.00..14.22 rows=1 width=218) (actual time=0.084..0.102 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..14.22 rows=1 width=218) (actual time=0.082..0.096 rows=1 loops=1)
               ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=0.056..0.058 rows=1 loops=1)
                     Index Cond: (((ref_path)::text = '/products/player/more_info/moreinfo.html'::text) AND ((ref_host)::text = 'www.real.com'::text) AND ((ref_query)::text = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='::text))
               ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=0.020..0.029 rows=1 loops=1)
                     Index Cond: ((useragent)::text = 'Mozilla/4.08 [en] (WinNT; U ;Nav)'::text)
   ->  Bitmap Heap Scan on fact  (cost=6465.12..7556.18 rows=367 width=32) (actual time=14.053..14.066 rows=4 loops=1)
         Recheck Cond: ((fact.uag_id = a.uag_id) AND (fact.ref_id = a.ref_id))
         ->  BitmapAnd  (cost=6465.12..6465.12 rows=367 width=0) (actual time=14.016..14.016 rows=0 loops=1)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=2.258..2.258 rows=7960 loops=1)
                     Index Cond: (fact.uag_id = a.uag_id)
               ->  Bitmap Index Scan on i__fact__ref_id  (cost=0.00..3581.50 rows=253913 width=0) (actual time=9.960..9.960 rows=13751 loops=1)
                     Index Cond: (fact.ref_id = a.ref_id)
 Total runtime: 14.332 ms
(15 rows)

logs=# explain analyze select * from fact natural join
       (select * from d_ref natural join d_uag
        where ref_host = 'www.real.com'
          and ref_path = '/products/player/more_info/moreinfo.html'
          and ref_query = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='
          and useragent = 'Mozilla/4.08 [en] (WinNT; U ;Nav)'
       ) as a;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=2827.72..398919.05 rows=1 width=242) (actual time=78.777..107.038 rows=4 loops=1)
   Hash Cond: (fact.ref_id = d_ref.ref_id)
   ->  Nested Loop  (cost=2819.88..398908.65 rows=511 width=119) (actual time=6.311..101.843 rows=7960 loops=1)
         ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=0.021..0.029 rows=1 loops=1)
               Index Cond: ((useragent)::text = 'Mozilla/4.08 [en] (WinNT; U ;Nav)'::text)
         ->  Bitmap Heap Scan on fact  (cost=2819.88..396449.49 rows=196223 width=32) (actual time=6.273..91.645 rows=7960 loops=1)
               Recheck Cond: (fact.uag_id = d_uag.uag_id)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=5.117..5.117 rows=7960 loops=1)
                     Index Cond: (fact.uag_id = d_uag.uag_id)
   ->  Hash  (cost=7.83..7.83 rows=1 width=127) (actual time=0.069..0.069 rows=1 loops=1)
         ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=0.059..0.062 rows=1 loops=1)
               Index Cond: (((ref_path)::text = '/products/player/more_info/moreinfo.html'::text) AND ((ref_host)::text = 'www.real.com'::text) AND ((ref_query)::text = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='::text))
 Total runtime: 107.193 ms
(13 rows)
Ron Mayer wrote: > Tom Lane wrote: >> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: >>> Would another possible condition for considering >>> Cartesian joins be: >>> * Consider Cartesian joins when a unique constraint can prove >>> that at most one row will be pulled from one of the tables >>> that would be part of this join? >> >> What for? That would still lead us to consider large numbers of totally >> useless joins. > > Often I get order-of-magnitude better queries by forcing the cartesian > join even without multi-column indexes. Ah - and sometimes even two orders of magnitude improvement: 1.1 seconds with the Cartesian join, 200 seconds if it doesn't use it.

logs=# explain analyze select * from fact natural join
       (select * from d_ref natural join d_uag
        where ref_host = 'www.real.com'
          and ref_path = '/products/player/more_info/moreinfo.html'
          and ref_query = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='
          and useragent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)'
        offset 0) as a;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=6465.12..7575.91 rows=367 width=2096) (actual time=1118.741..1119.207 rows=122 loops=1)
   ->  Limit  (cost=0.00..14.22 rows=1 width=218) (actual time=0.526..0.542 rows=1 loops=1)
         ->  Nested Loop  (cost=0.00..14.22 rows=1 width=218) (actual time=0.524..0.537 rows=1 loops=1)
               ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=0.168..0.170 rows=1 loops=1)
                     Index Cond: (((ref_path)::text = '/products/player/more_info/moreinfo.html'::text) AND ((ref_host)::text = 'www.real.com'::text) AND ((ref_query)::text = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='::text))
               ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=0.347..0.355 rows=1 loops=1)
                     Index Cond: ((useragent)::text = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)'::text)
   ->  Bitmap Heap Scan on fact  (cost=6465.12..7556.18 rows=367 width=32) (actual time=1118.196..1118.491 rows=122 loops=1)
         Recheck Cond: ((fact.uag_id = a.uag_id) AND (fact.ref_id = a.ref_id))
         ->  BitmapAnd  (cost=6465.12..6465.12 rows=367 width=0) (actual time=1115.565..1115.565 rows=0 loops=1)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=813.859..813.859 rows=1183470 loops=1)
                     Index Cond: (fact.uag_id = a.uag_id)
               ->  Bitmap Index Scan on i__fact__ref_id  (cost=0.00..3581.50 rows=253913 width=0) (actual time=8.667..8.667 rows=13751 loops=1)
                     Index Cond: (fact.ref_id = a.ref_id)
 Total runtime: 1122.245 ms
(15 rows)

logs=# explain analyze select * from fact natural join
       (select * from d_ref natural join d_uag
        where ref_host = 'www.real.com'
          and ref_path = '/products/player/more_info/moreinfo.html'
          and ref_query = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='
          and useragent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)'
       ) as a;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=2827.72..398919.05 rows=1 width=242) (actual time=114138.193..200622.416 rows=122 loops=1)
   Hash Cond: (fact.ref_id = d_ref.ref_id)
   ->  Nested Loop  (cost=2819.88..398908.65 rows=511 width=119) (actual time=1524.600..199522.182 rows=1183470 loops=1)
         ->  Index Scan using i_uag__val on d_uag  (cost=0.00..6.38 rows=1 width=91) (actual time=0.023..0.033 rows=1 loops=1)
               Index Cond: ((useragent)::text = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)'::text)
         ->  Bitmap Heap Scan on fact  (cost=2819.88..396449.49 rows=196223 width=32) (actual time=1524.562..197627.135 rows=1183470 loops=1)
               Recheck Cond: (fact.uag_id = d_uag.uag_id)
               ->  Bitmap Index Scan on i__fact__uag_id  (cost=0.00..2770.83 rows=196223 width=0) (actual time=758.888..758.888 rows=1183470 loops=1)
                     Index Cond: (fact.uag_id = d_uag.uag_id)
   ->  Hash  (cost=7.83..7.83 rows=1 width=127) (actual time=0.067..0.067 rows=1 loops=1)
         ->  Index Scan using i_ref__val on d_ref  (cost=0.00..7.83 rows=1 width=127) (actual time=0.058..0.060 rows=1 loops=1)
               Index Cond: (((ref_path)::text = '/products/player/more_info/moreinfo.html'::text) AND ((ref_host)::text = 'www.real.com'::text) AND ((ref_query)::text = '?ID=370&DC=&LANG=&PN=RealOne%20Player&PV=6.0.11.818&PT=&OS=&CM=&CMV=&LS=&RE=&RA=&RV='::text))
 Total runtime: 200625.636 ms
(13 rows)