Thread: What is the right way to deal with a table with rows that are not in a random order?
What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
We have a very large table (150 million rows) where the rows are not in a random order. Some common queries will have millions of results, and this of course is slow. For an interactive interface to the database, we want to put a limit on all queries so that queries will return quickly even if there are millions of results.
The problem we are seeing at the moment is that the statistics histograms are actually often hurting us. Postgres sees that for some queries there are so many results that it decides to do a sequential scan, rather than using the index. Unfortunately, sometimes all of these millions of results are at the end of the table, rather than being randomly distributed in the table, so a sequential scan is the worst possible approach.
To fix this, we could use an "order by" clause to force Postgres to use the index, but this isn't so easy as we are using Postgres through an ORM (i.e., Django in this case), and it's not so easy to try to jury-rig things this way on a per-query basis.
Alternatively, we could randomize the ordering of the table rows, but that seems like an awfully big hammer, and will also prevent us from ordering the table on a specific index, which is different from the index in question, but might be correlated with it.
Is there a way of telling Postgres not to assume that the table rows are in a random order, so that Postgres won't make the mistake of ever doing a sequential scan on a huge table when there is a small limit?
Thanks!
|>ouglas
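For illustration, a minimal sketch of the pathology being described, with made-up table and column names (the real schema shows up later in the thread). The matching rows are loaded last, so they sit at the physical end of the table:

-- Hypothetical setup: all category = 2 rows land in the last ~5% of the heap.
CREATE TABLE big_table (id bigserial PRIMARY KEY, category integer);
INSERT INTO big_table (category)
    SELECT CASE WHEN g > 9500000 THEN 2 ELSE 1 END
    FROM generate_series(1, 10000000) AS g;
CREATE INDEX big_table_category_idx ON big_table (category);
ANALYZE big_table;

-- Under the planner's uniform-distribution assumption, a seq scan "should"
-- find 20 matches almost immediately, so it looks cheaper than the index
-- scan -- but here it must read ~95% of the table before the first match.
EXPLAIN SELECT * FROM big_table WHERE category = 2 LIMIT 20;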
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Jeff Davis
Date:
On Wed, 2009-05-27 at 19:53 -0400, Douglas Alan wrote:
> We have a very large table (150 million rows) where the rows are not
> in a random order. Some common queries will have millions of results,
> and this of course is slow. For an interactive interface to the
> database, we want to put a limit on all queries so that queries will
> return quickly even if there are millions of results.
>
> The problem we are seeing at the moment is that the statistics
> histograms are actually often hurting us. Postgres sees that for some
> queries there are so many results that it decides to do a sequential
> scan, rather than using the index. Unfortunately, sometimes all of
> these millions of results are at the end of the table, rather than
> being randomly distributed in the table, so a sequential scan is the
> worst possible approach.

If you're putting a LIMIT on it, why does it return millions of results?

> To fix this, we could use an "order by" clause to force Postgres to
> use the index, but this isn't so easy as we are using Postgres through
> an ORM (i.e., Django in this case), and it's not so easy to try to
> jury-rig things this way on a per-query basis.

ORDER BY won't always choose an index scan, for instance if the correlation is low and the expected results are many.

Can you pick out an interesting query and give some specifics, like:

* the query
* the EXPLAIN ANALYZE output (or EXPLAIN without ANALYZE if it takes too long to even run once)
* EXPLAIN ANALYZE output if you force the index scan
* the statistics for the relevant columns, such as histogram and correlation

Regards,
Jeff Davis
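The diagnostics Jeff asks for can be gathered in one transaction, along these lines (a sketch reusing the hypothetical names from the earlier example; SET LOCAL reverts when the transaction ends):

BEGIN;
SET LOCAL enable_seqscan = off;   -- force the index-scan plan for comparison
EXPLAIN ANALYZE
    SELECT * FROM big_table WHERE category = 2 LIMIT 20;
ROLLBACK;   -- the SET LOCAL does not outlive the transaction

-- The column statistics the planner is working from:
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, correlation
FROM pg_stats
WHERE tablename = 'big_table' AND attname = 'category';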
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
On Wed, May 27, 2009 at 8:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> If you're putting a LIMIT on it, why does it return millions of results?
It doesn't return millions of results with the LIMIT on it. It just does a sequential scan of the table and doesn't find any results until it gets to the last quarter of the table. Sequentially scanning through 3/4 of the huge table before it gets a single match takes a very long time.
As I said, in my original post, Postgres's approach would be completely reasonable in this case, if the rows that it was looking for were sprinkled randomly throughout the table. But they're not in this case -- they're all at the end.
> Can you pick out an interesting query and give some specifics, like:
> * the query
> * the EXPLAIN ANALYZE output (or EXPLAIN without ANALYZE if it takes too
> long to even run once)
> * EXPLAIN ANALYZE output if you force the index scan
> * the statistics for the relevant columns, such as histogram and
> correlation
As I mentioned, the situation is very simple, and easy to understand what is going on. There's absolutely no mystery as to why Postgres is doing what it's doing. 25% of the table matches the query. Postgres knows this due to the statistics histogram for the column. Postgres is deciding to do a sequential scan because it knows that 25% of the rows match the query.
Unfortunately, in this case, that's a poor approach.
|>ouglas
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Simon Riggs
Date:
On Wed, 2009-05-27 at 19:53 -0400, Douglas Alan wrote:
> We have a very large table (150 million rows) where the rows are not
> in a random order. Some common queries will have millions of results,
> and this of course is slow. For an interactive interface to the
> database, we want to put a limit on all queries so that queries will
> return quickly even if there are millions of results.

Partition the table, then scan the correct partition.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
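In the PostgreSQL of this era, partitioning means table inheritance plus CHECK constraints and constraint exclusion; a rough sketch of what Simon is suggesting, with hypothetical names:

-- Parent table plus one partition per survey; the CHECK constraints let
-- the planner skip partitions that cannot match.
CREATE TABLE observations (id bigint, survey_id integer, payload text);
CREATE TABLE observations_survey2 (CHECK (survey_id = 2))
    INHERITS (observations);
CREATE TABLE observations_other (CHECK (survey_id <> 2))
    INHERITS (observations);
CREATE INDEX observations_survey2_idx ON observations_survey2 (survey_id);

SET constraint_exclusion = on;
-- A query against the parent now only scans partitions whose CHECK
-- constraint can be satisfied:
EXPLAIN SELECT * FROM observations WHERE survey_id = 2 LIMIT 20;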
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Tom Lane
Date:
Douglas Alan <darkwater42@gmail.com> writes:
> As I said, in my original post, Postgres's approach would be completely
> reasonable in this case, *if* the rows that it was looking for were
> sprinkled randomly throughout the table. But they're *not* in this case --
> they're all at the end.

There's been some talk of penalizing the seqscan+limit combination (perhaps by increasing the estimated start cost for the seqscan) if the WHERE clause involves any variables that have a correlation stat significantly different from zero. But nobody's done the legwork to see if this would really be useful or what an appropriate penalty curve might be.

regards, tom lane
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Scott Marlowe
Date:
On Wed, May 27, 2009 at 8:43 PM, Douglas Alan <darkwater42@gmail.com> wrote:
> On Wed, May 27, 2009 at 8:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>
>> If you're putting a LIMIT on it, why does it return millions of results?
>
> It doesn't return millions of results with the LIMIT on it. It just does a
> sequential scan of the table and doesn't find any results until it gets to
> the last quarter of the table. Sequentially scanning through 3/4 of the
> huge table before it gets a single match takes a very long time.
>
> As I said, in my original post, Postgres's approach would be completely
> reasonable in this case, if the rows that it was looking for were sprinkled
> randomly throughout the table. But they're not in this case -- they're all
> at the end.
>
>> Can you pick out an interesting query and give some specifics, like:
>> * the query
>> * the EXPLAIN ANALYZE output (or EXPLAIN without ANALYZE if it takes too
>> long to even run once)
>> * EXPLAIN ANALYZE output if you force the index scan
>> * the statistics for the relevant columns, such as histogram and
>> correlation
>
> As I mentioned, the situation is very simple, and easy to understand what
> is going on. There's absolutely no mystery as to why Postgres is doing what
> it's doing. 25% of the table matches the query. Postgres knows this due to
> the statistics histogram for the column. Postgres is deciding to do a
> sequential scan because it knows that 25% of the rows match the query.
>
> Unfortunately, in this case, that's a poor approach.

Postgres has a very smart query planner, with a lot of knobs you can fiddle with to change how and when it switches from one plan to another. It's possible that by changing some of those, pgsql will use a different query plan that is more efficient. Seeing things like explain analyze might help someone make a suggestion.

OTOH, if you've got it all sussed out, then ignore the request for more information.
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Scott Marlowe
Date:
Note that in the OP's case I'd probably try testing things like turning off seqscan, or lowering random_page_cost. I'd also look at clustering on the index for the field you're selecting on.
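Spelled out, those three experiments look something like this (a sketch; the table and index names are the ones that appear in the plans later in the thread, and CLUSTER takes an exclusive lock and rewrites the table, which is no small thing at 150 million rows):

-- Per-session planner settings, for testing only:
SET enable_seqscan = off;
SET random_page_cost = 2;   -- the default is 4; lower favors index scans

-- Physically reorder the table by the index (heavyweight operation):
CLUSTER maindb_astobject USING maindb_astobject_survey_id;
ANALYZE maindb_astobject;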
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
On Thu, May 28, 2009 at 4:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Partition the table, then scan the correct partition.

If I do that, will Postgres figure out the "right thing" to do if the parent table is queried instead? Also, what are the performance implications then for doing queries that span all the partitions, which will be the norm for our application?

The application in question is a kind of data warehousing thing (of astronomical stars), and there's an ORM in the middle, so it's not easy for us to hand-tune how individual queries are specified. Instead, we have to structure the database and the indexes so that things generally perform well, without having to tweak specific queries.

Users can specify fairly arbitrary search criteria. All of the queries should perform well. By "well", I mean within 10 seconds or so. Scanning all of the 150 million rows takes much longer than 10 seconds, unfortunately.

Any one of these "solutions" (sketched after this message) will cause Postgres to do an index scan in the problematic case where Postgres is deciding to do a sequential scan. The index scan performs snappily enough:

- Using "order by" on the query.

- Changing the search value for the column to a value that occurs less frequently.

- Fetching the value to search for via a sub-query, so that Postgres can't determine a priori that the value being searched for occurs so commonly.

Unfortunately, as I mentioned, due to the ORM, none of these solutions really work for us in practice, as opposed to at a psql prompt.

|>ouglas
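At a psql prompt, the three workarounds above might look like the following (table and column names are taken from the plans posted later in the thread; whether each one actually flips the plan depends on the planner's estimates):

-- 1. An ORDER BY on the indexed column nudges the planner toward the index:
SELECT * FROM maindb_astobject WHERE survey_id = 2
ORDER BY survey_id LIMIT 20;

-- 2. A rarer value gets a much lower selectivity estimate:
SELECT * FROM maindb_astobject WHERE survey_id = 5 LIMIT 20;

-- 3. A sub-query hides the constant, denying the planner its a priori
--    frequency estimate for the value:
SELECT * FROM maindb_astobject
WHERE survey_id = (SELECT id FROM maindb_enumentity
                   WHERE entityname = 'MACHO')
LIMIT 20;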
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
On Thu, May 28, 2009 at 10:24 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> OTOH, if you've got it all sussed out, then ignore the request for more information.

I don't *know* if I have it "all sussed out", but I *do* know why Postgres is doing what it is doing in this particular case. It's assuming that the value in question is evenly distributed throughout the table, when in actuality, the value in question is clustered at the very end of the table.

What I don't have sussed out is how best to address this issue. The most obvious solution would be an option to tell Postgres not to assume that the value is evenly distributed throughout the table, and to take account of the fact that the data in question might very well be clustered at the very end of the table.

|>ouglas
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
On Thu, May 28, 2009 at 10:41 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Note that in the OP's case I'd probably try testing things like turning
> off seqscan, or lowering random_page_cost. I'd also look at
> clustering on the index for the field you're selecting on.
I'm worried that turning off seqscan would distort other queries. (Remember, I don't have control of the individual queries. The user of the application can specify all sorts of queries, and there's an ORM in the middle.)
In order to force Postgres to do an index scan for this query, I had to set random_page_cost to 0. Even 0.1 was not small enough. Alternatively, I could set seq_page_cost to 39. (38 was not big enough.) Again, I'm worried that by using such a big hammer, I would distort Postgres's query planning for other queries.
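In other words, the experiment described here was presumably something like the following (session-local settings; the numbers are the ones from the paragraph above):

SET random_page_cost = 0;   -- even 0.1 was not small enough to get the index scan
-- or, alternatively:
SET seq_page_cost = 39;     -- 38 was not big enough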
|>ouglas
P.S. Here is the actual data that some people have been clamoring for:
Slow query:
psql> select * from maindb_astobject
where survey_id = 2
limit 20 offset 1000;
Time: 18073.691 ms
Here's the explanation:
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=47.99..48.95 rows=20 width=153)
-> Seq Scan on maindb_astobject (cost=0.00..3538556.10 rows=73736478 width=153)
Filter: (survey_id = 2)
The same query made fast by removing Postgres's ability to know a priori what particular value is being searched for:
psql> select * from maindb_astobject join maindb_enumentity
on maindb_astobject.survey_id = maindb_enumentity.id
where entityname = 'MACHO'
limit 20 offset 1000;
Time: 1.638 ms
Here's the explanation for the above query showing that it is using the index:
QUERY PLAN
--------------------------------------------------------------------------
Limit (cost=164.97..168.27 rows=20 width=215)
-> Nested Loop (cost=0.00..1233523.72 rows=7477081 width=215)
-> Seq Scan on maindb_enumentity (cost=0.00..1.12 rows=1 width=62)
Filter: ((entityname)::text = 'MACHO'::text)
-> Index Scan using maindb_astobject_survey_id on maindb_astobject (cost=0.00..1046595.57 rows=14954162 width=153)
Index Cond: (maindb_astobject.survey_id = maindb_enumentity.id)
P.P.S. Many other queries were very slow due to Postgres wanting to use hash joins and merge joins rather than nested index scans. The nested index scans were, in fact, much faster. I just turned off hash joins and merge joins: problem solved. It might be nice at some point to figure out what is going on with Postgres trying to use these fancier joins that turn out to be much slower for us, but that's a worry for another day, since I have a perfectly good work-around at the moment.
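That work-around presumably amounts to the following pair of planner settings (per-session here; they could also go in postgresql.conf, with all the collateral risk discussed later in the thread):

SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- The planner now falls back to nested loops, which can drive the inner
-- table through its index, e.g.:
EXPLAIN SELECT * FROM maindb_astobject JOIN maindb_enumentity
    ON maindb_astobject.survey_id = maindb_enumentity.id
WHERE entityname = 'MACHO' LIMIT 20;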
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Scott Marlowe
Date:
On Thu, May 28, 2009 at 1:12 PM, Douglas Alan <darkwater42@gmail.com> wrote:
> On Thu, May 28, 2009 at 10:24 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> OTOH, if you've got it all sussed out, then ignore the request for more information.
>
> I don't *know* if I have it "all sussed out", but I *do* know why
> Postgres is doing what it is doing in this particular case. It's
> assuming that the value in question is evenly distributed throughout
> the table, when in actuality, the value in question is clustered at
> the very end of the table.

It's doing way more than that. My point above was that the query planner is not JUST assuming the values are well ordered. It's assuming random_page_cost is x times more than sequential page cost, it's assuming the table doesn't fit in effective cache, or shared buffers, it's assuming lots of things based on how you've tuned (or not) your database. I'll finish in reply to your other post.
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Scott Marlowe
Date:
On Thu, May 28, 2009 at 1:45 PM, Douglas Alan <darkwater42@gmail.com> wrote:
> On Thu, May 28, 2009 at 10:41 AM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>
>> Note that in the OP's case I'd probably try testing things like turning
>> off seqscan, or lowering random_page_cost. I'd also look at
>> clustering on the index for the field you're selecting on.
>
> I'm worried that turning off seqscan would distort other queries.
> (Remember, I don't have control of the individual queries. The user of the
> application can specify all sorts of queries, and there's an ORM in the
> middle.)

You are aware you can turn off seq scans for just the current connection, right?

set enable_seqscan=off;

> In order to force Postgres to do an index scan for this query, I had to set
> random_page_cost to 0. Even 0.1 was not small enough. Alternatively, I
> could set seq_page_cost to 39. (38 was not big enough.) Again, I'm worried
> that by using such a big hammer, I would distort Postgres's query planning
> for other queries.

No one's saying to do it all the time. They're saying to do it and then run explain analyze on your query, then post the results of both and let us have a look.

> P.S. Here is the actual data that some people have been clamoring for:

No, it's not. It's just explain output. Unless the actual query takes hours to run, it's much more useful to have explain analyze output. Without the analyze part, it's only showing what the planner expects, not what actually happens.

> Slow query:
>
> psql> select * from maindb_astobject
> where survey_id = 2
> limit 20 offset 1000;

You'll notice that in your previous posts you never mentioned offset, which greatly affects the plan chosen. Is survey_id indexed?

> Time: 18073.691 ms
>
> Here's the explanation:
>
> QUERY PLAN
> -------------------------------------------------------------------------------------
> Limit (cost=47.99..48.95 rows=20 width=153)
>   -> Seq Scan on maindb_astobject (cost=0.00..3538556.10 rows=73736478 width=153)
>        Filter: (survey_id = 2)
>
> The same query made fast by removing Postgres's ability to know a priori
> what particular value is being searched for:
>
> psql> select * from maindb_astobject join maindb_enumentity
> on maindb_astobject.survey_id = maindb_enumentity.id
> where entityname = 'MACHO'
> limit 20 offset 1000;

That's not the same query.

> Time: 1.638 ms
>
> Here's the explanation for the above query showing that it is using the
> index:

So, can we see explain analyze?

> QUERY PLAN
> --------------------------------------------------------------------------
> Limit (cost=164.97..168.27 rows=20 width=215)
>   -> Nested Loop (cost=0.00..1233523.72 rows=7477081 width=215)
>        -> Seq Scan on maindb_enumentity (cost=0.00..1.12 rows=1 width=62)
>             Filter: ((entityname)::text = 'MACHO'::text)
>        -> Index Scan using maindb_astobject_survey_id on maindb_astobject
>           (cost=0.00..1046595.57 rows=14954162 width=153)
>             Index Cond: (maindb_astobject.survey_id = maindb_enumentity.id)
>
> P.P.S. Many other queries were very slow due to Postgres wanting to use hash
> joins and merge joins rather than nested index scans.

Then it's quite possible you have a problem with misestimation of values in your db. Since we have no explain ANALYZE output, we cannot judge if this is the case.

> The nested index
> scans were, in fact, much faster. I just turned off hash joins and merge
> joins: problem solved.

One problem solved, another one created is more likely.

> It might be nice at some point to figure out what
> is going on with Postgres trying to use these fancier joins that turn out
> to be much slower for us, but that's a worry for another day, since I have
> a perfectly good work-around at the moment.

Yes, let us see explain analyze output and maybe we can help. You know, the thing that was asked for at the beginning.
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Martin Gainty
Date:
The ORM shouldn't be a factor, as you have many configurable properties available in hibernate.cfg.xml or hibernate.properties.
Hibernate's own connection pooling algorithm is, however, quite rudimentary. It is intended to help you get started and is not intended for use in a production system, or even for performance testing. You should use a third-party pool for best performance and stability. Just replace the hibernate.connection.pool_size property with connection-pool-specific settings. This will turn off Hibernate's internal pool. For example, you might like to use C3P0. The basic connection properties are:

hibernate.connection.driver_class  | JDBC driver class
hibernate.connection.url           | JDBC URL
hibernate.connection.username      | database user
hibernate.connection.password      | database user password
hibernate.connection.pool_size     | maximum number of pooled connections

C3P0 is an open source JDBC connection pool distributed along with Hibernate in the lib directory. Hibernate will use its C3P0ConnectionProvider for connection pooling if you set hibernate.c3p0.* properties (see https://www.hibernate.org/214.html). If you'd like to use Proxool, refer to the packaged hibernate.properties and the Hibernate web site for more information.

Here is an example hibernate.properties file for C3P0:

hibernate.connection.driver_class = org.postgresql.Driver
hibernate.connection.url = jdbc:postgresql://localhost/mydatabase
hibernate.connection.username = myuser
hibernate.connection.password = secret
hibernate.c3p0.min_size=5
hibernate.c3p0.max_size=20
hibernate.c3p0.timeout=1800
hibernate.c3p0.max_statements=50
hibernate.dialect = org.hibernate.dialect.PostgreSQLDialect
Martin Gainty
______________________________________________
Verzicht und Vertraulichkeitanmerkung/Note de déni et de confidentialité
Diese Nachricht ist vertraulich. Sollten Sie nicht der vorgesehene Empfaenger sein, so bitten wir hoeflich um eine Mitteilung. Jede unbefugte Weiterleitung oder Fertigung einer Kopie ist unzulaessig. Diese Nachricht dient lediglich dem Austausch von Informationen und entfaltet keine rechtliche Bindungswirkung. Aufgrund der leichten Manipulierbarkeit von E-Mails koennen wir keine Haftung fuer den Inhalt uebernehmen.
> From: darkwater42@gmail.com
> Date: Thu, 28 May 2009 15:03:32 -0400
> Subject: Re: [GENERAL] What is the right way to deal with a table with rows that are not in a random order?
> To: pgsql-general@postgresql.org
> CC: simon@2ndquadrant.com
>
> On Thu, May 28, 2009 at 4:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > Partition the table, then scan the correct partition.
>
> If I do that, will Postgres figure out the "right thing" to do if the
> parent table is queried instead? Also, what are the performance
> implications then for doing queries that span all the partitions,
> which will be the norm for our application?
>
> The application in question is a kind of data warehousing thing (of
> astronomical stars), and there's an ORM in the middle, so it's not
> easy for us to hand-tune how individual queries are specified.
> Instead, we have to structure the database and the indexes so that
> things generally perform well, without having to tweak specific
> queries.
>
> Users can specify fairly arbitrary search criteria. All of the
> queries should perform well. By "well", I mean within 10 seconds or
> so. Scanning all of the 150 million rows takes much longer than 10
> seconds, unfortunately.
>
> Any one of these "solutions" will cause Postgres to do an index scan
> in the problematic case where Postgres is deciding to do a sequential
> scan. The index scan performs snappily enough:
>
> - Using "order by" on the query.
>
> - Changing the search value for the column to a value that occurs
> less frequently.
>
> - Fetching the value to search for via a sub-query so that Postgres
> can't determine a priori that the
> value being searched for occurs so commonly.
>
> Unfortunately, as I mentioned, due to the ORM, none of these solutions
> really work for us in practice, as opposed to at a psql prompt.
>
> |>ouglas
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Simon Riggs
Date:
On Thu, 2009-05-28 at 15:03 -0400, Douglas Alan wrote:
> The application in question is a kind of data warehousing thing (of
> astronomical stars), and there's an ORM in the middle, so it's not
> easy for us to hand-tune how individual queries are specified.
> Instead, we have to structure the database and the indexes so that
> things generally perform well, without having to tweak specific
> queries.

You should look at indexing solutions, or take professional advice on that.

> Users can specify fairly arbitrary search criteria. All of the
> queries should perform well. By "well", I mean within 10 seconds or
> so.

That's an unrealistic expectation, unfortunately. Fast search requires some form of preparatory action and without any clue as to what that should be you cannot guarantee response times.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
On Thu, May 28, 2009 at 5:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2009-05-28 at 15:03 -0400, Douglas Alan wrote:
>> The application in question is a kind of data warehousing thing (of
>> astronomical stars), and there's an ORM in the middle, so it's not
>> easy for us to hand-tune how individual queries are specified.
>> Instead, we have to structure the database and the indexes so that
>> things generally perform well, without having to tweak specific
>> queries.
>
> You should look at indexing solutions, or take professional advice on
> that.

I'm not sure that I know what an "indexing solution" is.

>> Users can specify fairly arbitrary search criteria. All of the
>> queries should perform well. By "well", I mean within 10 seconds or
>> so.
>
> That's an unrealistic expectation, unfortunately. Fast search requires
> some form of preparatory action and without any clue as to what that
> should be you cannot guarantee response times.

Unrealistic or not, it mostly currently works, modulo Postgres sometimes deciding to do a slow sequential scan when there is a perfectly good index.

For instance, I just tried a query that returns 137,042 results (out of 150 million rows), is constrained in several different ways, and involves a join with another large table, and it took nine seconds.

|>ouglas
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Jeff Davis
Date:
On Thu, 2009-05-28 at 15:12 -0400, Douglas Alan wrote:
> The most obvious solution would be an option to tell Postgres not to
> assume that the value is evenly distributed throughout the table, and
> to take account of the fact that the data in question might very well
> be clustered at the very end of the table.

There's no use adding a new statistic (user-supplied or collected) to PostgreSQL until we know that correlation isn't useful for that purpose. Can you tell us the correlation that PostgreSQL has already collected (I apologize if I missed this elsewhere in the thread)?

Hopefully, correlation is useful enough. With some planner tweaks similar to the ones Tom mentioned, and a few more data points, maybe we'll have a real solution.

Regards,
Jeff Davis
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Douglas Alan wrote:
>
>> I'm worried that turning off seqscan would distort other queries.
>> (Remember, I don't have control of the individual queries. The
>> user of the application can specify all sorts of queries, and
>> there's an ORM in the middle.)
>
> You are aware you can turn off seq scans for just the current
> connection, right?

Yes, of course. I thought that the suggestion was to change this setting in production, not just for diagnostic purposes. As I previously reported, I've already changed some settings for diagnostic purposes and also reported what values I had to set them to in order to force an index scan for the query in question.

> No one's saying to do it all the time. They're saying to do it and
> then run explain analyze on your query, then post the results of
> both and let us have a look.

Okay -- no problem:

set enable_seqscan = on;
explain analyze select * from maindb_astobject
where survey_id = 2
limit 1000;

"Limit (cost=0.00..48.03 rows=1000 width=78) (actual time=84837.835..265938.258 rows=1000 loops=1)"
"  ->  Seq Scan on maindb_astobject (cost=0.00..3538556.10 rows=73675167 width=78) (actual time=84837.825..265932.121 rows=1000 loops=1)"
"        Filter: (survey_id = 2)"
"Total runtime: 265942.416 ms"

set enable_seqscan = off;
explain analyze select * from maindb_astobject
where survey_id = 2
limit 1000;

"Limit (cost=0.00..67.37 rows=1000 width=78) (actual time=172.248..225.219 rows=1000 loops=1)"
"  ->  Index Scan using maindb_astobject_survey_id on maindb_astobject (cost=0.00..4963500.87 rows=73675167 width=78) (actual time=172.240..221.078 rows=1000 loops=1)"
"        Index Cond: (survey_id = 2)"
"Total runtime: 227.412 ms"

>> Slow query:
>>
>> psql> select * from maindb_astobject
>> where survey_id = 2
>> limit 20 offset 1000;
>
> You'll notice that in your previous posts you never mentioned offset,
> which greatly affects the plan chosen.

It makes little difference for this query. The plan is the same with or without the offset. The purpose of the offset was just to more or less simulate a "limit 1020" without getting deluged by 1,020 results.

> Is survey_id indexed?

Yes, of course.

>> P.P.S. Many other queries were very slow due to Postgres wanting to use hash
>> joins and merge joins rather than nested index scans.
>
> Then it's quite possible you have a problem with misestimation of
> values in your db. Since we have no explain ANALYZE output, we cannot
> judge if this is the case.

Okay, well if you can also tell me how to fix the hash join / merge join problem we are seeing without resorting to the current hack I am using to fix it, I will be very thankful!

>> The nested index
>> scans were, in fact, much faster. I just turned off hash joins and merge
>> joins: problem solved.
>
> One problem solved, another one created is more likely.

Such as? I haven't seen a nested index scan perform poorly yet. Though if there is some database parameter that is set incorrectly, and this is causing the hash join / merge join problem, then I can see that having that parameter not be set correctly could be a source of many future woes, so I certainly would appreciate any insight into that.

|>ouglas

P.S. Here are the stats on the column. It appears that my recollection of 25% of the table matching was a bit off. It's actually 98.5%! That might explain more why Postgres wants to do a sequential scan. The problem is that still means that it has to scan a million rows sequentially before it finds a single matching row, as the matching rows are at the end of the database:

select * from pg_stats
where tablename = 'maindb_astobject' and attname = 'survey_id';

 schemaname |    tablename     |  attname  | null_frac | avg_width | n_distinct | most_common_vals |                most_common_freqs                 | histogram_bounds | correlation
------------+------------------+-----------+-----------+-----------+------------+------------------+--------------------------------------------------+------------------+-------------
 public     | maindb_astobject | survey_id |         0 |         4 |          5 | {2,4,10,3,5}     | {0.985347,0.00966,0.00286667,0.0019,0.000226667} |                  |    0.998872
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Scott Marlowe
Date:
On Thu, May 28, 2009 at 6:45 PM, Douglas Alan <darkwater42@gmail.com> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
>> You are aware you can turn off seq scans for just the current
>> connection, right?
>
> Yes, of course. I thought that the suggestion was to change this
> setting in production, not just for diagnostic purposes. As I
> previously reported, I've already changed some settings for diagnostic
> purposes and also reported what values I had to set them to in order
> to force an index scan for the query in question.

Well, there are varying degrees of doing this in production, if needed. You can use a separate account for these queries that has different settings, so the query planner makes the right decision, etc.

>> No one's saying to do it all the time. They're saying to do it and
>> then run explain analyze on your query, then post the results of
>> both and let us have a look.
>
> Okay -- no problem:
>
> set enable_seqscan = on;
> explain analyze select * from maindb_astobject
> where survey_id = 2
> limit 1000;
>
> "Limit (cost=0.00..48.03 rows=1000 width=78) (actual time=84837.835..265938.258 rows=1000 loops=1)"
> "  ->  Seq Scan on maindb_astobject (cost=0.00..3538556.10 rows=73675167 width=78) (actual time=84837.825..265932.121 rows=1000 loops=1)"
> "        Filter: (survey_id = 2)"
> "Total runtime: 265942.416 ms"
>
> set enable_seqscan = off;
> explain analyze select * from maindb_astobject
> where survey_id = 2
> limit 1000;
>
> "Limit (cost=0.00..67.37 rows=1000 width=78) (actual time=172.248..225.219 rows=1000 loops=1)"
> "  ->  Index Scan using maindb_astobject_survey_id on maindb_astobject (cost=0.00..4963500.87 rows=73675167 width=78) (actual time=172.240..221.078 rows=1000 loops=1)"
> "        Index Cond: (survey_id = 2)"
> "Total runtime: 227.412 ms"

What was the random_page_cost during these two queries? Assuming seq_page_cost is 1, and random_page_cost was 4 or something, lowering it should force the move to an index scan. If you were already at 1.0 or so, then yeah, the cost estimation is off. Since index scans cost CPU (at least I think they do), you might try lowering your cpu_* costs to see if that helps.

>>> Slow query:
>>>
>>> psql> select * from maindb_astobject
>>> where survey_id = 2
>>> limit 20 offset 1000;
>>
>> You'll notice that in your previous posts you never mentioned offset,
>> which greatly affects the plan chosen.
>
> It makes little difference for this query. The plan is the same with
> or without the offset. The purpose of the offset was just to more or
> less simulate a "limit 1020" without getting deluged by 1,020 results.

Ah yes, I can see that now.

>>> P.P.S. Many other queries were very slow due to Postgres wanting to use hash
>>> joins and merge joins rather than nested index scans.
>>
>> Then it's quite possible you have a problem with misestimation of
>> values in your db. Since we have no explain ANALYZE output, we cannot
>> judge if this is the case.
>
> Okay, well if you can also tell me how to fix the hash join / merge
> join problem we are seeing without resorting to the current hack I am
> using to fix it, I will be very thankful!

Most of the time that happens, it's a question of estimates being way off, OR various cost values. Explain analyze with hash joins enabled and disabled can give you an idea where the planner might be making a mistake. With large tables it's a good idea to crank up your default stats target and reanalyze anyway, so that's worth a first try.

>>> The nested index
>>> scans were, in fact, much faster. I just turned off hash joins and merge
>>> joins: problem solved.
>>
>> One problem solved, another one created is more likely.
>
> Such as? I haven't seen a nested index scan perform poorly yet.

Who knows? Pgsql has enough other join methods that it's not likely to be a huge issue. But if your type of query changes a lot, the nested loops might get really slow really fast.

> Though if there is some database parameter that is set incorrectly,
> and this is causing the hash join / merge join problem, then I can see
> that having that parameter not be set correctly could be a source of
> many future woes, so I certainly would appreciate any insight into
> that.

The most common cause of these being the wrong choice is a default stats target that's too low for your data. But you've only got like 4 values in this one field, so I doubt that has a big effect.

> P.S. Here are the stats on the column. It appears that my recollection
> of 25% of the table matching was a bit off. It's actually 98.5%! That
> might explain more why Postgres wants to do a sequential scan. The
> problem is that still means that it has to scan a million rows
> sequentially before it finds a single matching row, as the matching
> rows are at the end of the database:

Yeah, that's a really skewed distribution. Partitioning may work out, especially if you often select on that one field.
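A sketch of the two concrete suggestions above -- per-account planner settings and a higher statistics target -- with a hypothetical role name (ALTER ROLE settings take effect in new sessions for that user):

-- Give the interactive interface its own login role with its own planner
-- settings, leaving other connections untouched:
ALTER ROLE webapp_reader SET enable_seqscan = off;

-- Crank up the statistics target for the skewed column, then re-analyze:
ALTER TABLE maindb_astobject ALTER COLUMN survey_id SET STATISTICS 1000;
ANALYZE maindb_astobject (survey_id);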
Re: What is the right way to deal with a table with rows that are not in a random order?
From
Douglas Alan
Date:
Scott Marlowe wrote:

> Douglas Alan wrote:
>
>> Okay -- no problem:
>>
>> set enable_seqscan = on;
>> explain analyze select * from maindb_astobject
>> where survey_id = 2
>> limit 1000;
>>
>> "Limit (cost=0.00..48.03 rows=1000 width=78) (actual time=84837.835..265938.258 rows=1000 loops=1)"
>> "  ->  Seq Scan on maindb_astobject (cost=0.00..3538556.10 rows=73675167 width=78) (actual time=84837.825..265932.121 rows=1000 loops=1)"
>> "        Filter: (survey_id = 2)"
>> "Total runtime: 265942.416 ms"
>>
>> set enable_seqscan = off;
>> explain analyze select * from maindb_astobject
>> where survey_id = 2
>> limit 1000;
>>
>> "Limit (cost=0.00..67.37 rows=1000 width=78) (actual time=172.248..225.219 rows=1000 loops=1)"
>> "  ->  Index Scan using maindb_astobject_survey_id on maindb_astobject (cost=0.00..4963500.87 rows=73675167 width=78) (actual time=172.240..221.078 rows=1000 loops=1)"
>> "        Index Cond: (survey_id = 2)"
>> "Total runtime: 227.412 ms"
>
> What was the random_page_cost during these two queries?

4

> Assuming seq_page_cost is 1,

Yes, it is.

> and random_page_cost was 4 or something, lowering it should
> force the move to an index scan.

I just tried changing random_page_cost to 1, but the query still does a seq scan.

> If you were already at 1.0 or so, then yeah, the cost
> estimation is off. Since index scans cost CPU (at least I
> think they do), you might try lowering your cpu_* costs to see
> if that helps

How would lowering random_page_cost and all the cpu costs differ from just increasing seq_page_cost? I have to raise seq_page_cost from 1 to 34 to force an index scan. I can't imagine that changing this value so radically would be a good idea.

Alternatively, if I set random_page_cost to 1, and cpu_tuple_cost, cpu_index_tuple_cost, and cpu_operator_cost all to 0, this still doesn't cause Postgres to do an index scan for this query.

>> P.S. Here are the stats on the column. It appears that my recollection
>> of 25% of the table matching was a bit off. It's actually 98.5%! That
>> might explain more why Postgres wants to do a sequential scan. The
>> problem is that still means that it has to scan a million rows
>> sequentially before it finds a single matching row, as the matching
>> rows are at the end of the database:
>
> Yeah, that's a really skewed distribution. Partitioning may work out,
> especially if you often select on that one field.

Is there a way for me to alter the statistics table? I tried changing the values in pg_stats, but that table is just a view, so Postgres won't let me do it. pg_statistic, on the other hand, is rather opaque.

Alternatively, can I delete the statistics for the column? It's the statistics that are hurting me here. To delete the statistics, I tried setting statistics for the column to 0 and analyzing the column, but that just left the current statistics in place. So I tried setting statistics to 1, but that's one value too many to eliminate this problem!

|>ouglas
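What he describes trying, spelled out (a sketch; with a statistics target of 0, ANALYZE simply skips the column, which would explain why the previously collected statistics survived):

-- Setting the target to 0 makes future ANALYZE runs skip the column; it
-- does not remove the row already stored in pg_statistic:
ALTER TABLE maindb_astobject ALTER COLUMN survey_id SET STATISTICS 0;
ANALYZE maindb_astobject (survey_id);

-- Removing the leftover statistics by hand (superuser only, and not a
-- generally supported thing to do; the target must stay at 0 or the next
-- ANALYZE will simply recollect them):
DELETE FROM pg_statistic
WHERE starelid = 'maindb_astobject'::regclass
  AND staattnum = (SELECT attnum FROM pg_attribute
                   WHERE attrelid = 'maindb_astobject'::regclass
                     AND attname = 'survey_id');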