Thread: performance expectations for table(s) with 2B recs
On 2021-12-08 14:44:47 -0500, David Gauthier wrote:
> So far, the tables I have in my DB have relatively low numbers of records (most
> are < 10K, all are < 10M). Things have been running great in terms of
> performance. But a project is being brainstormed which may require some tables
> to contain a couple billion records.
[...]
> What else should I be worried about ?
>
> I suspect that part of why things are running really well so far is that the
> relatively small amounts of data in these tables ends up in the DB cache and
> disk I/O is kept at a minimum. Will that no longer be the case once queries
> start running on these big tables ?

Depends a lot on how good the locality of your queries is. If most read
only the same parts of the same indexes, those will still be in the
cache. If they are all over the place or if you have queries which need
to read large parts of your tables, cache misses will make your
performance a lot less predictable, yes. That stuff is also hard to
test, because when you are testing a query twice in a row, the second
time it will likely hit the cache and be quite fast.

But in my experience the biggest problem with large tables is unstable
execution plans - for most of the parameters the optimizer will choose
to use an index, but for some it will erroneously think that a full
table scan is faster. That can lead to a situation where a query
normally takes less than a second, but sometimes (seemingly at random)
it takes several minutes - users will understandably be upset about such
behaviour. It is in any case a good idea to monitor execution times to
find such problems (ideally before users complain), but each needs to be
treated on an individual basis, and sometimes there seems to be no good
solution.

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Attachment
11.5 on linux
server = VM provided by our IT dept (IOW, can be grown if needed)
DB itself is on NFS

So far, the tables I have in my DB have relatively low numbers of records (most
are < 10K, all are < 10M). Things have been running great in terms of
performance. But a project is being brainstormed which may require some tables
to contain a couple billion records.

I'm familiar with the need to properly index columns in these tables which will
be used for table joining and query performance (candidates are columns used in
query predicates, etc...). Let's assume that's done right. And let's assume that
the biggest table will have 20 varchars (<= 16 chars per) and 20 ints. No blobs
or jsonb or anything like that.

What else should I be worried about ?

I suspect that part of why things are running really well so far is that the
relatively small amounts of data in these tables ends up in the DB cache and
disk I/O is kept at a minimum. Will that no longer be the case once queries
start running on these big tables ?

What about DBA stuff... vacuum and fragmentation and index maintenance, etc... ?

I don't want to step into this completely blind. Any warnings/insights would be
appreciated.
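[Editorial note: Peter's suggestion above to monitor execution times can be done with the pg_stat_statements extension. A minimal sketch - it assumes the extension is available and loaded via shared_preload_libraries, which the thread does not state:]

```sql
-- Assumes pg_stat_statements is in shared_preload_libraries
-- (an assumption, not something shown in the thread).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Queries whose runtime varies wildly are candidates for unstable plans:
-- a large stddev relative to the mean suggests the planner sometimes
-- picks a much slower plan for certain parameter values.
SELECT query,
       calls,
       mean_exec_time,
       stddev_exec_time,
       max_exec_time
FROM   pg_stat_statements
WHERE  calls > 100
ORDER  BY stddev_exec_time DESC
LIMIT  20;
```

On the 11.5 server mentioned above the columns are named mean_time, stddev_time and max_time instead; the *_exec_time names date from PostgreSQL 13.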
RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs
On 2021-12-08 14:44:47 -0500, David Gauthier wrote:
> So far, the tables I have in my DB have relatively low numbers of records (most
> are < 10K, all are < 10M). Things have been running great in terms of
> performance. But a project is being brainstormed which may require some tables
> to contain a couple billion records.
[...]
> What else should I be worried about ?
>
> I suspect that part of why things are running really well so far is that the
> relatively small amounts of data in these tables ends up in the DB cache and
> disk I/O is kept at a minimum. Will that no longer be the case once queries
> start running on these big tables ?
>
>Depends a lot on how good the locality of your queries is. If most read
>only the same parts of the same indexes, those will still be in the
>cache. If they are all over the place or if you have queries which need
>to read large parts of your tables, cache misses will make your
>performance a lot less predictable, yes. That stuff is also hard to
>test, because when you are testing a query twice in a row, the second
>time it will likely hit the cache and be quite fast.
>
>But in my experience the biggest problem with large tables is unstable
>execution plans - for most of the parameters the optimizer will choose
>to use an index, but for some it will erroneously think that a full
>table scan is faster. That can lead to a situation where a query
>normally takes less than a second, but sometimes (seemingly at random)
>it takes several minutes - users will understandably be upset about such
>behaviour. It is in any case a good idea to monitor execution times to
>find such problems (ideally before users complain), but each needs to be
>treated on an individual basis, and sometimes there seems to be no good
>solution.
To the OP, that is a tall order to answer – basically that’s why DBAs still have
jobs…
For Peter I have a question: what exactly causes ‘unstable execution plans’?
Besides not using bind variables and bad statistics, would you elaborate on what
would contribute to that instability?
On 2021-12-10 18:04:07 +0000, Godfrin, Philippe E wrote:
> >But in my experience the biggest problem with large tables is unstable
> >execution plans - for most of the parameters the optimizer will choose
> >to use an index, but for some it will erroneously think that a full
> >table scan is faster. That can lead to a situation where a query
> >normally takes less than a second, but sometimes (seemingly at random)
> >it takes several minutes
[...]
> For Peter I have a question. What exactly causes ‘unstable execution plans’ ??
>
> Besides not using bind variables, bad statistics, would you elaborate
> in what would contribute to that instability?

Not using bind variables and bad statistics are certainly big factors:

On one hand not using bind variables gives a lot more information to the
optimizer, so it can choose a better plan at run time. On the other hand
that makes it hard to predict what plan it will choose.

Bad statistics come in many flavours: They might just be wrong; that's
usually easy to fix. More problematic are statistics which just don't
describe reality very well - they may not show a correlation, causing
the optimizer to assume that two distributions are independent when they
really aren't (since PostgreSQL 10 you can create statistics on multiple
columns, which helps in many but not all cases), or not show some other
peculiarity of the data. Or they may be just so close to a flipping
point that a small change causes the optimizer to choose a wildly
different plan.

Another source is dynamically generated SQL. Your application may just
put together SQL from fragments or it might use something like
SQLAlchemy or an ORM. In any of these cases what looks like one query
from a user's perspective may really be a whole family of related
queries - and PostgreSQL will try to find the optimal plan for each of
them. Which is generally a good thing, but it adds opportunities to mess
up.

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Attachment
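[Editorial note: the multi-column statistics Peter mentions (available since PostgreSQL 10) are created with CREATE STATISTICS. A sketch with a hypothetical table - the names and the city/zip correlation are illustrative, not from the thread:]

```sql
-- Hypothetical example: city and zip code are strongly correlated, so
-- independent per-column statistics make the planner underestimate the
-- row count for the combined predicate.
CREATE TABLE addresses (
    city varchar(16),
    zip  varchar(16)
);

-- Functional-dependency statistics tell the planner the columns are
-- correlated (PostgreSQL 10+).
CREATE STATISTICS addresses_city_zip (dependencies)
    ON city, zip FROM addresses;

ANALYZE addresses;

-- The planner can now use the dependency when estimating, e.g. for:
--   SELECT * FROM addresses
--   WHERE city = 'Springfield' AND zip = '01101';
```

This helps with the "missing correlation" flavour of bad statistics, but, as Peter notes, not with every case.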
RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs
>
>
>From: Peter J. Holzer hjp-pgsql@hjp.at
>Sent: Friday, December 10, 2021 3:43 PM
>To: pgsql-general@lists.postgresql.org
>Subject: Re: [EXTERNAL] Re: performance expectations for table(s) with 2B recs
>
>On 2021-12-10 18:04:07 +0000, Godfrin, Philippe E wrote:
>> >But in my experience the biggest problem with large tables is unstable
>> >execution plans - for most of the parameters the optimizer will choose
>> >to use an index, but for some it will erroneously think that a full
>> >table scan is faster. That can lead to a situation where a query
>> >normally takes less than a second, but sometimes (seemingly at random)
>> >it takes several minutes
>[...]
>> For Peter I have a question. What exactly causes ‘unstable execution plans’ ??
>>
>> Besides not using bind variables, bad statistics, would you elaborate
>> in what would contribute to that instability?
>
>Not using bind variables and bad statistics are certainly big factors:
>
>On one hand not using bind variables gives a lot more information to the
>optimizer, so it can choose a better plan at run time. On the other hand
>that makes it hard to predict what plan it will choose.
>
>Bad statistics come in many flavours: They might just be wrong, that's
>usually easy to fix. More problematic are statistics which just don't
>describe reality very well - they may not show a correlation, causing
>the optimizer to assume that two distributions are independent when they
>really aren't (since PostgreSQL 10 you can create statistics on multiple
>columns which helps in many but not all cases) or not show some other
>peculiarity of the data. Or they may be just so close to a flipping
>point that a small change causes the optimizer to choose a wildly
>different plan.
>
>Another source is dynamically generated SQL. Your application may just
>put together SQL from fragments or it might use something like
>SQLAlchemy or an ORM. In any of these cases what looks like one query
>from a user's perspective may really be a whole family of related
>queries - and PostgreSQL will try to find the optimal plan for each of
>them. Which is generally a good thing, but it adds opportunities to mess
>up.
>
>hp
>
>--
>   _  | Peter J. Holzer    | Story must make more sense than reality.
>|_|_) |                    |
>| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
>__/   | http://www.hjp.at/ |       challenge!"
Good answer Peter, I agree wholeheartedly. I was curious if there was something specific to PostgreSQL.
phil
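[Editorial note: one thing that is specific to PostgreSQL in the bind-variable discussion is its custom-vs-generic plan caching for prepared statements. A sketch, with a hypothetical orders table:]

```sql
-- Hypothetical skewed table: one status value dominates, so the best
-- plan differs depending on which status is queried.
PREPARE by_status (text) AS
    SELECT count(*) FROM orders WHERE status = $1;

-- The first five executions are planned with the actual parameter value
-- ("custom plans"); from the sixth on, PostgreSQL may switch to a cached
-- "generic plan" if that looks no more expensive on average - at which
-- point a rare status value can suddenly get the plan chosen for the
-- common one, producing exactly the "seemingly random" slowdowns Peter
-- describes.
EXECUTE by_status('shipped');

-- PostgreSQL 12+ makes the behaviour controllable (not available on the
-- 11.5 server mentioned earlier in the thread):
--   SET plan_cache_mode = force_custom_plan;   -- always re-plan
--   SET plan_cache_mode = force_generic_plan;  -- always use the cached plan
```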