Thread: performance expectations for table(s) with 2B recs

performance expectations for table(s) with 2B recs

From
David Gauthier
Date:
11.5 on linux
server = VM provided by our IT dept (IOW, can be grown if needed)
DB itself is on NFS

So far, the tables I have in my DB have relatively low numbers of records (most are < 10K, all are < 10M).  Things have been running great in terms of performance.  But a project is being brainstormed which may require some tables to contain a couple billion records.

I'm familiar with the need to properly index the columns in these tables that will be used for joins and query performance (candidates are columns used in query predicates, etc.).  Let's assume that's done right.  And let's assume that the biggest table will have 20 varchars (<= 16 chars each) and 20 ints.  No blobs or jsonb or anything like that.
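
Something along these lines, just to give a sense of the shape (table, column, and index names here are invented):

-- sketch only; the real table would have ~20 varchar(16) and ~20 int columns
CREATE TABLE big_results (
    run_id      int NOT NULL,
    project     varchar(16) NOT NULL,
    step_name   varchar(16),
    status      varchar(16),
    -- ...plus ~16 more varchar(16) columns...
    attempt     int,
    duration_s  int
    -- ...plus ~18 more int columns...
);

-- indexes on the join/predicate columns
CREATE INDEX ON big_results (run_id);
CREATE INDEX ON big_results (project, status);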

What else should I be worried about ?

I suspect that part of why things are running really well so far is that the relatively small amounts of data in these tables ends up in the DB cache and disk I/O is kept at a minimum.  Will that no longer be the case once queries start running on these big tables ?

What about DBA stuff... vacuum and fragmentation and index maintenance, etc... ?

I don't want to step into this completely blind.  Any warnings/insights would be appreciated.

Re: performance expectations for table(s) with 2B recs

From
"Peter J. Holzer"
Date:
On 2021-12-08 14:44:47 -0500, David Gauthier wrote:
> So far, the tables I have in my DB have relatively low numbers of records (most
> are < 10K, all are < 10M).  Things have been running great in terms of
> performance.  But a project is being brainstormed which may require some tables
> to contain a couple billion records.
[...]
> What else should I be worried about ?
>
> I suspect that part of why things are running really well so far is that the
> relatively small amounts of data in these tables ends up in the DB cache and
> disk I/O is kept at a minimum.  Will that no longer be the case once queries
> start running on these big tables ?

Depends a lot on how good the locality of your queries is. If most
queries read only the same parts of the same indexes, those parts will
still be in the cache. If they are all over the place, or if you have
queries which need to read large parts of your tables, cache misses will
make your performance a lot less predictable, yes. That stuff is also
hard to test, because when you run a query twice in a row, the second
run will likely hit the cache and be quite fast.
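
One way to see where the blocks are coming from is EXPLAIN (ANALYZE,
BUFFERS): "shared hit" means the block was found in shared_buffers,
"read" means it had to come from the OS cache or disk. For example
(table and column names invented):

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM big_results
WHERE project = 'chipA' AND status = 'fail';
-- run it twice: the second run will typically show mostly "shared hit"
-- and far fewer "read" blocks, which is exactly the caching effect above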

But in my experience the biggest problem with large tables is unstable
execution plans - for most parameter values the optimizer will choose
to use an index, but for some it will erroneously think that a full
table scan is faster. That can lead to a situation where a query
normally takes less than a second, but sometimes (seemingly at random)
it takes several minutes - users will understandably be upset about
such behaviour. It is in any case a good idea to monitor execution
times to find such problems (ideally before users complain), but each
one needs to be treated on an individual basis, and sometimes there
seems to be no good solution.
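
If you have the pg_stat_statements extension installed, one rough way to
spot such queries is to look for statements whose worst case is far
above their average, for example:

-- needs shared_preload_libraries = 'pg_stat_statements'
-- and CREATE EXTENSION pg_stat_statements;
-- (column names are the pre-13 ones; 13+ calls them mean_exec_time etc.)
SELECT query,
       calls,
       round(mean_time::numeric, 1)   AS mean_ms,
       round(max_time::numeric, 1)    AS max_ms,
       round(stddev_time::numeric, 1) AS stddev_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY max_time / greatest(mean_time, 0.01) DESC
LIMIT 20;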

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: performance expectations for table(s) with 2B recs

From
Marc Millas
Date:
Partitioning ?
If you have some ideas about how this data is accessed, splitting those
big tables into partitions may help:
- vacuum is done at the partition level,
- indexes are partitioned too, so they are much easier to manage
  (REINDEX is local to a given partition, so it's quite easy to reindex
  the whole thing one partition at a time)

It's great for performance, too, IF the partition key is in the WHERE
clause. If not, Postgres will need to aggregate and sort the results
gathered from each partition. That could still be efficient if it
allows parallel execution.
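
A quick sketch of what that could look like with declarative
partitioning (PG 11 syntax, all names invented):

CREATE TABLE big_results (
    run_id     int         NOT NULL,
    project    varchar(16) NOT NULL,
    status     varchar(16),
    duration_s int
) PARTITION BY HASH (run_id);

CREATE TABLE big_results_p0 PARTITION OF big_results
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE big_results_p1 PARTITION OF big_results
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE big_results_p2 PARTITION OF big_results
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE big_results_p3 PARTITION OF big_results
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- since PG 11 an index on the parent is created on every partition
CREATE INDEX ON big_results (project, status);

-- maintenance can then be done one partition at a time:
VACUUM (VERBOSE) big_results_p0;
REINDEX TABLE big_results_p1;

If the data arrives over time, range partitioning on a date column is
often the more natural choice, since old partitions can then be
detached or dropped cheaply.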

my 2 cents


Marc MILLAS
Senior Architect
+33607850334



On Wed, Dec 8, 2021 at 8:45 PM David Gauthier <davegauthierpg@gmail.com> wrote:
> 11.5 on linux
> server = VM provided by our IT dept (IOW, can be grown if needed)
> DB itself is on NFS
>
> So far, the tables I have in my DB have relatively low numbers of
> records (most are < 10K, all are < 10M).  Things have been running
> great in terms of performance.  But a project is being brainstormed
> which may require some tables to contain a couple billion records.
[...]

RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs

From
"Godfrin, Philippe E"
Date:

 

On 2021-12-08 14:44:47 -0500, David Gauthier wrote:
> So far, the tables I have in my DB have relatively low numbers of records (most
> are < 10K, all are < 10M).  Things have been running great in terms of
> performance.  But a project is being brainstormed which may require some tables
> to contain a couple billion records.
[...]
> What else should I be worried about ?
>
> I suspect that part of why things are running really well so far is that the
> relatively small amounts of data in these tables ends up in the DB cache and
> disk I/O is kept at a minimum.  Will that no longer be the case once queries
> start running on these big tables ?

> [...]
>
> But in my experience the biggest problem with large tables is unstable
> execution plans - for most parameter values the optimizer will choose
> to use an index, but for some it will erroneously think that a full
> table scan is faster. That can lead to a situation where a query
> normally takes less than a second, but sometimes (seemingly at random)
> it takes several minutes - users will understandably be upset about
> such behaviour. It is in any case a good idea to monitor execution
> times to find such problems (ideally before users complain), but each
> one needs to be treated on an individual basis, and sometimes there
> seems to be no good solution.

 

To the OP, that’s is a tall order to answer – basically that’s wjhy DBA’s still have

Jobs…

 

For Peter I have a question. What exactly causes ‘unstable execution plans’ ??

 

Besides not using bind variables, bad statistics, would you elaborate in what would

contribute to that instability?




Re: [EXTERNAL] Re: performance expectations for table(s) with 2B recs

From
"Peter J. Holzer"
Date:
On 2021-12-10 18:04:07 +0000, Godfrin, Philippe E wrote:
> >But in my experience the biggest problem with large tables is unstable
> >execution plans - for most parameter values the optimizer will choose
> >to use an index, but for some it will erroneously think that a full
> >table scan is faster. That can lead to a situation where a query
> >normally takes less than a second, but sometimes (seemingly at random)
> >it takes several minutes
[...]
> For Peter I have a question. What exactly causes ‘unstable execution plans’ ??
>
> Besides not using bind variables and bad statistics, would you elaborate
> on what would contribute to that instability?

Not using bind variables and bad statistics are certainly big factors:

On the one hand, not using bind variables gives the optimizer a lot more
information, so it can choose a better plan at run time. On the other
hand, that makes it hard to predict which plan it will choose.
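
A tiny illustration (table and column names invented):

-- with a literal, the planner sees the actual value and its statistics
EXPLAIN SELECT * FROM events WHERE status = 'fail';

-- with a bind variable the value may not be known at plan time
PREPARE by_status(text) AS
    SELECT * FROM events WHERE status = $1;
EXPLAIN EXECUTE by_status('fail');
-- after a few executions the server may switch to a cached generic plan,
-- which need not match the plan you got for the literal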

Bad statistics come in many flavours: they might simply be wrong, which
is usually easy to fix. More problematic are statistics which just don't
describe reality very well - they may not show a correlation, causing
the optimizer to assume that two distributions are independent when they
really aren't (since PostgreSQL 10 you can create statistics on multiple
columns, which helps in many but not all cases), or they may not show
some other peculiarity of the data. Or they may be just so close to a
tipping point that a small change causes the optimizer to choose a
wildly different plan.
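
For the correlation case, the multi-column statistics mentioned above
look like this (names invented):

-- city and zip_code are strongly correlated; without extended statistics
-- the planner multiplies the per-column selectivities and underestimates
CREATE STATISTICS addr_city_zip (dependencies, ndistinct)
    ON city, zip_code FROM addresses;
ANALYZE addresses;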

Another source is dynamically generated SQL. Your application may put
together SQL from fragments, or it might use something like SQLAlchemy
or an ORM. In any of these cases, what looks like one query from a
user's perspective may really be a whole family of related queries - and
PostgreSQL will try to find the optimal plan for each of them. That is
generally a good thing, but it adds opportunities to mess up.
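
For example, what the application treats as one search query may reach
the server as several distinct statements (names invented), each of
which gets its own plan:

-- generated when only a status filter is set
SELECT * FROM events
WHERE status = 'fail'
ORDER BY created_at DESC LIMIT 50;

-- generated when a date filter is added as well
SELECT * FROM events
WHERE status = 'fail' AND created_at >= '2021-12-01'
ORDER BY created_at DESC LIMIT 50;
-- one of these may happily use an index, while the other tips over
-- into a different plan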

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


RE: [EXTERNAL] Re: performance expectations for table(s) with 2B recs

From
"Godfrin, Philippe E"
Date:

>From: Peter J. Holzer hjp-pgsql@hjp.at
>Sent: Friday, December 10, 2021 3:43 PM
>To: pgsql-general@lists.postgresql.org
>Subject: Re: [EXTERNAL] Re: performance expectations for table(s) with 2B recs
>
>Not using bind variables and bad statistics are certainly big factors:
>
>[...]

 

Good answer Peter, I agree wholeheartedly. I was curious if there was something specific to PostgreSQL.

phil