Thread: need suggestion on querying big tables

need suggestion on querying big tables

From
Ebin Jozer
Date:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin

Re: need suggestion on querying big tables

From
Holger Jakobs
Date:
On 05.12.22 at 07:29, Ebin Jozer wrote:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin

There are basically two "secrets" to performance in this case:

  1. make sure that the join operation uses indexes
  2. make sure that you have enough work_mem available

work_mem can be set in postgresql.conf, but also in every session individually. So don't set it too high generally, because it will be allocated for every sorting and hashing operation.
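For illustration, here is a minimal sketch of both points, assuming hypothetical tables big_a and big_b joined on an id column (the real table, column, and index names are not in the thread):

  -- Index the join columns (names are illustrative):
  CREATE INDEX CONCURRENTLY IF NOT EXISTS big_a_id_idx   ON big_a (id);
  CREATE INDEX CONCURRENTLY IF NOT EXISTS big_b_a_id_idx ON big_b (a_id);

  -- Raise work_mem for this session only, then check that the plan really uses the indexes:
  SET work_mem = '256MB';
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT a.id, b.payload
  FROM   big_a a
  JOIN   big_b b ON b.a_id = a.id
  WHERE  a.id BETWEEN 1 AND 100000;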

Kind Regards,

Holger

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

Re: need suggestion on querying big tables

From
Bo Guo
Date:
Partitioning the table if possible should also help.

Cheers,

Bo Guo


On Mon, Dec 5, 2022 at 2:35 AM Holger Jakobs <holger@jakobs.com> wrote:
On 05.12.22 at 07:29, Ebin Jozer wrote:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin

There are basically two "secrets" to performance in this case:

  1. make sure that the join operation uses indexes
  2. make sure that you have enough work_mem available

work_mem can be set in postgresql.conf, but also in every session individually. So don't set it too high generally, because it will be allocated for every sorting and hashing operation.

Kind Regards,

Holger

-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

Re: need suggestion on querying big tables

From
Paul Smith
Date:
On 05/12/2022 12:14, Bo Guo wrote:
> Partitioning the table if possible should also help.

This really only helps if the partitioning is relevant to the query. 
Otherwise it can slow things down.

E.g., if you partition on date and search on date, the partitioning will probably help; but if you partition on date and search on name, it will be no quicker, and possibly slower.
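As a hedged sketch (table and column names are purely illustrative), declarative range partitioning in PostgreSQL 11 looks like this; only a query that filters on the partition key can be pruned:

  CREATE TABLE sales (
      id            bigint NOT NULL,
      sale_date     date   NOT NULL,
      customer_name text
  ) PARTITION BY RANGE (sale_date);

  CREATE TABLE sales_2022_q4 PARTITION OF sales
      FOR VALUES FROM ('2022-10-01') TO ('2023-01-01');

  -- Can prune to the matching partition(s):
  SELECT count(*) FROM sales WHERE sale_date >= '2022-11-01';
  -- Must scan every partition:
  SELECT count(*) FROM sales WHERE customer_name = 'Smith';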


Paul


-- 


Paul Smith Computer Services
Tel: 01484 855800
Vat No: GB 685 6987 53

Sign up for news & updates at http://www.pscs.co.uk/go/subscribe



Re: need suggestion on querying big tables

From
Ron
Date:
On 12/5/22 01:35, Holger Jakobs wrote:
On 05.12.22 at 07:29, Ebin Jozer wrote:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin

There are basically two "secrets" to performance in this case:

  1. make sure that the join operation uses indexes
  2. make sure that you have enough work_mem available

work_mem can be set in postgresql.conf, but also in every session individually. So don't set it too high generally, because it will be allocated for every sorting and hashing operation.


A "just smart enough" user could perform an inadvertent Denial of Service attack on the database by cranking his session work_mem really high, no?

--
Angular momentum makes the world go 'round.

Re: need suggestion on querying big tables

From
John Wiencek
Date:
Verify that your statistics are current.
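For instance (just a sketch; the table names are placeholders), you can check when the planner statistics were last refreshed and refresh them manually:

  SELECT relname, last_analyze, last_autoanalyze
  FROM   pg_stat_user_tables
  WHERE  relname IN ('big_a', 'big_b');

  ANALYZE big_a;
  ANALYZE big_b;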


John

On Dec 5, 2022, at 12:29 AM, Ebin Jozer <ebinjozer@gmail.com> wrote:

Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin


Re: need suggestion on querying big tables

From
Mladen Gogala
Date:
On 12/5/22 01:29, Ebin Jozer wrote:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

Thanks & Regards,
Ebin


The first thing you should realize is that your server is really small. Things like NVMe storage, more memory and more cores should help. It's a business decision: if your application is important to your business, you should invest some money in appropriate hardware. You should also consider partitioning both tables and upgrading to at least Postgres 14. The fear of the latest version is understandable, and I feel it myself, but newer versions behave much better with respect to partitioning. Be aware, though, that any unique index or primary key on a partitioned table will have to include the partitioning key. Also, parallelism in a query helps.
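For example, a rough sketch of encouraging a parallel scan for one session (the value and table name are illustrative, not tuned recommendations):

  SET max_parallel_workers_per_gather = 4;   -- default is 2
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM big_a;                -- look for Gather / Parallel Seq Scan nodes in the plan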

64 GB and 8 cores is a desktop class machine. Here is my desktop machine:

[mgogala@umajor ~]$ tail -20 /proc/cpuinfo
cache size    : 512 KB
physical id    : 0
siblings    : 16
core id        : 7
cpu cores    : 8
apicid        : 15
initial apicid    : 15
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
bugs        : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed
bogomips    : 7385.36
TLB size    : 2560 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

[mgogala@umajor ~]$ head /proc/meminfo
MemTotal:       65781164 kB
MemFree:        48364252 kB
MemAvailable:   47110532 kB
Buffers:           14812 kB
Cached:         10199584 kB
SwapCached:            0 kB
Active:          3949732 kB
Inactive:        9434600 kB
Active(anon):       4304 kB
Inactive(anon):  3211636 kB

It's faster than your DB server and it has an NVMe drive. And this is a $2K machine from late 2018. Today, I would go for a DDR5 machine with at least 64 cores, 1 TB of RAM, and at least 20 TB of NVMe drives. Such a machine would cost around $50K.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: need suggestion on querying big tables

From
Mladen Gogala
Date:
On 12/5/22 09:45, Ron wrote:
A "just smart enough" user could perform an inadvertent Denial of Service attack on the database by cranking his session work_mem really high, no?

Well, some of the applications that I've seen during my career could qualify as a Denial of Service suicide attack.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: need suggestion on querying big tables

From
Ebin Jozer
Date:
Hi all,
Thank you for your suggestions.

I have created the indexes. If I run EXPLAIN on each table individually, the estimated cost is reduced, and a count on both tables returns a result.

But when we do an inner join on the two big tables, /dev/shm usage keeps increasing and the query sometimes fails with the error "could not resize shared memory segment: no space left on device" (I think I ran 2 queries on the same table), and sometimes memory usage reaches 100 percent.

Earlier, when we ran the query, it ended with the same error. We increased work_mem to 10 GB and reduced max_worker_processes to 1. After that I didn't see that error.


Do we need to increase the memory? Will that help?

As for partitioning, I still need the app team to confirm; I have proposed it to them.

On Tue, 6 Dec, 2022, 10:29 am Mladen Gogala, <gogala.mladen@gmail.com> wrote:
On 12/5/22 09:45, Ron wrote:
A "just smart enough" user could perform an inadvertent Denial of Service attack on the database by cranking his session work_mem really high, no?

Well, some of the applications that I've seen during my career could qualify as a Denial of Service suicide attack.

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: need suggestion on querying big tables

From
Ron
Date:
On 12/5/22 00:29, Ebin Jozer wrote:
Hi Team,
In PostgreSQL 11 we have two tables of size 435 GB and 347 GB.
If we query a single table, or do an inner join on both of the big tables, it does not display any output; it just keeps running.

We can see the wait event is IO / directDatafile.

Server spec: 8 cores and 64 GB RAM
PG config: effective_cache_size = 53 GB, shared_buffers = 12 GB

Can you please suggest some ideas on how we can query these big tables and speed up getting the output?

https://www.postgresql.org/docs/11/runtime-config-query.html

effective_cache_size is how much RAM a single query gets.  "When setting this parameter you should consider both PostgreSQL's shared buffers".

https://www.postgresql.org/docs/11/runtime-config-resource.html

"a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount."

You've set effective_cache_size to 83%, when it should be at most 20%.
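(For reference, the percentages above come from 64 GB of total RAM: effective_cache_size of 53 GB ≈ 83 %, shared_buffers of 12 GB ≈ 19 %, 25 % would be 16 GB, and 40 % would be about 25.6 GB.)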

--
Angular momentum makes the world go 'round.

Re: need suggestion on querying big tables

From
Thomas Kellerer
Date:
Ron wrote on 06.12.2022 at 15:51:
>> We can see the wait event is IO and directDatafile .
>>
>> Server Spec : 8 cores and  64GB RAM PG config : 53
>> GB(effective_cache), 12 GB(shared buffer)
>>
>> can you please suggest some ideas , how we can query on big tables
>> and fasten them to get the output??
>
> https://www.postgresql.org/docs/11/runtime-config-query.html
>
> effective_cache_size is how much RAM *a single* query gets.  "When
> setting this parameter you should consider both *PostgreSQL's shared
> buffers*".
>
> https://www.postgresql.org/docs/11/runtime-config-resource.html
>
> "a reasonable starting value for *shared_buffers is 25% of the
> memory* in your system. There are some workloads where even larger
> settings for shared_buffers are effective, but because PostgreSQL
> also relies on the operating system cache, *it is unlikely that an
> allocation of more than 40% of RAM to shared_buffers will work
> better* than a smaller amount."
>
> You've set effective_cache_size to 83%, when it should be at most
> 20%.


effective_cache_size does not allocate any memory.
It's not a setting that controls "how much memory a query gets".

effective_cache_size is merely a hint to the optimizer about the amount of data that is likely to be in the cache, whether in the file system cache or in shared_buffers.

The current setting of 12 GB for shared_buffers is well inside the recommended limits (40% would be about 25 GB).
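To see the current values without changing anything (a trivial sketch):

  SHOW effective_cache_size;
  SHOW shared_buffers;
  SELECT name, setting, unit, context
  FROM   pg_settings
  WHERE  name IN ('effective_cache_size', 'shared_buffers', 'work_mem');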





Re: need suggestion on querying big tables

From
Tom Lane
Date:
Ron <ronljohnsonjr@gmail.com> writes:
> You've set effective_cache_size to 83%, when it should be at most 20%.

I think you're confusing effective_cache_size with the size of
shared buffers.  It's meant to be an estimate of the total disk
cache space available, in both shared buffers and OS caches.

In any case, it's a fairly spongy parameter.  It doesn't correspond
to any allocation actually made or requested in Postgres; it just
serves to tweak the planner's guesses about how expensive repeated
indexscans are likely to be.  83% of available RAM might be too
much, but it's probably not causing any real problems.

            regards, tom lane



Re: need suggestion on querying big tables

From
Ron
Date:
On 12/6/22 09:03, Tom Lane wrote:
> Ron <ronljohnsonjr@gmail.com> writes:
>> You've set effective_cache_size to 83%, when it should be at most 20%.
> I think you're confusing effective_cache_size with the size of
> shared buffers.  It's meant to be an estimate of the total disk
> cache space available, in both shared buffers and OS caches.
> In any case, it's a fairly spongy parameter.  It doesn't correspond
> to any allocation actually made or requested in Postgres; it just
> serves to tweak the planner's guesses about how expensive repeated
> indexscans are likely to be.  83% of available RAM might be too
> much, but it's probably not causing any real problems.

The docs say that ECS is an estimate of how much cache one query can use, 
no?  If more than one query is running on a server at a time, then wouldn't 
a huge ECS confuse things?

-- 
Angular momentum makes the world go 'round.



Re: need suggestion on querying big tables

From
Scott Ribe
Date:
> On Dec 6, 2022, at 8:51 AM, Ron <ronljohnsonjr@gmail.com> wrote:
>
> https://www.postgresql.org/docs/11/runtime-config-query.html
>
> effective_cache_size is how much RAM a single query gets.  "When setting this parameter you should consider both PostgreSQL's shared buffers".
>
> https://www.postgresql.org/docs/11/runtime-config-resource.html
>
> "a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where
evenlarger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache,
itis unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount." 
>
> You've set effective_cache_size to 83%, when it should be at most 20%.

effective_cache_size is a hint as to how much RAM is available for the operating system file cache after all other uses are taken into account.


Re: need suggestion on querying big tables

From
Samed YILDIRIM
Date:
Hi Ebin,

If you can share your tables' definitions along with the index information on them, plus your query and its execution plan, it would lead to a more fruitful discussion.

In addition to what folks have already said, I recommend you also take other columns into account, such as the ones in the WHERE clause, the ORDER BY clause, and those returned by the query. Adding those columns to the index's column list, or putting them in an INCLUDE clause, may help.

Another hint concerns the expressions you use in the query. You can create an index on expressions, which saves PostgreSQL from having to compute them while executing the query.

Regarding indexes, you can also consider partial indexes. If some portion of the data is not relevant to your query, you can exclude it in advance and end up with a smaller index and better performance. If it is small enough to fit in memory, even better.

Another significant topic is the type of index. Multiple indexing methods and many operator classes are available in PostgreSQL out of the box. Depending on the characteristics of your data and your query, the solution that gets results lightning fast may be hiding there.
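To illustrate the index ideas above, a hedged sketch (the table and column names are placeholders, not from the thread):

  -- Covering index: extra returned columns go into INCLUDE so an index-only scan is possible (PostgreSQL 11+)
  CREATE INDEX ON orders (customer_id, order_date) INCLUDE (total_amount);

  -- Expression index: the expression is computed at index time instead of per row at query time
  CREATE INDEX ON customers (lower(email));

  -- Partial index: index only the rows the query cares about, keeping the index small
  CREATE INDEX ON orders (order_date) WHERE status = 'open';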

Last but not least is vacuum. People have already mentioned your tables' statistics. What about vacuum? When did you last vacuum those tables?

If you want to go even deeper, columns' storage settings can make a significant difference in some edge cases.

I haven't even started talking about partitioning, check constraints, constraint exclusion, partition pruning, JIT, or other kinds of hacks, such as OFFSET 0 :)

Best regards.
Samed YILDIRIM

On Mon, 5 Dec 2022, 08:30 Ebin Jozer, <ebinjozer@gmail.com> wrote:
Hi Team,
in postgresql 11 version we have two tables of size 435 GB and 347 GB. 
if we query on single table or if we are doing inner join on both the big tables, it is not displacing any output, it keeps running 

We can see the wait event is IO and directDatafile .

Server Spec : 8 cores and  64GB RAM
PG config : 53 GB(effective_cache), 12 GB(shared buffer) 

can you please suggest some ideas , how we can query on big tables and fasten them to get the output??

Thanks & Regards,
Ebin