[PATCH] Resolve Parallel Hash Join Performance Issue - Mailing list pgsql-hackers

From Deng, Gang
Subject [PATCH] Resolve Parallel Hash Join Performance Issue
Date
Msg-id 0F44E799048C4849BAE4B91012DB910462E9897A@SHSMSX103.ccr.corp.intel.com
Whole thread Raw
Responses Re: [PATCH] Resolve Parallel Hash Join Performance Issue  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers

Hi all,

 

Attached is a patch to resolve parallel hash join performance issue. This is my first time to contribute patch to PostgreSQL community, I referred one of previous thread as template to report the issue and patch. Please let me know if need more information of the problem and patch.

 

A. Problem Summary

When we ran query which was executed by hash join operation, we can not achieve good performance improvement with more number of threads. More specifically, when we ran query02 of TPC-DS workload using scale 500 (500GB dataset), execution time using 8 threads was 124.6 sec while time using 28 threads was 103.5 sec. Here is execution time by different number of threads:

 

number of thread:    1      4      8        16      28

time used(sec):     460.4   211  124.6    101.9   103.5 

 

The test was  made on a server with 384GB DRAM, 56 cores/112 HT. Data has been cached into OS page cache, so there was no disk I/O during execution, and there were enough physical CPU cores to support 28 threads to run in parallel.

 

We investigated this problem with perf c2c (http://man7.org/linux/man-pages/man1/perf-c2c.1.html) tool, confirmed the problem was caused by false sharing cache coherence. And we located the code write cache line is at line 457 of nodeHashjoin.c (pg version 12.0).

 

B. Patch

    change line 457 in ExecHashJoinImpl function of nodeHashJoin.c. (be applicable to both 12.0 and 12.1)

    original code:

        HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));

    changed to:

        if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))

        {

            HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));

        }

    Compared with original code, modified code can avoid unnecessary write to memory/cache.

 

C. Test case:

  1. Use https://github.com/pivotalguru/TPC-DS to setup TPC-DS benchmark for postgreSQL
  2. run below command to ensure query will be executed with expected parallelism:

            psql postgres -h localhost -U postgres -c "alter table web_sales set (parallel_workers =28);";psql postgres -h localhost -U postgres -c "alter table catalog_sales set (parallel_workers =28);"

      3. run query: psql postgres -h localhost -U postgres -f 102.tpcds.02.sql first to ensure data is loaded into page cache.

      4. run query " time psql postgres -h localhost -U postgres -f 102.tpcds.02.sql" again to measure performance.

 

D. Result

With the modified code, performance of hash join operation can scale better with number of threads. Here is result of query02 after patch. For example, performance improved ~2.5x when run 28 threads.

 

number of thread:    1       4        8     16    28

time used(sec):    465.1  193.1   97.9   55.9  41 

 

I attached 5 files for more information:

  1. query_plan_q2_no_opt_28_thread: query plan using 28 threads without patch 
  2. query_plan_q2_opt_28_thread: query plan using 28 threads with patch
  3. perf_c2c_no_opt.txt: perf c2c output before patch
  4. perf_c2c_opt.txt: perf c2c output after patch
  5. git diff of the patch

 

 

Thanks and Best Regards

 

Deng, Gang (邓刚)

IAGS-CPDP-CEE PRC Enterprise

Mobile: 13161337000

Office: 010-57511964

 

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Allow 'sslkey' and 'sslcert' in postgres_fdw user mappings
Next
From: godjan •
Date:
Subject: Verify true root on replicas with amcheck