600 million rows of data. Bad hardware or need partitioning? - Mailing list pgsql-performance

From Arya F
Subject 600 million rows of data. Bad hardware or need partitioning?
Date
Msg-id CAFoK1ay01_ND0QrCcAAHX7U=rTCssB8vfOiH+iv5b8Qh6WN80g@mail.gmail.com
Responses Re: 600 million rows of data. Bad hardware or need partitioning?  (Michael Lewis <mlewis@entrata.com>)
List pgsql-performance
I have created the following table to reproduce my performance
numbers; I have simplified it for this question.

I'm running PostgreSQL 12 on the following hardware.

Dual Xeon Quad-Core E5320 1.86GHz
4GB of RAM

The table structure is

 id uuid
 address_api_url text
 check_timestamp timestamp with time zone
 address text
Indexes:
    "new_table_pkey" PRIMARY KEY, btree (id)
    "test_table_check_timestamp_idx" btree (check_timestamp)

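For reference, DDL along these lines should reproduce that structure (a sketch inferred from the output above; the primary key constraint is named to match it, and everything else uses defaults):

CREATE TABLE test_table (
    id               uuid,
    address_api_url  text,
    check_timestamp  timestamp with time zone,
    address          text,
    CONSTRAINT new_table_pkey PRIMARY KEY (id)
);

CREATE INDEX test_table_check_timestamp_idx ON test_table (check_timestamp);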

Right now the table has 100 million rows, but I expect it to reach
about 600-700 million. I am faced with slow updates/inserts, and the
issue is caused by the indexes, since they have to be updated on every
insert/update. If I remove the indexes, insert performance remains
excellent even with millions of rows.
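
As a rough illustration of that difference, dropping the secondary index before a bulk load and rebuilding it afterwards avoids the per-row index maintenance (a sketch only; the generate_series load, URL and address values are made-up test data, and gen_random_uuid() needs the pgcrypto extension on PostgreSQL 12):

-- Drop the secondary index, load the data, then rebuild the index once.
DROP INDEX IF EXISTS test_table_check_timestamp_idx;

INSERT INTO test_table (id, address_api_url, check_timestamp, address)
SELECT gen_random_uuid(), 'https://example.invalid/api', now(), 'placeholder address'
FROM generate_series(1, 1000000);

CREATE INDEX test_table_check_timestamp_idx ON test_table (check_timestamp);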

To demonstrate the update performance, I have constructed the following
query, which updates the timestamp of 10,000 rows:

UPDATE test_table SET check_timestamp = now()
FROM (SELECT id FROM test_table LIMIT 10000) AS subquery
WHERE test_table.id = subquery.id;

That update took about 1 minute and 44 seconds
Time: 104254.392 ms (01:44.254)

Below is the EXPLAIN ANALYZE output:


EXPLAIN ANALYZE UPDATE test_table SET check_timestamp = now()
FROM (SELECT id FROM test_table LIMIT 10000) AS subquery
WHERE test_table.id = subquery.id;

  QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on test_table  (cost=0.57..28234.86 rows=10000 width=160) (actual time=102081.905..102081.905 rows=0 loops=1)
   ->  Nested Loop  (cost=0.57..28234.86 rows=10000 width=160) (actual time=32.286..101678.652 rows=10000 loops=1)
         ->  Subquery Scan on subquery  (cost=0.00..514.96 rows=10000 width=56) (actual time=0.048..45.127 rows=10000 loops=1)
               ->  Limit  (cost=0.00..414.96 rows=10000 width=16) (actual time=0.042..26.319 rows=10000 loops=1)
                     ->  Seq Scan on test_table test_table_1  (cost=0.00..4199520.04 rows=101204004 width=16) (actual time=0.040..21.542 rows=10000 loops=1)
         ->  Index Scan using new_table_pkey on test_table  (cost=0.57..2.77 rows=1 width=92) (actual time=10.160..10.160 rows=1 loops=10000)
               Index Cond: (id = subquery.id)
 Planning Time: 0.319 ms
 Execution Time: 102081.967 ms
(9 rows)

Time: 102122.421 ms (01:42.122)



With the right hardware, can a single unpartitioned table handle 600
million rows with good insert/update performance? If so, what kind of
hardware should I be looking at? Or would I need to create partitions?
I'd like to hear some recommendations.
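
In case it helps frame the partitioning option, declarative range partitioning on check_timestamp in PostgreSQL 12 would look roughly like this (a sketch only; the table name, monthly ranges and partition names are made up, and whether check_timestamp is even the right partition key depends on the access patterns):

CREATE TABLE test_table_partitioned (
    id               uuid NOT NULL,
    address_api_url  text,
    check_timestamp  timestamp with time zone NOT NULL,
    address          text,
    PRIMARY KEY (id, check_timestamp)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (check_timestamp);

CREATE TABLE test_table_2020_05 PARTITION OF test_table_partitioned
    FOR VALUES FROM ('2020-05-01') TO ('2020-06-01');

CREATE TABLE test_table_2020_06 PARTITION OF test_table_partitioned
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');

CREATE INDEX ON test_table_partitioned (check_timestamp);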


