Thread: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
I am experiencing a strange performance problem when accessing JSONB content by primary key.

My DB version() is PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
postgres.conf: https://justpaste.it/6pzz1
uname -a
: Linux postgresnlpslave 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
The machine is virtual, running under Hyper-V.
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1x1 cores
Disk storage: the host has two vmdx drives, the first shared between the root partition and an LVM PV, the second a single LVM PV. Both PVs are in a VG containing the swap and postgres data partitions. The data is mostly on the first PV.

I have the following table:

CREATE TABLE articles
(
    article_id bigint NOT NULL,
    content jsonb NOT NULL,
    published_at timestamp without time zone NOT NULL,
    appended_at timestamp without time zone NOT NULL,
    source_id integer NOT NULL,
    language character varying(2) NOT NULL,
    title text NOT NULL,
    topics topic[] NOT NULL,
    objects object[] NOT NULL,
    cluster_id bigint NOT NULL,
    CONSTRAINT articles_pkey PRIMARY KEY (article_id)
)


We have a Python lib (using the psycopg2 driver) to access this table. It executes simple queries against the table; one of them is used for bulk downloading of content and looks like this:

select content from articles where article_id between $1 and $2

I noticed that with some IDs it works pretty fast, while with others it is 4-5 times slower. It is worth noting that there are two main 'categories' of IDs in this table: the first is the range 270000000-500000000, and the second is the range 10000000000-100030000000. For the first range it is 'fast', and for the second it is 'slow'. Besides the larger absolute numbers pushing them from int to bigint, values in the second range are more 'sparse': in the first range values are almost consecutive (with very few 'holes' of missing values), while in the second range there are many more 'holes' (average filling is 35%). Total number of rows in the first range: ~62M; in the second range: ~10M.

I conducted several experiments to eliminate the possible influence of the library's code and of network throughput; I omit some of them here. I ended up iterating over the table with EXPLAIN to simulate read load:

explain (analyze, buffers)
select count(*), sum(length(content::text)) from articles where article_id between %s and %s


Sample output:

Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
  Buffers: shared hit=26847 read=3914
  ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
        Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
        Buffers: shared hit=4342 read=671
Planning time: 0.393 ms
Execution time: 6626.136 ms

Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
  Buffers: shared hit=6568 read=7104
  ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
        Index Cond: ((article_id >= '100021000000'::bigint) AND (article_id <= '100021010000'::bigint))
        Buffers: shared hit=50 read=2378
Planning time: 0.517 ms
Execution time: 33219.218 ms


During the iteration, I parse the result of EXPLAIN and collect a series of the following metrics:

- buffer hits/reads for the table,
- buffer hits/reads for the index,
- number of rows (from "Index Scan..."),
- duration of execution.

Based on the metrics above, I calculate derived metrics:

- disk read rate: (index reads + table reads) * 8192 / duration,
- reads ratio: (index reads + table reads) / (index reads + table reads + index hits + table hits),
- data rate: (index reads + table reads + index hits + table hits) * 8192 / duration,
- rows rate: number of rows / duration.
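For reference, the per-chunk computation can be sketched like this (a simplified version with illustrative names, using the plan's top-level buffer totals rather than splitting index and table counters as the real harness does):

```python
import re

# One of the text-format plans from below, as a sample.
PLAN = """\
Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
  Buffers: shared hit=26847 read=3914
  ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
        Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
        Buffers: shared hit=4342 read=671
Planning time: 0.393 ms
Execution time: 6626.136 ms
"""

def metrics(plan_text):
    # The first "Buffers:" line is the top-level (index + table + TOAST) total.
    hits, reads = map(int, re.search(r"shared hit=(\d+) read=(\d+)", plan_text).groups())
    rows = int(re.search(r"Index Scan.*rows=(\d+) loops", plan_text).group(1))
    duration = float(re.search(r"Execution time: ([\d.]+) ms", plan_text).group(1)) / 1000.0
    return {
        "disk_read_rate_mb_s": reads * 8192 / duration / 1e6,
        "reads_ratio": reads / (reads + hits),
        "data_rate_mb_s": (reads + hits) * 8192 / duration / 1e6,
        "rows_rate": rows / duration,
    }

m = metrics(PLAN)  # reads_ratio ~= 0.13, rows_rate ~= 755 rows/s
```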

Since the "density" of IDs differs between the "small" and "big" ranges, I adjusted the chunk size so as to get around 5000 rows on each iteration in both cases, though my experiments show that the chunk size does not really matter a lot.
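As a sketch of that adjustment (the 35% filling figure is the estimate above; the function names are mine, not from the actual test harness):

```python
def chunk_width(target_rows, fill_factor):
    """Id-window width expected to contain about target_rows rows."""
    return max(1, round(target_rows / fill_factor))

def windows(start, stop, target_rows, fill_factor):
    """Split [start, stop] into consecutive id windows of that width."""
    width = chunk_width(target_rows, fill_factor)
    lo = start
    while lo <= stop:
        yield lo, min(lo + width - 1, stop)
        lo += width

# Dense "small" range: ids almost consecutive -> ~5000-id windows.
# Sparse "big" range: ~35% of ids present -> ~14286-id windows.
```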

The issue posted at the very beginning of my message was confirmed for the *whole* first and second ranges (so it was not just caused by randomly cached data).

To eliminate cache influence, I restarted the Postgres server and flushed the OS caches:

$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql start

After this I repeated the test and got nearly the same picture.

"Small" range: the disk read rate is around 10-11 MB/s uniformly across the test. The output rate was 1300-1700 rows/s. The reads ratio is around 13% (why? Shouldn't it be ~100% after drop_caches?).
"Big" range: most of the time the disk read speed was about 2 MB/s, but sometimes it jumped to 26-30 MB/s. The output rate was 70-80 rows/s (though it varied a lot and reached 8000 rows/s). The reads ratio also varied a lot.

I rendered series from the last test into charts:
"Small" range: https://i.stack.imgur.com/3Zfml.png
"Big" range (insane): https://i.stack.imgur.com/VXdID.png

During the tests I verified the disk read speed with iotop and found its readings very close to the ones I calculated from EXPLAIN (ANALYZE, BUFFERS). I cannot say I was monitoring it the whole time, but I confirmed it when it was 2 MB/s and 22 MB/s on the second range and 10 MB/s on the first range. I also checked with htop that CPU was not a bottleneck; it stayed around 3% during the tests.

The issue is reproducible on both the master and slave servers. My tests were conducted on the slave, with no other load on the DBMS and no disk activity on the host unrelated to the DBMS.

My only assumption is that different fragments of data are read at different speeds due to virtualization or something, but... why is it so strictly bound to these ranges? Why is it the same on two different machines?

The file system performance measured by dd:

root@postgresnlpslave:/# echo 3 > /proc/sys/vm/drop_caches 
root@postgresnlpslave:/# dd if=/dev/mapper/postgresnlpslave--vg-root of=/dev/null bs=8K count=128K
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.12304 s, 506 MB/s

Am I missing something? What else can I do to narrow down the cause?

Re: Why could different data in a table be processed with different performance?

From
Justin Pryzby
Date:
On Thu, Sep 20, 2018 at 05:07:21PM -0700, Vladimir Ryabtsev wrote:
> I am experiencing a strange performance problem when accessing JSONB
> content by primary key.

> I noticed that with some IDs it works pretty fast while with other it is
> 4-5 times slower. It is suitable to note, there are two main 'categories'
> of IDs in this table: first is range 270000000-500000000, and second is
> range 10000000000-100030000000. For the first range it is 'fast' and for
> the second it is 'slow'.

Was the data populated differently, too ?
Has the table been reindexed (or pg_repack'ed) since loading (or vacuumed for
that matter) ?
Were the tests run when the DB was otherwise idle?

You can see the index scan itself takes an additional 11sec, the "heap" portion
takes the remaining, additional 14sec (33s-12s-7s).

So it seems to me like the index itself is slow to scan.  *And*, the heap
referenced by the index is slow to scan, probably due to being referenced by
the index less consecutively.

> "Small' range: disk read rate is around 10-11 MB/s uniformly across the
> test. Output rate was 1300-1700 rows/s. Read ratio is around 13% (why?
> Shouldn't it be ~ 100% after drop_caches?).

I guess you mean buffers cache hit ratio: read/hit, which I think should
actually be read/(hit+read).

It's because a given buffer can be requested multiple times.  For example, if
an index page is read which references multiple items on the same heap page,
each heap access is counted separately.  If the index is freshly built, that'd
happen nearly every item.
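The counting described above can be mimicked with a toy model (not PostgreSQL internals; the 8-rows-per-page figure is only an assumption for illustration):

```python
def cold_read_ratio(heap_page_per_row):
    """Mimic EXPLAIN (BUFFERS) counting on a cold cache: the first access
    to a page counts as a 'read', every repeated access as a 'hit'."""
    seen = set()
    reads = hits = 0
    for page in heap_page_per_row:
        if page in seen:
            hits += 1
        else:
            seen.add(page)
            reads += 1
    return reads / (reads + hits)

# 5000 consecutive rows packed ~8 per heap page: each page is read from
# disk once and then hit 7 more times, so the ratio is 1/8 = 12.5%,
# close to the ~13% observed, even with nothing cached beforehand.
dense = [i // 8 for i in range(5000)]

# 5000 rows scattered one per page: every access is a cold read.
sparse = list(range(5000))
```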

Justin

> Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
>   Buffers: shared hit=26847 read=3914
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
>         Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
>         Buffers: shared hit=4342 read=671

> Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
>   Buffers: shared hit=6568 read=7104
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
>         Index Cond: ((article_id >= '100021000000'::bigint) AND (article_id <= '100021010000'::bigint))
>         Buffers: shared hit=50 read=2378


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> Was the data populated differently, too ?
Here is how new records have been coming in over the last two months, by day: https://i.stack.imgur.com/zp9WP.png Within a day, records come in evenly (in both ranges), slightly faster during European and American working hours.

Since Jul 1, 2018, when we started populating with online records, the trend was approximately the same as before until Aug 04, 2018 (see picture). Then it changed for the "big" range; we are now in a transition period until it stabilizes.

We also bulk-imported historical data from another system. The first part was the range with the big numbers, added in a couple of days; the second part was the range with the small numbers, and it took around a week. Online records kept coming in uninterrupted during the import.

Rows are updated rarely and almost never deleted.

Here is the distribution of the JSONB field's length (converted to ::text) over the last 5 days:
<10KB: 665066
10-20KB: 225697
20-30KB: 25640
30-40KB: 6678
40-50KB: 2100
50-60KB: 1028
Other (max 2.7MB): 2248 (only single exemplars larger than 250KB)

> Has the table been reindexed (or pg_repack'ed) since loading (or vacuumed for that matter) ?
Not sure what you mean... We created indexes on some fields (appended_at, published_at, source_id).
When I came across the problem, I noticed that the table was not being vacuumed. I then ran VACUUM ANALYZE manually, but it did not change anything about the issue.

> Were the tests run when the DB was otherwise idle?
Yes, as I said, my tests were performed on the slave; there were no other users connected (only me, monitoring sessions from pgAdmin), and I never noticed any significant I/O from processes other than postgres (only a light load from replication).

> You can see the index scan itself takes an additional 11sec, the "heap" portion takes the remaining, additional 14sec (33s-12s-7s).
Sorry, I see 33 s total and 12 s for index, where do you see 7 s?

> I guess you mean buffers cache hit ratio: read/hit, which I think should actually be read/(hit+read).
I will quote myself:
> reads ratio: (index reads + table reads) / (index reads + table reads + index hits + table hits)
So yes, you are right, it is.

+ Some extra info about my system from QA recommendations:

OS version: Ubuntu 16.04.2 LTS / xenial

~$ time dd if=/dev/mapper/postgresnlpslave--vg-root of=/dev/null bs=1M count=32K skip=$((128*$RANDOM/32))
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 62.1574 s, 553 MB/s
0.05user 23.13system 1:02.15elapsed 37%CPU (0avgtext+0avgdata 3004maxresident)k
67099496inputs+0outputs (0major+335minor)pagefaults 0swaps


DBMS is accessed directly (no pgpool, pgbouncer, etc).

RAM: 58972 MB

On physical device level RAID10 is used.

Table metadata: (relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid)) = (articles, 7824944, 6.74338e+07, 7635864, 10, false, 454570926080)

Regards,
Vlad


Re: Why could different data in a table be processed with different performance?

From
Justin Pryzby
Date:
Sorry, dropped -performance.

>>>> Has the table been reindexed (or pg_repack'ed) since loading (or vacuumed
>>>> for that matter) ?
>>> Not sure what you mean... We created indexes on some fields (on
>> I mean REINDEX INDEX articles_pkey;
>> Or (from "contrib"): /usr/pgsql-10/bin/pg_repack -i articles_pkey
>I never did it... Do you recommend to try it? Which variant is preferable?

REINDEX is likely to block access to the table [0], and pg_repack is "online"
(except for briefly acquiring an exclusive lock).

[0] https://www.postgresql.org/docs/10/static/sql-reindex.html

>>>> You can see the index scan itself takes an additional 11sec, the "heap"
>>>> portion takes the remaining, additional 14sec (33s-12s-7s).
>>> Sorry, I see 33 s total and 12 s for index, where do you see 7 s?
>> 6625 ms (for short query).
>> So the heap component of the long query is 14512 ms slower.
> Yes, I see, thanks.
> So reindex can help only with index component? What should I do for heap?
> May be reindex the corresponding toast table?

I think reindex will improve the heap access... and maybe the index access too.
I don't see why it would be bloated without UPDATE/DELETE, but you could check
to see if its size changes significantly after reindex.

Justin


Re: Why could different data in a table be processed with different performance?

From
Laurenz Albe
Date:
Vladimir Ryabtsev wrote:
> explain (analyze, buffers)
> select count(*), sum(length(content::text)) from articles where article_id between %s and %s
> 
> Sample output:
> 
> Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
>   Buffers: shared hit=26847 read=3914
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
>         Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
>         Buffers: shared hit=4342 read=671
> Planning time: 0.393 ms
> Execution time: 6626.136 ms
> 
> Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
>   Buffers: shared hit=6568 read=7104
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
>         Index Cond: ((article_id >= '100021000000'::bigint) AND (article_id <= '100021010000'::bigint))
>         Buffers: shared hit=50 read=2378
> Planning time: 0.517 ms
> Execution time: 33219.218 ms
> 
> During iteration, I parse the result of EXPLAIN and collect series of following metrics:
> 
> - buffer hits/reads for the table,
> - buffer hits/reads for the index,
> - number of rows (from "Index Scan..."),
> - duration of execution.
> 
> Based on metrics above I calculate inherited metrics:
> 
> - disk read rate: (index reads + table reads) * 8192 / duration,
> - reads ratio: (index reads + table reads) / (index reads + table reads + index hits + table hits),
> - data rate: (index reads + table reads + index hits + table hits) * 8192 / duration,
> - rows rate: number of rows / duration.
> 
> Since "density" of IDs is different in "small" and "big" ranges, I adjusted
> size of chunks in order to get around 5000 rows on each iteration in both cases,
> though my experiments show that chunk size does not really matter a lot.
> 
> The issue posted at the very beginning of my message was confirmed for the
> *whole* first and second ranges (so it was not just caused by randomly cached data).
> 
> To eliminate cache influence, I restarted Postgres server with flushing buffers:
> 
> /$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql start
> 
> After this I repeated the test and got next-to-same picture.
> 
> "Small' range: disk read rate is around 10-11 MB/s uniformly across the test.
> Output rate was 1300-1700 rows/s. Read ratio is around 13% (why? Shouldn't it be
> ~ 100% after drop_caches?).
> "Big" range: In most of time disk read speed was about 2 MB/s but sometimes
> it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but varied a lot and
> reached 8000 rows/s). Read ratio also varied a lot.
> 
> I rendered series from the last test into charts:
> "Small" range: https://i.stack.imgur.com/3Zfml.png
> "Big" range (insane): https://i.stack.imgur.com/VXdID.png
> 
> During the tests I verified disk read speed with iotop and found its indications
> very close to ones calculated by me based on EXPLAIN BUFFERS. I cannot say I was
> monitoring it all the time, but I confirmed it when it was 2 MB/s and 22 MB/s on
> the second range and 10 MB/s on the first range. I also checked with htop that
> CPU was not a bottleneck and was around 3% during the tests.
> 
> The issue is reproducible on both master and slave servers. My tests were conducted
> on slave, while there were no any other load on DBMS, or disk activity on the
> host unrelated to DBMS.
> 
> My only assumption is that different fragments of data are being read with different
> speed due to virtualization or something, but... why is it so strictly bound
> to these ranges? Why is it the same on two different machines?

What is the storage system?

Setting "track_io_timing = on" should measure the time spent doing I/O
more accurately.

One problem with measuring read speed that way is that "buffers read" can
mean "buffers read from storage" or "buffers read from the file system cache",
but you say you observe a difference even after dropping the cache.

To verify if the difference comes from the physical placement, you could
run VACUUM (FULL) which rewrites the table and see if that changes the behavior.

Another idea is that the operating system rearranges I/O in a way that
is not ideal for your storage.

Try a different I/O scheduler by running

echo deadline > /sys/block/sda/queue/scheduler

(replace "sda" with the disk where your database resides)

See if that changes the observed I/O speed.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> Setting "track_io_timing = on" should measure the time spent doing I/O more accurately.
I see I/O timings after this. They show that 96.5% of the long queries' time is spent on I/O. If I subtract the I/O time from the total, I get ~1.4 s for 5000 rows, which is the SAME for both ranges if I adjust the segment borders accordingly (to match ~5000 rows). Only the I/O time differs, and it differs significantly.

> One problem with measuring read speed that way is that "buffers read" can mean "buffers read from storage" or "buffers read from the file system cache",
I understand, that's why I conducted experiments with drop_caches.

> but you say you observe a difference even after dropping the cache.
No, I am saying I see NO significant difference (to within measurement error) between "with caches" and after dropping caches. And this is explainable, I think. Since I read almost all the data from this huge table sequentially, no cache can fit the data, so it cannot significantly influence the results. And while the PK index *could* be cached (in theory), I think its data is being displaced from the buffers by the bulkier JSONB data.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:
Hi Vladimir,



On 09/21/2018 02:07 AM, Vladimir Ryabtsev wrote:

> 
> I have such a table:
> 
> CREATE TABLE articles
> (
>     article_id bigint NOT NULL,
>     content jsonb NOT NULL,
>     published_at timestamp without time zone NOT NULL,
>     appended_at timestamp without time zone NOT NULL,
>     source_id integer NOT NULL,
>     language character varying(2) NOT NULL,
>     title text NOT NULL,
>     topics topic[] NOT NULL,
>     objects object[] NOT NULL,
>     cluster_id bigint NOT NULL,
>     CONSTRAINT articles_pkey PRIMARY KEY (article_id)
> )
> 
> select content from articles where article_id between $1 and $2
> 
> I noticed that with some IDs it works pretty fast while with other it is
> 4-5 times slower. It is suitable to note, there are two main
> 'categories' of IDs in this table: first is range 270000000-500000000,
> and second is range 10000000000-100030000000. For the first range it is
> 'fast' and for the second it is 'slow'. Besides larger absolute numbers
> withdrawing them from int to bigint, values in the second range are more
> 'sparse', which means in the first range values are almost consequent
> (with very few 'holes' of missing values) while in the second range
> there are much more 'holes' (average filling is 35%). Total number of
> rows in the first range: ~62M, in the second range: ~10M.
> 
> 
> explain (analyze, buffers)
> select count(*), sum(length(content::text)) from articles where
> article_id between %s and %s
> 

is the length of the text equally distributed over the 2 partitions?

> Sample output:
> 
> Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual
> time=6625.993..6625.995 rows=1 loops=1)
>   Buffers: shared hit=26847 read=3914
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35
> rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
>         Index Cond: ((article_id >= 438000000) AND (article_id <=
> 438005000))
>         Buffers: shared hit=4342 read=671
> Planning time: 0.393 ms
> Execution time: 6626.136 ms
> 
> Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual
> time=33219.100..33219.102 rows=1 loops=1)
>   Buffers: shared hit=6568 read=7104
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96
> rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
>         Index Cond: ((article_id >= '100021000000'::bigint) AND
> (article_id <= '100021010000'::bigint))
>         Buffers: shared hit=50 read=2378
> Planning time: 0.517 ms
> Execution time: 33219.218 ms
> 

> 
> Since "density" of IDs is different in "small" and "big" ranges, I
> adjusted size of chunks in order to get around 5000 rows on each
> iteration in both cases, though my experiments show that chunk size does
> not really matter a lot.
> 

From what you posted, the first query retrieves 5005 rows, but the
second 2416. It might be helpful if we are able to compare 5000 vs 5000

Also it is worth noticing that the 'estimated' row count differs from the 'actual' one in
the second query.
I think that happens because the data is distributed differently over the
ranges.
Probably the analyzer does not have enough samples to capture the
real distribution. You might try to increase the number of samples (and
run analyze), or create partial indexes on the 2 ranges.
Can you give both options a try and let us know?

> The issue posted at the very beginning of my message was confirmed for
> the *whole* first and second ranges (so it was not just caused by
> randomly cached data).
> 
> To eliminate cache influence, I restarted Postgres server with flushing
> buffers:
> 
> /$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql
> start
> 

I would do a sync at the end, after dropping caches.

But the problem here is that you are virtualizing. I think that you
might want to consider the physical layer. Eg:

- does the raid controller have a cache?

- how big is the cache? (when you measure disk speed, that will
influence the result very much if you do not run the test on a
big-enough chunk of data) best if it is disabled during your tests

- is the OS caching disk blocks too? maybe you want to drop everything
from there too.

I think that you should be pragmatic and try to run the tests on a
physical machine. If the results are reproducible there too, then you
can exclude the whole virtualization layer.

> After this I repeated the test and got next-to-same picture.
> 
> "Small' range: disk read rate is around 10-11 MB/s uniformly across the
> test. Output rate was 1300-1700 rows/s. Read ratio is around 13% (why?
> Shouldn't it be ~ 100% after drop_caches?).
> "Big" range: In most of time disk read speed was about 2 MB/s but
> sometimes it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but
> varied a lot and reached 8000 rows/s). Read ratio also varied a lot.
> 
> I rendered series from the last test into charts:
> "Small" range: https://i.stack.imgur.com/3Zfml.png
> "Big" range (insane): https://i.stack.imgur.com/VXdID.png
> 
> During the tests I verified disk read speed with iotop

on the VM or on the physical host?


regards,

fabio pardi


Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:

On 09/21/2018 08:28 AM, Vladimir Ryabtsev wrote:

>> but you say you observe a difference even after dropping the cache.
> No, I say I see NO significant difference (accurate to measurement
> error) between "with caches" and after dropping caches. And this is
> explainable, I think. Since I read consequently almost all data from the
> huge table, no cache can fit this data, thus it cannot influence
> significantly on results. And whilst the PK index *could* be cached (in
> theory) I think its data is being displaced from buffers by bulkier
> JSONB data.
> 
> Vlad

I think this is not accurate. If you fetch from an index, then only the
blocks containing the matching records are read from disk and therefore
cached in RAM.

regards,

fabio pardi


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> is the length of the text equally distributed over the 2 partitions?
Not 100% equally, but to me it does not seem to be a big deal... Considering the ranges independently:
First range: ~70% < 10 KB, ~25% for 10-20 KB, ~3% for 20-30 KB, everything else is less than 1% (with 10 KB steps).
Second range: ~80% < 10 KB, ~18% for 10-20 KB, ~2% for 20-30 KB, everything else is less than 1% (with 10 KB steps).

> From what you posted, the first query retrieves 5005 rows, but the second 2416. It might be helpful if we are able to compare 5000 vs 5000
Yes, that was just an example; here are the plans for approximately the same number of rows:

Aggregate  (cost=9210.12..9210.13 rows=1 width=16) (actual time=4265.478..4265.479 rows=1 loops=1)
  Buffers: shared hit=27027 read=4311
  I/O Timings: read=2738.728
  ->  Index Scan using articles_pkey on articles  (cost=0.57..9143.40 rows=5338 width=107) (actual time=12.254..873.081 rows=5001 loops=1)
        Index Cond: ((article_id >= 438030000) AND (article_id <= 438035000))
        Buffers: shared hit=4282 read=710
        I/O Timings: read=852.547
Planning time: 0.235 ms
Execution time: 4265.554 ms

Aggregate  (cost=11794.59..11794.60 rows=1 width=16) (actual time=62298.559..62298.559 rows=1 loops=1)
  Buffers: shared hit=15071 read=14847
  I/O Timings: read=60703.859
  ->  Index Scan using articles_pkey on articles  (cost=0.57..11709.13 rows=6837 width=107) (actual time=24.686..24582.221 rows=5417 loops=1)
        Index Cond: ((article_id >= '100021040000'::bigint) AND (article_id <= '100021060000'::bigint))
        Buffers: shared hit=195 read=5244
        I/O Timings: read=24507.621
Planning time: 0.494 ms
Execution time: 62298.630 ms

If we subtract I/O from the total time, we get 1527 ms vs 1596 ms: very close timings for operations other than I/O (considering the slightly higher number of rows in the second case). But the I/O time differs dramatically.
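Plugging the top-level numbers from the two plans above (which include TOAST access) into a quick check confirms this; the results may differ from the figures quoted by a millisecond of rounding:

```python
# Top-level timing and buffer figures restated from the two plans.
fast = {"execution_ms": 4265.554, "io_ms": 2738.728, "reads": 4311}
slow = {"execution_ms": 62298.630, "io_ms": 60703.859, "reads": 14847}

def breakdown(q):
    return {
        "non_io_ms": q["execution_ms"] - q["io_ms"],   # CPU + everything else
        "ms_per_read": q["io_ms"] / q["reads"],        # average latency per 8K read
    }

fast_b = breakdown(fast)   # non-I/O ~1527 ms, ~0.6 ms per read
slow_b = breakdown(slow)   # non-I/O ~1595 ms, ~4.1 ms per read
```

The slow range pays roughly 4 ms per 8K read versus ~0.6 ms for the fast one, i.e. random-I/O-like latency rather than mostly sequential reads.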

> Also is worth noticing that the 'estimated' differs from 'actual' on the second query. I think that happens because data is differently distributed over the ranges. Probably the analyzer does not have enough samples to understand the real distribution.
I think we should not worry about it unless the planner chooses a poor plan, should we? Statistics affect picking a proper plan, but not the execution of the plan, don't they?

> You might try to increase the number of samples (and run analyze)
To be honest, I don't understand it... As far as I know, in Postgres we have two options: set the column statistics target, and set n_distinct. We can't increase the fraction of rows analyzed (unlike some other DBMSs, where the ANALYZE percentage can be set explicitly). Moreover, in our case the problem column is the PRIMARY KEY, with all values distinct. Could you point me to what exactly I should do?

> or to create partial indexes on the 2 ranges.
Sure, I will try partial indexes. Should I drop the existing PK index, or is it enough to ensure that the planner picks the range index?

> i would do a sync at the end, after dropping caches.
A bit off-topic, but why? Doing a sync may put something into the cache again.

> - does the raid controller have a cache?
> - how big is the cache? (when you measure disk speed, that will influence the result very much, if  you do not run the test on big-enough data chunk) best if is disabled during your tests 
I am pretty sure there is some; it's usually several tens of megabytes. But I ran the disk read tests several times with chunks that could not fit in the cache, and with random offsets, so I am pretty sure that the figure of around 500 MB/s is reasonably accurate (but only for sequential reads).

> - is the OS caching disk blocks too? maybe you want to drop everything from there too.
How can I find that out? And how do I drop it? Or do you mean the hypervisor's OS?
Anyway, don't you think that caching specifics can hardly explain these issues?

> I think that you should be pragmatic and try to run the tests on a physical machine.
I wish I could, but that is hardly possible. At some point we may migrate the DB to physical hosts, but for now we need to make it work in a virtual environment.

> on the VM or on the physical host?
On the VM. The physical host is Windows (no iotop) and I have no access to it.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> I think reindex will improve the heap access..and maybe the index access too. I don't see why it would be bloated without UPDATE/DELETE, but you could check to see if its size changes significantly after reindex.
I tried REINDEX, and size of PK index changed from 2579 to 1548 MB.
But tests don't show any significant improvement over what it was. Maybe read speed for the "big" range became just slightly faster on average.

Vlad

Hi,
Assuming DB is quiescent.

And if you run?

select count(*) from articles where article_id between %s and %s

i.e. without reading the JSON, is your buffer hit count increasing?
20,000 8 KB blocks × 2 is ~320 MB; it should be in RAM after the first run.

Fast:
read=710, I/O Timings: read=852.547 ==> ~1.2 ms/IO
~800 IO/s: some cache hits, sequential reads, or a good RAID layout.

Slow:
read=5244, I/O Timings: read=24507.621 ==> ~4.7 ms/IO
~200 IO/s: more HDD reads? more seeks? slower HDD zones?
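These per-I/O figures can be recomputed directly from the EXPLAIN counters (with track_io_timing on, "read=N" counts 8 KB buffers read from below Postgres and "I/O Timings: read" is the total read time in milliseconds). A small sketch using the numbers quoted above:

```python
# Estimate average I/O latency and IOPS from EXPLAIN (ANALYZE, BUFFERS)
# output: total read time divided by the number of blocks read.

def io_stats(blocks_read: int, read_time_ms: float) -> tuple[float, float]:
    ms_per_io = read_time_ms / blocks_read
    iops = 1000.0 / ms_per_io
    return ms_per_io, iops

fast = io_stats(710, 852.547)     # "fast" range plan
slow = io_stats(5244, 24507.621)  # "slow" range plan

print(f"fast: {fast[0]:.1f} ms/IO, {fast[1]:.0f} IO/s")
print(f"slow: {slow[0]:.1f} ms/IO, {slow[1]:.0f} IO/s")
```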

Maybe you can play with PG cache size.


On Sat, Sep 22, 2018 at 12:32 PM Vladimir Ryabtsev <greatvovan@gmail.com> wrote:
> I think reindex will improve the heap access..and maybe the index access too. I don't see why it would be bloated without UPDATE/DELETE, but you could check to see if its size changes significantly after reindex.
I tried REINDEX, and size of PK index changed from 2579 to 1548 MB.
But tests don't show any significant improvement over what it was. Maybe read speed for the "big" range became just slightly faster on average.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> Another idea is that the operating system rearranges I/O in a way that is not ideal for your storage.
> Try a different I/O scheduler by running
> echo deadline > /sys/block/sda/queue/scheduler

My scheduler was already "deadline".
In some places I read that in a virtual environment the "noop" scheduler is sometimes better, so I tried it. However, the experiment showed NO noticeable difference between them ("deadline": https://i.stack.imgur.com/wCOJW.png, "noop": https://i.stack.imgur.com/lB33u.png). At the same time, the tests show almost identical patterns of read-speed change when going over the "slow" range.

Vlad

чт, 20 сент. 2018 г. в 20:17, Laurenz Albe <laurenz.albe@cybertec.at>:
Vladimir Ryabtsev wrote:
> explain (analyze, buffers)
> select count(*), sum(length(content::text)) from articles where article_id between %s and %s
>
> Sample output:
>
> Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
>   Buffers: shared hit=26847 read=3914
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
>         Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
>         Buffers: shared hit=4342 read=671
> Planning time: 0.393 ms
> Execution time: 6626.136 ms
>
> Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
>   Buffers: shared hit=6568 read=7104
>   ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
>         Index Cond: ((article_id >= '100021000000'::bigint) AND (article_id <= '100021010000'::bigint))
>         Buffers: shared hit=50 read=2378
> Planning time: 0.517 ms
> Execution time: 33219.218 ms
>
> During iteration, I parse the result of EXPLAIN and collect series of following metrics:
>
> - buffer hits/reads for the table,
> - buffer hits/reads for the index,
> - number of rows (from "Index Scan..."),
> - duration of execution.
>
> Based on metrics above I calculate inherited metrics:
>
> - disk read rate: (index reads + table reads) * 8192 / duration,
> - reads ratio: (index reads + table reads) / (index reads + table reads + index hits + table hits),
> - data rate: (index reads + table reads + index hits + table hits) * 8192 / duration,
> - rows rate: number of rows / duration.
>
> Since "density" of IDs is different in "small" and "big" ranges, I adjusted
> size of chunks in order to get around 5000 rows on each iteration in both cases,
> though my experiments show that chunk size does not really matter a lot.
>
> The issue posted at the very beginning of my message was confirmed for the
> *whole* first and second ranges (so it was not just caused by randomly cached data).
>
> To eliminate cache influence, I restarted Postgres server with flushing buffers:
>
> /$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql start
>
> After this I repeated the test and got next-to-same picture.
>
> "Small" range: disk read rate is around 10-11 MB/s uniformly across the test.
> Output rate was 1300-1700 rows/s. Read ratio is around 13% (why? Shouldn't it be
> ~ 100% after drop_caches?).
> "Big" range: In most of time disk read speed was about 2 MB/s but sometimes
> it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but varied a lot and
> reached 8000 rows/s). Read ratio also varied a lot.
>
> I rendered series from the last test into charts:
> "Small" range: https://i.stack.imgur.com/3Zfml.png
> "Big" range (insane): https://i.stack.imgur.com/VXdID.png
>
> During the tests I verified disk read speed with iotop and found its indications
> very close to ones calculated by me based on EXPLAIN BUFFERS. I cannot say I was
> monitoring it all the time, but I confirmed it when it was 2 MB/s and 22 MB/s on
> the second range and 10 MB/s on the first range. I also checked with htop that
> CPU was not a bottleneck and was around 3% during the tests.
>
> The issue is reproducible on both master and slave servers. My tests were conducted
> on slave, while there were no any other load on DBMS, or disk activity on the
> host unrelated to DBMS.
>
> My only assumption is that different fragments of data are being read with different
> speed due to virtualization or something, but... why is it so strictly bound
> to these ranges? Why is it the same on two different machines?

What is the storage system?

Setting "track_io_timing = on" should measure the time spent doing I/O
more accurately.

One problem with measuring read speed that way is that "buffers read" can
mean "buffers read from storage" or "buffers read from the file system cache",
but you say you observe a difference even after dropping the cache.

To verify if the difference comes from the physical placement, you could
run VACUUM (FULL) which rewrites the table and see if that changes the behavior.

Another idea is that the operating system rearranges I/O in a way that
is not ideal for your storage.

Try a different I/O scheduler by running

echo deadline > /sys/block/sda/queue/scheduler

(replace "sda" with the disk where your database resides)

See if that changes the observed I/O speed.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:
Hi,

answers (and questions) in line here below

On 22/09/18 11:19, Vladimir Ryabtsev wrote:
> is the length of the text equally distributed over the 2 partitions?
Not 100% equally, but to me it does not seem to be a big deal... Considering the ranges independently:
First range: ~70% < 10 KB, ~25% for 10-20 KB, ~3% for 20-30 KB, everything else is less than 1% (with 10 KB steps).
Second range: ~80% < 10 KB, ~18% for 10-20 KB, ~2% for 20-30 KB, everything else is less than 1% (with 10 KB steps).

agree, should not play a role here

> From what you posted, the first query retrieves 5005 rows, but the second 2416. It might be helpful if we are able to compare 5000 vs 5000
Yes it was just an example, here are the plans for approximately same number of rows:

Aggregate  (cost=9210.12..9210.13 rows=1 width=16) (actual time=4265.478..4265.479 rows=1 loops=1)
  Buffers: shared hit=27027 read=4311
  I/O Timings: read=2738.728
  ->  Index Scan using articles_pkey on articles  (cost=0.57..9143.40 rows=5338 width=107) (actual time=12.254..873.081 rows=5001 loops=1)
        Index Cond: ((article_id >= 438030000) AND (article_id <= 438035000))
        Buffers: shared hit=4282 read=710
        I/O Timings: read=852.547
Planning time: 0.235 ms
Execution time: 4265.554 ms

Aggregate  (cost=11794.59..11794.60 rows=1 width=16) (actual time=62298.559..62298.559 rows=1 loops=1)
  Buffers: shared hit=15071 read=14847
  I/O Timings: read=60703.859
  ->  Index Scan using articles_pkey on articles  (cost=0.57..11709.13 rows=6837 width=107) (actual time=24.686..24582.221 rows=5417 loops=1)
        Index Cond: ((article_id >= '100021040000'::bigint) AND (article_id <= '100021060000'::bigint))
        Buffers: shared hit=195 read=5244
        I/O Timings: read=24507.621
Planning time: 0.494 ms
Execution time: 62298.630 ms

If we subtract the I/O time from the total time, we get 1527 ms vs ~1595 ms: very close timings for everything other than I/O (considering the slightly higher number of rows in the second case). But the I/O time differs dramatically.
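For reference, the arithmetic, taking "Execution time" and the top-level "I/O Timings: read" from the two plans above:

```python
# Non-I/O time = total execution time minus time spent waiting on reads.
fast_total_ms, fast_io_ms = 4265.554, 2738.728
slow_total_ms, slow_io_ms = 62298.630, 60703.859

fast_cpu = fast_total_ms - fast_io_ms
slow_cpu = slow_total_ms - slow_io_ms

print(round(fast_cpu), round(slow_cpu))  # 1527 1595
```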

> Also is worth noticing that the 'estimated' differs from 'actual' on the second query. I think that happens because data is differently distributed over the ranges. Probably the analyzer does not have enough samples to understand the real distribution.
I think we should not worry about it unless the planner chooses a poor plan, should we? Statistics affect picking a proper plan, but not the execution of the plan, don't they?


Agree, it was pure speculation


> or to create partial indexes on the 2 ranges.
Sure, I will try it with partial indexes. Should I drop the existing PK index, or is ensuring that the planner picks the range index enough?

You cannot drop it, since it is the index backing the PKEY.

You can create 2 partial indexes and the planner will pick them up for you (and the planning time will go up a bit).
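A sketch of what this could look like; the range boundaries come from earlier in the thread, and the index names are made up:

```sql
-- Partial indexes covering the two ID ranges; the PK index stays in place.
CREATE INDEX articles_id_small_idx ON articles (article_id)
    WHERE article_id BETWEEN 270000000 AND 500000000;

CREATE INDEX articles_id_big_idx ON articles (article_id)
    WHERE article_id BETWEEN 10000000000 AND 100030000000;
```

A query whose WHERE clause is contained in one of these predicates can then use the corresponding, much smaller, index.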


> - does the raid controller have a cache?
> - how big is the cache? (when you measure disk speed, that will influence the result very much, if  you do not run the test on big-enough data chunk) best if is disabled during your tests 
I am pretty sure there is some; usually it's several tens of megabytes. But I ran the disk read tests several times with chunks that could not fit in the cache and with random offsets, so I am pretty sure the figure of around 500 MB/s is reasonably accurate (though it is only for sequential reads).


it is not unusual to have 1GB cache or more...  and do not forget to drop the cache between tests + do a sync


> - is the OS caching disk blocks too? maybe you want to drop everything from there too.
How can I find that out? And how do I drop it? Or do you mean the hypervisor OS?
Anyway, don't you think that caching specifics alone could hardly explain these issues?

Sorry I meant the hypervisor OS.

Given that most of the time is spent on I/O, caching may be playing a role.

I tried to reproduce your problem but I cannot get anywhere close to your results. Everything goes smoothly with or without shared buffers, or OS cache.

A few questions and considerations came to mind:

- how big is your index?

- how big is the table?

- given the size of shared_buffers, almost 2M blocks should fit, but you say 2 consecutive runs are still hitting the disk. That's strange indeed, since you are using way more than 2M blocks.
Did you check whether there are perhaps other processes or cronjobs (in Postgres or on the system) that may be reading data and flushing out the cache?

You can make use of pg_buffercache in order to see what is actually cached. That might help to have an overview of the content of it.

- As Laurenz suggested (VACUUM FULL), you might want to move data around. You can try also a dump + restore to narrow the problem to data or disk

- You might also want to try to see the disk graph of Windows, while you are running your tests. It can show you if data (and good to know how much) is actually fetching from disk or not.

regards,

fabio pardi

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> You can create 2 partial indexes and the planner will pick it up for you. (and the planning time will go a bit up).
Created two partial indexes and ensured planner uses it. But the result is still the same, no noticeable difference.

> it is not unusual to have 1GB cache or more...  and do not forget to drop the cache between tests + do a sync
I conducted several long runs of dd, so I am sure these numbers are fairly accurate. However, what worries me is that I tested sequential read speed, while during my experiments Postgres might need to read from random places, reducing the real read speed dramatically. I have a feeling that this may be the reason.
I also reviewed the import scripts and found the import was done in DESCENDING order of IDs. It was done that way to get the most recent records sooner; maybe it caused some inefficiency in the storage... But again, it was the same for both ranges.

> - how big is your index?
pg_table_size('articles_pkey') = 1561 MB

> - how big is the table?
pg_table_size('articles') = 427 GB
pg_table_size('pg_toast.pg_toast_221558') = 359 GB

> - given the size of shared_buffers, almost 2M blocks should fit, but you say 2 consecutive runs still are hitting the disk. That's strange indeed since you are using way more than 2M blocks.
TBH, I cannot say I understand your calculations with the number of blocks... But to clarify: consecutive runs with the SAME parameters do NOT hit the disk; only the first one does, subsequent ones read only from the buffer cache.

> Did you check that perhaps are there any other processes or cronjobs (on postgres and on the system) that are maybe reading data and flushing out the cache?
I checked with iotop that nothing else reads intensively from any disk in the system. And again, the result is 100% reproducible and depends on the ID range only; if there were anything like that, I would have noticed some fluctuation in the results.

> You can make use of pg_buffercache in order to see what is actually cached.
It seems there is no such view in my DB; could it be that the module is not installed?

> - As Laurenz suggested (VACUUM FULL), you might want to move data around. You can try also a dump + restore to narrow the problem to data or disk
I launched VACUUM FULL, but it ran very slowly; according to my calculations it might take 17 hours. I will try to copy the data into another table with the same structure, or spin up another server, and let you know.

> - You might also want to try to see the disk graph of Windows, while you are running your tests. It can show you if data (and good to know how much) is actually fetching from disk or not.
I wanted to do so but I don't have access to Hyper-V server, will try to request credentials from admins.

Couple more observations:
1) The result of my experiment is almost unaffected by other server load. Another user was running a query (over this table) reading at ~130 MB/s, while my query was reading at 1.8-2 MB/s.
2) iotop shows a higher IO % (~93-94%) with the slower read speed (though it is not quite clear what this field means). The process from the example above had ~55% IO at 130 MB/s, while my process had ~93% at ~2 MB/s.

Regards,
Vlad

Re: Why could different data in a table be processed with different performance?

From
Justin Pryzby
Date:
On Mon, Sep 24, 2018 at 03:28:15PM -0700, Vladimir Ryabtsev wrote:
> > it is not unusual to have 1GB cache or more...  and do not forget to drop
> the cache between tests + do a sync
> I also reviewed import scripts and found the import was done in DESCENDING
> order of IDs.

This seems significant..it means the heap was probably written in backwards
order relative to the IDs, and the OS readahead is ineffective when index
scanning across a range of IDs.  From memory, linux since (at least) 2.6.32 can
optimize this.  You mentioned you're using 4.4.  Does your LVM have readahead
ramped up ?  Try lvchange -r 65536 data/postgres (or similar).
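For concreteness, checking and raising readahead might look like the following (the LV name data/postgres comes from the example above; the device-mapper path is an assumption based on the VG/LV names mentioned in the thread):

```shell
# Readahead values are in 512-byte sectors; 65536 sectors = 32 MB.
blockdev --getra /dev/mapper/data-postgres   # show current readahead
lvchange -r 65536 data/postgres              # set it persistently in LVM metadata
blockdev --getra /dev/mapper/data-postgres   # verify the new value
```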

Also..these might be an impractical solution for several reasons, but did you
try either 1) forcing a bitmap scan (of only one index), to force the heap
reads to be ordered, if not sequential?  SET enable_indexscan=off (and maybe
SET enable_seqscan=off and others as needed).

Or, 2) Using a brin index (scanning of which always results in bitmap heap
scan).

> > - how big is the table?
> pg_table_size('articles') = 427 GB
> pg_table_size('pg_toast.pg_toast_221558') = 359 GB

Ouch .. if it were me, I would definitely want to make that a partitioned table..
Or perhaps two unioned together with a view? One each for the sparse and dense
range?

> > You can make use of pg_buffercache in order to see what is actually
> cached.
> It seems that there is no such a view in my DB, could it be that the module
> is not installed?

Right, it's in the postgresql -contrib package.
And you have to "CREATE EXTENSION pg_buffercache".
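Once the extension is installed, a query along the lines of the example in the pg_buffercache documentation shows which relations occupy the cache:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top 10 relations by number of cached 8 kB buffers
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```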

> > - As Laurenz suggested (VACUUM FULL), you might want to move data around.
> You can try also a dump + restore to narrow the problem to data or disk
> I launched VACUUM FULL, but it ran very slowly, according to my calculation
> it might take 17 hours. I will try to do copy data into another table with
> the same structure or spin up another server, and let you know.

I *suspect* VACUUM FULL won't help, since (AIUI) it copies all "visible" tuples
from the source table into a new table (and updates indices as necessary).  It
can resolve bloat due to historic DELETEs, but since I think your table was
written in reverse order of pkey, I think it'll also copy it in reverse order.
CLUSTER will fix that.  You can use pg_repack to do so online...but it's going
to be painful for a table+toast 1TiB in size: it'll take all day, and also
require an additional 1TB while running (same as VAC FULL).

Justin


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> And if you run?
> select count(*) from articles where article_id between %s and %s
> ie without reading json, is your buffers hit count increasing?
Tried this. This is somewhat interesting, too... Even an index-only scan is faster for the "fast" range. The results in it are consistently fast, with small and constant numbers of hits and reads. The big range, on the contrary, shows a huge number of hits (why? how does it manage to do the same with fewer block accesses in the "fast" range?) and the duration "jumps", with higher values on average.
"Fast": https://i.stack.imgur.com/63I9k.png
"Slow": https://i.stack.imgur.com/QzI3N.png
Note that results on the charts are averaged over 1M IDs, but particular values in the "slow" range reached 4 s, while the maximum execution time for the "fast" range was only 0.3 s.

Regards,
Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> This seems significant.. it means the heap was probably written in backwards order relative to the IDs, and the OS readahead is ineffective when index scanning across a range of IDs.
But again, why is it different for one range and another? It was reversed for both ranges.

> I would definitely want to make that a partitioned table
Yes, I believe it will be partitioned in the future.

> I *suspect* VACUUM FULL won't help, since (AIUI) it copies all "visible" tuples from the source table into a new table (and updates indices as necessary).  It can resolve bloat due to historic DELETEs, but since I think your table was written in reverse order of pkey, I think it'll also copy it in reverse order.
I am going to copy the slow range into a table nearby and see if the problem reproduces (I hope "INSERT INTO t2 SELECT * FROM t1 WHERE ..." will keep the existing order of rows). Then I could try the same after CLUSTER.

Regards,
Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> did you try either 1) forcing a bitmap scan (of only one index), to force the heap reads to be ordered, if not sequential?  SET enable_indexscan=off (and maybe SET enable_seqscan=off and others as needed).
Disabling index scan made it a bitmap scan.
It is surprising, but this increased read speed in both ranges.
It became two times faster for the "fast" range and three times faster for the "slow" range (for certain segments of data I checked on; the whole experiment takes a while, though).
But there is still a difference between the ranges; it is now ~20 MB/s vs ~6 MB/s.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Justin Pryzby
Date:
On Mon, Sep 24, 2018 at 05:59:12PM -0700, Vladimir Ryabtsev wrote:
> > I *suspect* VACUUM FULL won't help, since (AIUI) it copies all "visible"
...
> I am going copy the slow range into a table nearby and see if it reproduces
> (I hope "INSERT INTO t2 SELECT * FROM t1 WHERE ..." will keep existing
> order of rows). Then I could try the same after CLUSTER.

If it does an index scan, I think that will badly fail to keep the same order
of heap TIDs - it'll be inserting rows in ID order rather than in (I guess)
reverse ID order.

Justin


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> If it does an index scan, I think that will badly fail to keep the same order of heap TIDs - it'll be inserting rows in ID order rather than in (I guess) reverse ID order.
According to the plan, it is going to be a seq scan with a filter.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Justin Pryzby
Date:
On Mon, Sep 24, 2018 at 05:59:12PM -0700, Vladimir Ryabtsev wrote:
> > This seems significant.. it means the heap was probably written in backwards order relative to the IDs, and the OS readahead is ineffective when index scanning across a range of IDs.
> But again, why is it different for one range and another? It was reversed for both ranges.

I don't have an explaination for it.. but I'd be curious to know
pg_stats.correlation for the id column:

SELECT schemaname, tablename, attname, correlation FROM pg_stats WHERE tablename='articles' AND column='article_id' LIMIT 1;

Justin


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> but I'd be curious to know
> SELECT schemaname, tablename, attname, correlation FROM pg_stats WHERE tablename='articles' AND column='article_id' LIMIT 1;
I think you meant 'attname'. It gives
storage articles article_id -0.77380306
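pg_stats.correlation measures the statistical correlation between the physical row order and the column's logical order, so -0.77 says the heap is laid out in roughly descending article_id order, consistent with the descending import (holes and partial reordering keep it away from exactly -1). A toy illustration, plain Pearson correlation with no Postgres involved:

```python
# Pearson correlation between physical position and column value.
# Rows written to the heap in descending ID order give a correlation
# of -1.0, which is what pg_stats.correlation ≈ -0.77 is hinting at.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ids = list(range(100, 0, -1))      # IDs as written to the heap, descending
positions = list(range(len(ids)))  # physical order on disk
print(pearson(positions, ids))     # ≈ -1.0
```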

Vlad

Re: Why could different data in a table be processed with different performance?

From
Gasper Zejn
Date:

Hi, Vladimir,


Reading the whole thread it seems you should look deeper into IO subsystem.

1) Which file system are you using?

2) What is the segment layout of the LVM PVs and LVs? See https://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/report_object_selection.html how to check. If data is fragmented, maybe the disks are doing a lot of seeking?

3) Do you use LVM for any "extra" features, such as snapshots?

4) You can try using seekwatcher to see where on the disk the slowness is occurring. You get a chart similar to this http://kernel.dk/dd-md0-xfs-pdflush.png

5) BCC is a collection of tools that might shed a light on what is happening. https://github.com/iovisor/bcc


Kind regards,

Gasper

On 21. 09. 2018 02:07, Vladimir Ryabtsev wrote:
I am experiencing a strange performance problem when accessing JSONB content by primary key.

My DB version() is PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
postgres.conf: https://justpaste.it/6pzz1
uname -a
: Linux postgresnlpslave 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
The machine is virtual, running under Hyper-V.
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1x1 cores
Disk storage: the host has two vmdx drives, first shared between the root partition and an LVM PV, second is a single LVM PV. Both PVs are in a VG containing swap and postgres data partitions. The data is mostly on the first PV.

I have such a table:

CREATE TABLE articles
(
    article_id bigint NOT NULL,
    content jsonb NOT NULL,
    published_at timestamp without time zone NOT NULL,
    appended_at timestamp without time zone NOT NULL,
    source_id integer NOT NULL,
    language character varying(2) NOT NULL,
    title text NOT NULL,
    topics topic[] NOT NULL,
    objects object[] NOT NULL,
    cluster_id bigint NOT NULL,
    CONSTRAINT articles_pkey PRIMARY KEY (article_id)
)


We have a Python lib (using psycopg2 driver) to access this table. It executes simple queries to the table, one of them is used for bulk downloading of content and looks like this:

select content from articles where id between $1 and $2

I noticed that with some IDs it works pretty fast, while with others it is 4-5 times slower. It is worth noting that there are two main 'categories' of IDs in this table: the first is the range 270000000-500000000, and the second is the range 10000000000-100030000000. For the first range it is 'fast', and for the second it is 'slow'. Besides the larger absolute values (taking them from int into bigint territory), values in the second range are more 'sparse': in the first range values are almost consecutive (with very few 'holes' of missing values), while in the second range there are many more 'holes' (average filling is 35%). Total number of rows in the first range: ~62M; in the second range: ~10M.

I conducted several experiments to eliminate possible influence of library's code and network throughput, I omit some of them. I ended up with iterating over table with EXPLAIN to simulate read load:

explain (analyze, buffers)
select count(*), sum(length(content::text)) from articles where article_id between %s and %s


Sample output:

Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
  Buffers: shared hit=26847 read=3914
  ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
        Index Cond: ((article_id >= 438000000) AND (article_id <= 438005000))
        Buffers: shared hit=4342 read=671
Planning time: 0.393 ms
Execution time: 6626.136 ms

Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
  Buffers: shared hit=6568 read=7104
  ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
        Index Cond: ((article_id >= '100021000000'::bigint) AND (article_id <= '100021010000'::bigint))
        Buffers: shared hit=50 read=2378
Planning time: 0.517 ms
Execution time: 33219.218 ms


During iteration, I parse the result of EXPLAIN and collect series of following metrics:

- buffer hits/reads for the table,
- buffer hits/reads for the index,
- number of rows (from "Index Scan..."),
- duration of execution.

Based on metrics above I calculate inherited metrics:

- disk read rate: (index reads + table reads) * 8192 / duration,
- reads ratio: (index reads + table reads) / (index reads + table reads + index hits + table hits),
- data rate: (index reads + table reads + index hits + table hits) * 8192 / duration,
- rows rate: number of rows / duration.
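The derived metrics above can be sketched in a few lines (using the top-level buffer counters, which include the index scan's, rather than splitting index and table; figures from the first sample plan):

```python
# Derived metrics from EXPLAIN (ANALYZE, BUFFERS) counters, following the
# formulas above. Hits/reads count 8 KB buffers; duration is in seconds.
BLOCK = 8192

def derived(hits: int, reads: int, rows: int, duration_s: float) -> dict:
    return {
        "disk_read_rate_mb_s": reads * BLOCK / duration_s / 1e6,
        "reads_ratio": reads / (reads + hits),
        "data_rate_mb_s": (reads + hits) * BLOCK / duration_s / 1e6,
        "rows_rate": rows / duration_s,
    }

# First sample plan: shared hit=26847 read=3914, 5000 rows, 6.626 s total.
m = derived(hits=26847, reads=3914, rows=5000, duration_s=6.626)
print(m)  # reads_ratio comes out around 0.13 for this plan
```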

Since "density" of IDs is different in "small" and "big" ranges, I adjusted size of chunks in order to get around 5000 rows on each iteration in both cases, though my experiments show that chunk size does not really matter a lot.

The issue posted at the very beginning of my message was confirmed for the *whole* first and second ranges (so it was not just caused by randomly cached data).

To eliminate cache influence, I restarted Postgres server with flushing buffers:

/$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql start

After this I repeated the test and got next-to-same picture.

"Small" range: disk read rate is around 10-11 MB/s uniformly across the test. Output rate was 1300-1700 rows/s. Read ratio is around 13% (why? Shouldn't it be ~ 100% after drop_caches?).
"Big" range: In most of time disk read speed was about 2 MB/s but sometimes it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but varied a lot and reached 8000 rows/s). Read ratio also varied a lot.

I rendered series from the last test into charts:
"Small" range: https://i.stack.imgur.com/3Zfml.png
"Big" range (insane): https://i.stack.imgur.com/VXdID.png

During the tests I verified disk read speed with iotop and found its indications very close to ones calculated by me based on EXPLAIN BUFFERS. I cannot say I was monitoring it all the time, but I confirmed it when it was 2 MB/s and 22 MB/s on the second range and 10 MB/s on the first range. I also checked with htop that CPU was not a bottleneck and was around 3% during the tests.

The issue is reproducible on both master and slave servers. My tests were conducted on slave, while there were no any other load on DBMS, or disk activity on the host unrelated to DBMS.

My only assumption is that different fragments of data are being read with different speed due to virtualization or something, but... why is it so strictly bound to these ranges? Why is it the same on two different machines?

The file system performance measured by dd:

root@postgresnlpslave:/# echo 3 > /proc/sys/vm/drop_caches 
root@postgresnlpslave:/# dd if=/dev/mapper/postgresnlpslave--vg-root of=/dev/null bs=8K count=128K
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.12304 s, 506 MB/s

Am I missing something? What else can I do to narrow down the cause?

Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:


On 25/09/18 00:28, Vladimir Ryabtsev wrote:

> it is not unusual to have 1GB cache or more...  and do not forget to drop the cache between tests + do a sync
I conducted several long runs of dd, so I am sure these numbers are fairly accurate. However, what worries me is that I tested sequential read speed, while during my experiments Postgres might need to read from random places, reducing the real read speed dramatically. I have a feeling that this may be the reason.
I also reviewed the import scripts and found the import was done in DESCENDING order of IDs. It was done that way to get the most recent records sooner; maybe it caused some inefficiency in the storage... But again, it was the same for both ranges.

> - how big is your index?
pg_table_size('articles_pkey') = 1561 MB

> - how big is the table?
pg_table_size('articles') = 427 GB
pg_table_size('pg_toast.pg_toast_221558') = 359 GB


Since you have a very big TOAST table, and given you are using spinning disks, I think that increasing the block size will bring benefits. (Partitioning is also not a bad idea.)

If my understanding of TOAST is correct, if data fits in blocks of, let's say, 16 or 24 KB, then one block retrieval from Postgres will result in fewer seeks on the disk and less chance that the data is scattered across your disk (a very quick and dirty calculation shows your average record is 17 KB).

One thing you might want to have a look at, is again the RAID controller and your OS. You might want to have all of them aligned in block size, or maybe have Postgres ones a multiple of what OS and RAID controller have.



> - given the size of shared_buffers, almost 2M blocks should fit, but you say 2 consecutive runs still are hitting the disk. That's strange indeed since you are using way more than 2M blocks.
TBH, I cannot say I understand your calculations with number of blocks...
shared_buffers = 15GB IIRC (justpaste link is gone)

15 * 1024 *1024 = 15728640 KB

using 8KB blocks = 1966080 total blocks

if you query shared_buffers you should get the same number of total available blocks
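The same arithmetic as a quick check (the SQL comment shows where Postgres reports the figure itself):

```python
# Fabio's arithmetic: 15 GB of shared_buffers divided into default 8 KB blocks.
shared_buffers_kb = 15 * 1024 * 1024   # 15 GB expressed in KB
block_kb = 8

total_blocks = shared_buffers_kb // block_kb
print(total_blocks)  # 1966080

# Postgres reports the same figure in 8 KB units:
#   SELECT setting FROM pg_settings WHERE name = 'shared_buffers';
```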

But to clarify: consecutive runs with the SAME parameters do NOT hit the disk; only the first one does, and subsequent runs read only from the buffer cache.

I'm a bit confused... every query you pasted contains 'read':

        Buffers: shared hit=50 read=2378

and 'read' means you are reading from disk (or OS cache). Or not?
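For reference, in EXPLAIN (ANALYZE, BUFFERS) output 'hit' counts pages found in shared_buffers, while 'read' counts pages fetched from outside it (OS page cache or disk). The pasted line can be turned into a rough buffer-cache hit ratio (a sketch):

```python
# 'hit'  = pages found in PostgreSQL shared_buffers
# 'read' = pages fetched from outside shared_buffers (OS page cache or disk)
# Using the pasted "Buffers: shared hit=50 read=2378" line:
hit, read = 50, 2378

hit_ratio = hit / (hit + read) * 100
print(f"{hit_ratio:.1f}% of pages came from shared_buffers")
```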



> - As Laurenz suggested (VACUUM FULL), you might want to move data around. You can try also a dump + restore to narrow the problem to data or disk
I launched VACUUM FULL, but it ran very slowly; according to my calculation it might take 17 hours. I will try to copy the data into another table with the same structure, or spin up another server, and will let you know.

Cool, that should also clarify whether the reverse insertion order matters or not.

> - You might also want to try to see the disk graph of Windows, while you are running your tests. It can show you if data (and good to know how much) is actually fetching from disk or not.
I wanted to do so, but I don't have access to the Hyper-V server; I will try to request credentials from the admins.

A couple more observations:
1) The result of my experiment is almost unaffected by other server load. Another user was running a query (over this table) with a read speed of ~130 MB/s, while my query read at 1.8-2 MB/s.
2) iotop shows a higher IO % (~93-94%) at the slower read speed (though it is not quite clear what this field means). The process from the example above had ~55% IO at 130 MB/s, while my process had ~93% at ~2 MB/s.

I think that is because you are looking at the 'IO' column, which indicates (from the manual) '..the percentage of time the thread/process spent [..] while waiting on I/O.'



regards,

fabio pardi

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> 1) Which file system are you using?
From Linux's point of view it's ext4. The actual vmdx file on Hyper-V is stored on NTFS, as far as I know.

> 2) What is the segment layout of the LVM PVs and LVs?
I am a bit lost with this. Is the following what you are asking about?
master:
# pvs --segments
  PV         VG        Fmt  Attr PSize  PFree  Start SSize
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m     0  4926
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m  4926   127
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m  5053     5
# lvs --segments
  LV     VG        Attr      #Str Type   SSize  
  root   ubuntu-vg -wi-ao---    1 linear  19.24g
  swap_1 ubuntu-vg -wi-ao---    1 linear 508.00m


slave:
# pvs --segments
  PV         VG                  Fmt  Attr PSize   PFree Start SSize
  /dev/sda3  postgresnlpslave-vg lvm2 a--  429.77g    0      0 110021
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g    0      0  28392
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g    0  28392   2199
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g    0  30591   2560
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g    0  33151  10246
  /dev/sdb1  postgresnlpslave-vg lvm2 a--  512.00g    0      0 131071
# lvs --segments
  LV     VG                  Attr       #Str Type   SSize  
  root   postgresnlpslave-vg -wi-ao----    1 linear 110.91g
  root   postgresnlpslave-vg -wi-ao----    1 linear  40.02g
  root   postgresnlpslave-vg -wi-ao----    1 linear  10.00g
  root   postgresnlpslave-vg -wi-ao----    1 linear 429.77g
  root   postgresnlpslave-vg -wi-ao----    1 linear 512.00g
  swap_1 postgresnlpslave-vg -wi-ao----    1 linear   8.59g


> 3) Do you use LVM for any "extra" features, such as snapshots?
I don't think so, but how can I check? vgs gives #SN = 0, is that it?

> 4) You can try using seekwatcher to see where on the disk the slowness is occurring. You get a chart similar to this http://kernel.dk/dd-md0-xfs-pdflush.png
> 5) BCC is a collection of tools that might shed a light on what is happening. https://github.com/iovisor/bcc
Will look into it.

Regards,
Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> Since you have a very big toast table, given you are using spinning disks, I think that increasing the block size will bring benefits.
But will it worsen caching? I would have fewer slots in the cache. Also, will it affect the required storage space?

>> consecutive runs with SAME parameters do NOT hit the disk, only the first one does, consequent ones read only from buffer cache.
> I m  a bit confused.. every query you pasted contains 'read':
>        Buffers: shared hit=50 read=2378
> and 'read' means you are reading from disk (or OS cache). Or not? 
Yes, sorry, it was just my misunderstanding of what "consecutive" means. To make it clear: I iterate over all data in the table with one query, changing the parameters on each iteration (e.g. advancing both borders by 5000). In this case I get disk reads on each query run (many more reads in the "slow" range). But if I request data from an area queried previously, it reads from cache and does not hit the disk (in both ranges). E.g. iterating over 1M records with an empty cache takes ~11 minutes in the "fast" range and ~1 hour in the "slow" range, while the second time it takes only ~2 minutes for both ranges (if I don't do drop_caches).
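The iteration scheme described above can be sketched like this (the function name and batch generation are illustrative, not the actual lib code):

```python
def id_batches(start: int, stop: int, step: int = 5000):
    """Yield inclusive (lower, upper) ID borders, advancing both by `step`."""
    lower = start
    while lower <= stop:
        yield (lower, min(lower + step - 1, stop))
        lower += step

# First batches over part of the "fast" ID range from the thread:
batches = list(id_batches(270_000_000, 270_019_999))
print(batches[0])   # (270000000, 270004999)
print(len(batches)) # 4 batches of 5000 IDs
# Each pair would be bound into:
#   SELECT content FROM articles WHERE article_id BETWEEN %s AND %s
```

Every batch touches a fresh slice of the table, which is why each run hits the disk, while re-running an already-visited slice is served from cache.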

Regards,
Vlad

Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:

On 09/26/2018 07:15 PM, Vladimir Ryabtsev wrote:
>> Since you have a very big toast table, given you are using spinning
>> disks, I think that increasing the block size will bring benefits.
> But will it worsen caching? I will have lesser slots in cache. Also will
> it affect required storage space?


I think in your case it will not worsen the cache. You will have fewer
slots in the cache, but the total available cache will indeed be
unchanged (half the blocks at double the size). It could affect storage
space for the smaller records. Much depends on which block size you
choose and on how your data is actually distributed within the ranges you
mentioned (e.g. a 10-20 KB range might lean more toward the 10 KB or the
20 KB side).

Imagine you request a record of 24 KB while using 8 KB blocks. That
results in 3 different blocks being looked up, requested, and returned.
Those 3 blocks might be scattered on disk, possibly resulting in 3
different seeks. Having it all in one block avoids this problem.
The downside is that if you need to store only 8 KB of data, you will still allocate 24 KB.
You say you do not do updates, so it might also be the case that when
you write the data all at once (24 KB in one go), it lands together in a
contiguous stripe; in that case changing the block size will bring nothing.
This is very much data and usage driven. Changing the block size is a
painful thing, because IIRC you can only do it at db initialization time.
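That 24 KB example as arithmetic (a sketch; real storage adds page and tuple headers on top):

```python
import math

def blocks_needed(record_kb: int, block_kb: int) -> int:
    """Blocks a record spans, ignoring headers (a simplification)."""
    return math.ceil(record_kb / block_kb)

# Fabio's example: a 24 KB record
print(blocks_needed(24, 8))   # 3 separate block fetches with 8 KB blocks
print(blocks_needed(24, 24))  # 1 fetch with 24 KB blocks

# The downside he mentions: storing 8 KB of data in a 24 KB block
wasted_kb = blocks_needed(8, 24) * 24 - 8
print(wasted_kb)  # 16 KB allocated but unused
```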

Similarly, if your RAID controller uses, for instance, 128 KB chunks,
each time you read one 8 KB block it will return a whole 128 KB chunk
to you, which is quite a waste of resources.
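The read amplification in that example (the sizes are the thread's illustrative numbers, not measured values):

```python
# Worst case for unaligned random reads: every 8 KB Postgres block read
# pulls a full RAID chunk from disk.
raid_chunk_kb = 128
pg_block_kb = 8

amplification = raid_chunk_kb // pg_block_kb
print(amplification)  # 16x the requested data transferred per random read
```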

If your 'slow' range is fragmented here and there on the disk, the lack
of proper alignment between the Postgres blocks, the filesystem, and the
RAID might worsen the problem by orders of magnitude. This is especially
true on spinning disks, where the seek time is noticeable.

Note that setting a very small block size has the opposite effect: you
might hit the IOPS limit of your hardware and create a bottleneck. (Been
there while benchmarking some new hardware.)

But before going through all this, I would first try to reload the data
with dump+restore onto a new machine and see how it behaves.

Hope it helps.

regards,

fabio pardi



Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> Does your LVM have readahead
> ramped up? Try lvchange -r 65536 data/postgres (or similar).

Changed this from 256 to 65536.
If it is supposed to take effect immediately (no server reboot or other changes needed), then I've got no change in performance. None at all.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> You will have fewer slots in the cache, but the total available cache
> will indeed be unchanged (half the blocks at double the size).
But we have many other tables, and queries against them may suffer from the smaller number of blocks in the buffer cache.

> To change block size is a
> painful thing, because IIRC you do that at db initialization time
My research shows that I can only change it at compile time,
and then initdb a new cluster...
Moreover, this table/schema is not the only one in the database; there is a bunch of other schemas, and we would need to dump and restore everything. So this is super-painful.

> It could affect space storage, for the smaller blocks.
But to what extent? As I understand it, rows are not "aligned" to the block size; is it only a low-level IO thing with the data files?

> But before going through all this, I would first try to reload the data
> with dump+restore into a new machine, and see how it behaves.
Yes, that is the plan; I'll be back once I find enough disk space for my further experiments.

Vlad

Re: Why could different data in a table be processed with different performance?

From
Fabio Pardi
Date:

On 28/09/18 11:56, Vladimir Ryabtsev wrote:
>
> > It could affect space storage, for the smaller blocks.
> But to what extent? As I understand it, rows are not "aligned" to the block size; is it only a
> low-level IO thing with the data files?
>

Maybe 'for the smaller blocks' was not very meaningful. What I meant was: 'in terms of wasted disk space'.

An example:

create table test_space (i int);

empty table:

select pg_total_relation_size('test_space');
 pg_total_relation_size
------------------------
                      0
(1 row)

insert one single record:

insert into test_space values (1);


select pg_total_relation_size('test_space');
 pg_total_relation_size
------------------------
                   8192


select pg_relation_filepath('test_space');
 pg_relation_filepath
----------------------
 base/16384/179329


ls -alh base/16384/179329
-rw------- 1 postgres postgres 8.0K Sep 28 16:09 base/16384/179329

That means that if your block size were bigger, you would have a bigger space allocation for one single record.

regards,

fabio pardi


Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
> That means, if your block size was bigger, then you would have bigger space allocated for one single record.
But if I INSERT a second, third ... hundredth record into the table, the size remains 8 KB.
So my point is that if one decides to increase the block size, the increase in storage space is not so significant, because the block size does not set a minimum storage unit for a row.
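A back-of-envelope illustration of that point (the overheads below are approximations, not exact catalog numbers): many small rows share one page, so the page size is not a per-row minimum.

```python
# Approximate per-row costs on an 8 KB heap page: ~24 B page header,
# 4 B line pointer per row, ~24 B tuple header, int column padded to 8 B.
PAGE_BYTES = 8192
PAGE_HEADER = 24
PER_ROW = 4 + 24 + 8  # line pointer + tuple header + padded int column

rows_per_page = (PAGE_BYTES - PAGE_HEADER) // PER_ROW
print(rows_per_page)  # roughly 226 single-int rows fit in one 8 KB page
```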

vlad

On 28/09/18 21:51, Vladimir Ryabtsev wrote:
> > That means, if your block size was bigger, then you would have bigger space allocated for one single record.
> But if I INSERT second, third ... hundredth record in the table, the size remains 8K.


> So my point is that if one decides to increase block size, increasing storage space is not so significant, because it
> does not set minimum storage unit for a row.
>
ah, yes, correct. Now we are on the same page.

Good luck with the rest of things you are going to try out, and let us know your findings.

regards,

fabio pardi

> vlad



Re: Why could different data in a table be processed with different performance?

From
Vladimir Ryabtsev
Date:
FYI, posting an intermediate update on the issue.

I disabled index scans to preserve the existing order, and copied part of the "slow" range into another table (3M rows in a 2.2 GB table plus a 17 GB toast). I was able to reproduce the slow reads from this copy. Then I performed CLUSTER on the copy using the PK, and everything improved significantly: overall time became 6 times faster, with a disk read speed (reported by iotop) of 30-60 MB/s.

I think we can take bad physical data distribution as the main hypothesis for the issue. I was not able to launch seekwatcher, though (it does not work out of the box on Ubuntu, and I failed to rebuild it), to confirm the large number of seeks.

I still don't have enough disk space to fix the original table; I am waiting for it from the admin/devops team.

My plan is to partition the original table and CLUSTER every partition on primary key once I have space.

Best regards,
Vlad
Hi!

Could someone discuss about following? It would be great to hear comments!

There is a good storage system. According to fio, the write speed could be e.g. 3 GB/s.
(It is my first time using the command, so I am not certain of the real speed reported by fio, e.g. with --bs=100m and direct=1. The measurement may also be faulty, so I am still checking.)

Currently there is one table, without partitioning. The table contains JSON data. In PostgreSQL, on Linux.

The write speed through PostgreSQL can be e.g. 300-600 MB/s, measured with dstat while inserting. shared_buffers is large in PostgreSQL.

With a storage/disk which "scales", is there some way to write to the disk faster through PostgreSQL, inside the same server?

Does splitting the data help? A partitioned table, or splitting into smaller tables? Should I test it?

Or change settings somewhere? Block sizes: 8 KB / 16 KB, ...? "Dangerous" to change?

2nd question, sharding:

If the storage/"disk" scales, could better *disk write speed* be achieved (in total) with sharding-style splitting of the data? (The same NAS storage, which scales, in use on all shards.)
Sharding, or only one server? From a pure disk write speed point of view.


BR Sam

Hi!

Still on write speed to disk:

With fio: 3 GB/s.
With PostgreSQL: 350-500 MB/s, also with a partitioned table (and something similar with another DBMS). Monitored with dstat.

With about 12 threads it does not get any better.
Same results with the data split into 4 partitions in a partitioned table; CPU load increased, but is not saturated yet.
Same results with open_datasync.

BR Sam


On Fri, Oct 12, 2018 at 19:27, Sam R. <samruohola@yahoo.com> wrote: