Hi friends!
I'd love to get a sanity check on whether a fat select query I'm doing makes sense given the infrastructure that we have.
We have 3 big tables that we typically join together for certain queries: a ~40 million row photos table, a ~20 million row users table, and a ~50 million row photo_to_album table that maps photos to albums.
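For context, here's a stripped-down sketch of the relevant columns (only the ones this query touches; names and types are approximate and the real tables have many more columns):

create table photos (
    id         bigint primary key,
    created_at timestamptz
    -- ... plus the rest of the photo columns
);

create table photo_to_album (
    photo_id bigint references photos (id),
    album_id bigint
    -- ... plus some mapping metadata
);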
We like to display real-time analytics, which often results in a query like:
select (random aggregations)
from photo_to_album
join photos on photos.id = photo_to_album.photo_id
where photo_to_album.album_id = <something>
  and photos.created_at between <some dates>
  and <other junk>
We have indexes on all of the join columns and the where-clause columns.
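Roughly what those look like (index names are illustrative; I'm paraphrasing rather than copying from the schema):

create index photo_to_album_album_id_idx on photo_to_album (album_id);
create index photo_to_album_photo_id_idx on photo_to_album (photo_id);
create index photos_created_at_idx on photos (created_at);
-- photos.id is the primary key, so it already has an index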
One of these queries, which should be targeting something like 300K photos, takes 38 seconds to run cold (with an aggregate/nested loop accounting for effectively all of that time), and then about 4 seconds on a second execution with a warm cache.
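The node timing above is from the query plan; if it's useful, I can rerun it along these lines (BUFFERS is just my guess at the most helpful extra detail to capture) so the output also shows how many blocks come from disk versus shared buffers:

explain (analyze, buffers)
select (random aggregations)
from photo_to_album
join photos on photos.id = photo_to_album.photo_id
where photo_to_album.album_id = <something>
  and photos.created_at between <some dates>;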
Also worryingly, while the query runs it spikes read IOPS to almost 1,500/sec and write IOPS to around 200/sec. When it's not running, steady-state read IOPS are basically nil and writes hover around 50-100.
It also pushes the queue depth from basically 0 up to 6, and keeping the queue depth that high seems to cause timeouts in other queries. CPU is barely affected, hovering around 20%, and memory is barely touched either.
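In case it helps with the cold/warm question below, this is the sort of buffer-cache hit-ratio check I can run against the standard pg_statio_user_tables view (my own sketch, not something we already track):

select
    sum(heap_blks_read) as blocks_read_from_disk,
    sum(heap_blks_hit)  as blocks_hit_in_cache,
    round(sum(heap_blks_hit)::numeric
          / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 4) as hit_ratio
from pg_statio_user_tables;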
We're on an RDS Postgres database: an m3.2xlarge with 2,000 provisioned IOPS and 400 GB of storage, which translates to 8 virtual CPUs, 30 GiB of memory, and SSD storage.
Several questions here:
1) Is that level of IOPS normal?
2) Is it bad that this level of IOPS can queue up requests and screw up the whole database, even though these are just select queries? Especially when CPU and memory are still plentiful?
3) What is up with the huge difference between cold and warm cache?
Any help is appreciated!