Thread: Very slow 101-feeling design/query..

Very slow 101-feeling design/query..

From
Wells Oliver
Date:
This feels very 101 but I feel like it should be much faster:

A table "joints" with a PK of play_uuid, target_id, joint_seq, joint_timestamp.

"joints" is partitioned using RANGE on joint_timestamp for monthly partitions 1/1 - 2/1, 2-1 - 3/1, etc.

"joints" has an FK where play_uuid refers to table "plays" and column "play_uuid" where "play_uuid" is the PK.

"plays" additionally has an indexed column game_id.

"joints" has 1133,932,391 rows across 12 monthly partitions for 2021, and "plays has 585,627 rows. We made a view called "vw_joints" which just does:

SELECT * FROM joints JOIN plays USING (play_id);

Then doing:

SELECT DISTINCT game_id FROM vw_joints

Takes 35-45 minutes, which seems nuts. We use this kind of design in a few different places to normalize things, but it comes at the cost of these agonizingly slow (and seemingly dead simple) queries.

Is there any optimization to do here beyond flattening the table and de-normalizing data? Is the partitioning causing slowness here? I feel like partitioning is creating some difficulty...

Appreciate it.

--

Re: Very slow 101-feeling design/query..

From
Ron
Date:

What PostgreSQL version?

What does the query plan look like?

-- 
Angular momentum makes the world go 'round.



Re: Very slow 101-feeling design/query..

From
Wells Oliver
Date:
PG 13.4. Can't quite run the EXPLAIN ANALYZE since it takes so long, but EXPLAIN SELECT DISTINCT game_id FROM vw_joints shows (s = joints, p = plays):

----------------------------------------------------------------------------------------------------
 HashAggregate  (cost=63810150.11..63810168.40 rows=1829 width=4)
   Group Key: p.game_id
   ->  Hash Left Join  (cost=21647.78..60977838.19 rows=1132924766 width=4)
         Hash Cond: (s.play_uuid = p.play_uuid)
         ->  Append  (cost=0.00..57982241.49 rows=1132924766 width=16)
               ->  Seq Scan on joints_2021_01 s_1  (cost=0.00..13.00 rows=300 width=16)
               ->  Seq Scan on joints_2021_02 s_2  (cost=0.00..13.00 rows=300 width=16)
               ->  Seq Scan on joints_2021_03 s_3  (cost=0.00..13.00 rows=300 width=16)
               ->  Seq Scan on joints_2021_04 s_4  (cost=0.00..2217902.16 rows=49859816 width=16)
               ->  Seq Scan on joints_2021_05 s_5  (cost=0.00..2965019.35 rows=63440735 width=16)
               ->  Seq Scan on joints_2021_06 s_6  (cost=0.00..3797848.89 rows=85688889 width=16)
               ->  Seq Scan on joints_2021_07 s_7  (cost=0.00..5867829.24 rows=115975424 width=16)
               ->  Seq Scan on joints_2021_08 s_8  (cost=0.00..17274328.41 rows=380175741 width=16)
               ->  Seq Scan on joints_2021_09 s_9  (cost=0.00..18226427.88 rows=393209088 width=16)
               ->  Seq Scan on joints_2021_10 s_10  (cost=0.00..1942824.01 rows=44003201 width=16)
               ->  Seq Scan on joints_2021_11 s_11  (cost=0.00..25385.72 rows=570672 width=16)
               ->  Seq Scan on joints_2021_12 s_12  (cost=0.00..13.00 rows=300 width=16)
         ->  Hash  (cost=14292.90..14292.90 rows=588390 width=20)
               ->  Seq Scan on plays p  (cost=0.00..14292.90 rows=588390 width=20)
 JIT:
   Functions: 34
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(22 rows)




Re: Very slow 101-feeling design/query..

From
Ron
Date:
What table is play_id in, and is it indexed?


--
Angular momentum makes the world go 'round.

Re: Very slow 101-feeling design/query..

From
Wells Oliver
Date:
Sorry, that was a typo; there is no play_id. The view is defined as SELECT * FROM joints JOIN plays USING (play_uuid);



Re: Very slow 101-feeling design/query..

From
Ron
Date:
This is why you need to paste actual queries and table definitions.  (Snip out columns with names you don't want on the internet.)

Have the tables been recently analyzed?
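
For example (just a sketch; pg_stat_user_tables shows when each table was last analyzed):

ANALYZE plays;
ANALYZE joints;  -- note: autovacuum analyzes the leaf partitions but never the partitioned parent itself

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'plays' OR relname LIKE 'joints%';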


--
Angular momentum makes the world go 'round.

Re: Very slow 101-feeling design/query..

From
Tom Lane
Date:
Wells Oliver <wells.oliver@gmail.com> writes:
> PG 13.4. Can't quite run the EXPLAIN ANALYZE since it takes so long, but
> EXPLAIN SELECT DISTINCT game_id FROM vw_joints shows (s = joints, p = plays)

I doubt you're going to get a query plan that's significantly better
than what you show here.  But those tables are pretty large, so
I wonder whether you have cranked up work_mem.

One potential issue is that the planner thinks there are only
about 1829 distinct game_id values:

>  HashAggregate  (cost=63810150.11..63810168.40 rows=1829 width=4)
>    Group Key: p.game_id

Is that about the right size of result for this?
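
For example, something along these lines would show the current setting and what the planner believes about game_id (the session-level SET is just an experiment, not a recommendation):

SHOW work_mem;
SET work_mem = '512MB';   -- session-only bump, to see if the hash work fits in memory

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'plays' AND attname = 'game_id';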

            regards, tom lane



Re: Very slow 101-feeling design/query..

From
Rui DeSousa
Date:


On Dec 10, 2021, at 5:53 PM, Wells Oliver <wells.oliver@gmail.com> wrote:

SELECT DISTINCT game_id FROM vw_joints

Seeing DISTINCT makes me cringe; most uses of it foreshadow either a poor design or a bad query. Not always, but more often than not.

If you're really trying to do a distinct, you can select from the game table and do an EXISTS against the plays and joints tables, removing the DISTINCT clause entirely; but I digress, as the given query is most likely an oversimplified example.

Partitioning does nothing here, as the query doesn't filter on the partitioning column, so there is no partition elimination.
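
Roughly this shape; the games table here is hypothetical (a driving table with one row per game), and plays could stand in for it if no such table exists:

SELECT g.game_id
FROM games g
WHERE EXISTS (
    SELECT 1
    FROM plays p
    JOIN joints j USING (play_uuid)
    WHERE p.game_id = g.game_id
);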


RE: Very slow 101-feeling design/query..

From
"Michel SALAIS"
Date:

Hi

Using a view that joins two tables just to access things in one of them is not a good idea: you pay for the join even when it is not necessary. So if you just want the distinct game_id values, and you don't care whether each play_uuid actually appears in joints, you can speed this up by using the table plays directly.
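
In other words, something like one of these, depending on whether the joints side actually matters for the result:

SELECT DISTINCT game_id FROM plays;

-- or, if a game should only count when its plays actually have joint rows:
SELECT DISTINCT p.game_id
FROM plays p
WHERE EXISTS (SELECT 1 FROM joints j WHERE j.play_uuid = p.play_uuid);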

 

Michel SALAIS

--

Re: Very slow 101-feeling design/query..

From
Wells Oliver
Date:
We have hundreds of millions of joint locations across hundreds of thousands of plays across thousands of games. We also have a few hundred games with plays in them, but no joint measurements. The joint data is a measurement of body joints through time (up to 300x a second).

So my idea was the plays table contains the game identifier (there is no root game table), and the joints table refers to the plays table by a play UUID.

As an example: One game contains 231 plays. There are 396,144 rows of joint data.

I partitioned the joints data by month because of the volume.

The end result is that using the joints view (which joins to plays) is just dead, grinding slow, and I'm trying to think of alternatives that stay within Postgres; maybe BigQuery and/or Redshift would be a more appropriate solution, but I don't want to go there quite yet.

Just trying to optimize this design and open to ideas. Thanks.


--

Re: Very slow 101-feeling design/query..

From
Mladen Gogala
Date:

Well, you can create a trigger which fires whenever a row is inserted or deleted and maintains the joint, game_id, and count in a separate summary table. That is the usual solution for the problem you described. Also, if there is a small number of games, you can use a hash index instead of a B-tree index.
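
A very rough sketch of that idea follows; every name here is made up, and the per-row trigger overhead on bulk loads plus contention on the count rows are things you would have to weigh:

-- hypothetical summary table, maintained by a trigger on joints
CREATE TABLE game_joint_counts (
    game_id     integer PRIMARY KEY,
    joint_count bigint  NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION maintain_game_joint_counts() RETURNS trigger AS $$
DECLARE
    g integer;
BEGIN
    IF TG_OP = 'INSERT' THEN
        SELECT game_id INTO g FROM plays WHERE play_uuid = NEW.play_uuid;
        INSERT INTO game_joint_counts AS c (game_id, joint_count)
        VALUES (g, 1)
        ON CONFLICT (game_id) DO UPDATE SET joint_count = c.joint_count + 1;
    ELSE  -- DELETE
        SELECT game_id INTO g FROM plays WHERE play_uuid = OLD.play_uuid;
        UPDATE game_joint_counts SET joint_count = joint_count - 1
        WHERE game_id = g;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

-- row-level AFTER triggers on a partitioned table are allowed in PG 13
-- and are cloned to each partition
CREATE TRIGGER joints_maintain_counts
AFTER INSERT OR DELETE ON joints
FOR EACH ROW EXECUTE FUNCTION maintain_game_joint_counts();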


--

Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: Very slow 101-feeling design/query..

From
Wells Oliver
Date:
Yeah, I can take a look at that. The other issue is that simply doing SELECT * FROM vw_joints WHERE game_id = 123 is slow: joints joins to plays on play_uuid, and the index on game_id is in plays, so that query is slower than I want it to be.
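
For what it's worth, one rewrite to try is making the query drive from plays explicitly rather than going through the view; whether the planner actually does anything different with it is something to test:

SELECT j.*
FROM joints j
WHERE j.play_uuid IN (
    SELECT play_uuid FROM plays WHERE game_id = 123
);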


--

Re: Very slow 101-feeling design/query..

From
Doug Reynolds
Date:
Keep in mind that using UUIDs as an ID on large data sets will use roughly double the memory and index size of a bigint key.  Obviously, if that is set in stone, you probably can't readily change it, but I'd think twice before using UUIDs as the join key for billions of rows.

Sent from my iPhone

--

Re: Very slow 101-feeling design/query..

From
Wells Oliver
Date:
That's an interesting point. On the one hand, it's the data set we've been given (UUIDs are the vendor's choice); on the other hand, we could assign an integer/bigint identifier per UUID and use that instead for joining purposes.
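
If we did go that route, I imagine it would look roughly like this (table and column names purely illustrative):

-- one 8-byte surrogate per vendor UUID
CREATE TABLE play_ids (
    play_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    play_uuid uuid   NOT NULL UNIQUE
);

INSERT INTO play_ids (play_uuid)
SELECT play_uuid FROM plays
ON CONFLICT (play_uuid) DO NOTHING;

-- then carry play_id instead of play_uuid in joints and join on that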


--