Thread: Optimizing search query with sorting by creation field

Optimizing search query with sorting by creation field

From
Droid Tools
Date:
Hi, 

I'm looking for tips on optimizing a search query where someone searches for content within a post and wants to sort the results by a timestamp. I have a table `posts` with a `created` field (timestamp) and a post_tsv column (TSVECTOR). Likewise I have a GIN Index on the `post_tsv` field and a separate index on the `created` field. 

My initial, naive, attempt was to simply to do something like: 

```
SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) ORDER BY created DESC
```

However, I didn't realize in this case the `created` index would be ignored, which means if there was a large number of posts returned, this query would take several seconds to execute. Also the planner would do weird things even if the result set was small and still take several seconds to execute. Currently I've papered over the problem by issuing a subquery and sorting that instead.

 ```
SELECT * FROM 
  (SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) LIMIT 10,000) q
ORDER BY created DESC
```

In short I execute the search, limit that to 10,000 rows, and then order the 10,000 rows that were returned. This worked amazingly for queries that returned fewer than 10,000 rows as those queries went from taking several seconds to run down to a handful of milliseconds. The problem is for queries with more than 10,000 rows you essentially end up with random results. I'm still not using the created index, but sorting 10,000 rows in memory is relatively fast.

I'm stuck where to go from here - what I would like, since I know I will only ever ORDER BY the created field is to build some index where the default ordering is by the created field. GIN, as I understand it, doesn't support indexing in this manner (using one of the columns as a sort field). Is there anything else I could try?

Re: Optimizing search query with sorting by creation field

From
Vijaykumar Jain
Date:
What is your baseline expectation?
With what size of db table, what query should take how much time
How much server resources can be used?
 

If this seems to be a timeseries db,
Are the rows append only  and random insertion order ?

You are create partitions based on time and sub partitions on some other field that can be used as predictable to exclude scanning other partitions and speed up retrieved rows.
Clustering of tables based on timestamp may help reduce sort time if order by is always used.

Basically a lot of stuff for improvement I would list are used by timescaledb which is optimized for time series based queries.

If you want or do not want to use timescaledb, 
I guess it has enough content to help plan time series based data query and retrieve data.






On Fri, May 7, 2021, 6:12 AM Droid Tools <droidtools@gmail.com> wrote:
Hi, 

I'm looking for tips on optimizing a search query where someone searches for content within a post and wants to sort the results by a timestamp. I have a table `posts` with a `created` field (timestamp) and a post_tsv column (TSVECTOR). Likewise I have a GIN Index on the `post_tsv` field and a separate index on the `created` field. 

My initial, naive, attempt was to simply to do something like: 

```
SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) ORDER BY created DESC
```

However, I didn't realize in this case the `created` index would be ignored, which means if there was a large number of posts returned, this query would take several seconds to execute. Also the planner would do weird things even if the result set was small and still take several seconds to execute. Currently I've papered over the problem by issuing a subquery and sorting that instead.

 ```
SELECT * FROM 
  (SELECT * FROM posts WHERE post_tsv @@ websearch_to_tsquery(?) LIMIT 10,000) q
ORDER BY created DESC
```

In short I execute the search, limit that to 10,000 rows, and then order the 10,000 rows that were returned. This worked amazingly for queries that returned fewer than 10,000 rows as those queries went from taking several seconds to run down to a handful of milliseconds. The problem is for queries with more than 10,000 rows you essentially end up with random results. I'm still not using the created index, but sorting 10,000 rows in memory is relatively fast.

I'm stuck where to go from here - what I would like, since I know I will only ever ORDER BY the created field is to build some index where the default ordering is by the created field. GIN, as I understand it, doesn't support indexing in this manner (using one of the columns as a sort field). Is there anything else I could try?