Thread: FPGA optimization ...

FPGA optimization ...

From
Gunther
Date:

The time has come.

FPGA optimization is in the palm of our hands (literally: a 2 TB, 40 GB/s I/O PostgreSQL server fits into less than a shoe box), and it is available on Amazon AWS F1 instances.

Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.

But a lot more could be done. How about linear sort performance at O(N)? https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, where the following four sorting accelerators are used:

  • Tuple Sorter : Sorts an N-tuple using a sorting network.
  • Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples in on-chip memory.
  • Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
  • Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in flash.
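(As an aside, the Tuple Sorter stage relies on a sorting network: a fixed, data-independent sequence of compare-and-swap operations, which is exactly what pipelines well on an FPGA. Here is a toy software sketch of the idea; it uses a naive bubble-style network for clarity, whereas real designs would use Batcher odd-even or bitonic networks with far fewer stages.)

```python
# Toy sketch of a fixed compare-and-swap sorting network, the kind of
# structure a "Tuple Sorter" maps onto FPGA logic. Bubble-style network
# for clarity; real hardware would use Batcher/bitonic networks.

def build_network(n):
    """Fixed list of (i, j) compare-and-swap pairs that sorts any n inputs."""
    return [(i, i + 1) for _round in range(n) for i in range(n - 1)]

def run_network(values, network):
    vals = list(values)
    for i, j in network:  # data-independent: the same "wires" every time
        if vals[i] > vals[j]:
            vals[i], vals[j] = vals[j], vals[i]
    return vals

net = build_network(8)
print(run_network([5, 3, 8, 1, 9, 2, 7, 4], net))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Because the sequence of comparisons never depends on the data, every stage can run as a hardware pipeline at line rate, one tuple per clock.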

Order of magnitude speed improvements? Better than Hadoop clusters on a single chip? Massive full table scans at 40 GB/s I/O throughput, blazing fast sort-merge joins? Here it is. Is anybody working further on this? It should be an ideal project for a student or a group of students.

Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax exempt? I can donate and possibly find some people at Purdue University who might take this on. Interest?

regards,
-Gunther

Re: FPGA optimization ...

From
Tomas Vondra
Date:
On Mon, Nov 04, 2019 at 06:33:15PM -0500, Gunther wrote:
>The time has come.
>
>FPGA optimization is in the palm of our hands (literally: a 2 TB, 40
>GB/s I/O PostgreSQL server fits into less than a shoe box), and it is
>available on Amazon AWS F1 instances.
>
>Some demos are beginning to exist: https://github.com/Xilinx/data-analytics.
>
>But a lot more could be done. How about linear sort performance at
>O(N)? https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/.
>And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf,
>where the following four sorting accelerators are used:
>
> * Tuple Sorter : Sorts an N-tuple using a sorting network.
> * Page Sorter : Sorts an 8KB (a flash page) chunk of sorted N-tuples
>   in on-chip memory.
> * Super-Page Sorter : Sorts 16 8K-32MB sorted chunks in DRAM.
> * Storage-to-Storage Sorter: Sorts 16 512MB or larger sorted chunks in
>   flash.
>
>Order of magnitude speed improvements? Better than Hadoop clusters on
>a single chip? Massive full table scans at 40 GB/s I/O throughput,
>blazing fast sort-merge joins? Here it is. Is anybody working further
>on this? It should be an ideal project for a student or a group of
>students.
>

For the record, this is not exactly a new thing. Netezza (a PostgreSQL
fork started in 1999, later acquired by IBM) used FPGAs. Now there's
Swarm64 [1], another PostgreSQL fork, also using FPGAs with newer
PostgreSQL releases.

Those are proprietary forks, though. The main reason why the community
itself is not working on this directly (at least not on pgsql-hackers)
is exactly that it requires specialized hardware, which the developers
probably don't have, making development impossible. Regular customers
are not asking for it either, one of the reasons being the limited
availability of such hardware, especially for customers running in the
cloud who are not even able to deploy custom appliances.

I don't think this will change unless access to systems with FPGAs
becomes much easier (e.g. if AWS introduces such an instance type).

>Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
>exempt? I can donate and possibly find some people at Purdue 
>University who might take this on. Interest?
>

I don't think there's any such non-profit managing/funding development.
At least I'm not aware of one. There are various non-profits around the
world, but those organize events and local communities.

I'd say the best way to do something like this is to either talk to one
of the companies participating in PostgreSQL development (pgsql-hackers
is probably a good starting point), or - if you absolutely need to go
through a non-profit - approach a university (which does not mean people
from pgsql-hackers can't be involved, of course). I've been involved in
a couple of such research projects in Europe, but I'm not sure exactly
what the situation/rules are in the US.

regards

[1] https://swarm64.com/netezza-replacement/

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: FPGA optimization ...

From
Gunther
Date:
Hi Tomas, you said:

> For the record, this is not exactly a new thing. Netezza (a PostgreSQL
> fork started in 1999, later acquired by IBM) used FPGAs. Now there's
> Swarm64 [1], another PostgreSQL fork, also using FPGAs with newer
> PostgreSQL releases.

Yes, I found the Swarm64 thing on Google, and heard about Netezza years
ago from an Indian consulting contractor who had worked on it (their
price point was way out of the range that made sense for the academic
place where I worked then).

But there is good news, better than you thought when you wrote:

> Those are proprietary forks, though. The main reason why the community
> itself is not working on this directly (at least not on pgsql-hackers)
> is exactly that it requires specialized hardware, which the developers
> probably don't have, making development impossible. Regular customers
> are not asking for it either, one of the reasons being the limited
> availability of such hardware, especially for customers running in the
> cloud who are not even able to deploy custom appliances.
>
> I don't think this will change unless access to systems with FPGAs
> becomes much easier (e.g. if AWS introduces such an instance type).

It already has changed! Amazon F1 instances. And Xilinx has already
packaged a demo: https://aws.amazon.com/marketplace/pp/B07BVSZL51. This
demo appears very limited though (only for TPC-H queries 6 and 12 or so).

Even the hardware to hold in your hand is now much cheaper. I know a guy
who's marketing a board with 40 GB/s throughput. I don't have a price,
but I can't imagine the board plus a 1 TB disk to be much outside of
US$ 2k. I could sponsor that if someone wants to have a serious shot at it.

>> Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax 
>> exempt? I can donate and possibly find some people at Purdue 
>> University who might take this on. Interest?
>>
>
> I don't think there's any such non-profit managing/funding development.
> At least I'm not aware of one. There are various non-profits around the
> world, but those organize events and local communities.
>
> I'd say the best way to do something like this is to either talk to one
> of the companies participating in PostgreSQL development (pgsql-hackers
> is probably a good starting point), or - if you absolutely need to go
> through a non-profit - approach a university (which does not mean people
> from pgsql-hackers can't be involved, of course). I've been involved in
> a couple of such research projects in Europe, but I'm not sure exactly
> what the situation/rules are in the US.

Yes, I might work with a university directly, although I will also
contact the PostgreSQL foundation in the US.

regards,
-Gunther




Re: FPGA optimization ...

From
AJG
Date:
From what I have read and the benchmarks I have seen:

FPGAs shine for writes (and show up to 3x real-world improvement for
queries, as opposed to the claimed 10x, from memory).

GPUs shine and outperform FPGAs for reads. There is a very recent and
interesting academic paper [1] on a high-performance GPU B-tree (vs. LSM)
and the incredible performance it gets, but I 'think' it requires NVIDIA
hardware (so no easy/super EPYC+GPU+HBM on-chip combo solution then ;) ).

Won't both FPGA and GPU require changing the executor from pull to push
to get real benefits from them? Isn't that something Andres is working on
(pull to push)?

What is really exciting is UPMEM (little 500 MHz processors on the
memory); the cost will be little more than the memory cost itself, and it
shows up to 20x performance improvement on things like index search (from
memory). It's a C library, and they claim it only needs a few hundred
lines of code to integrate (from memory), but it's not clear to me what
use cases it can be used for other than the ones they show benchmarks for.


[1] https://escholarship.org/content/qt1ph2x5td/qt1ph2x5td.pdf?t=pkvkdm



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-performance-f2050081.html



Re: FPGA optimization ...

From
Tomas Vondra
Date:
On Wed, Nov 06, 2019 at 11:01:37AM -0700, AJG wrote:
>From what I have read and the benchmarks I have seen:
>
>FPGAs shine for writes (and show up to 3x real-world improvement for
>queries, as opposed to the claimed 10x, from memory).
>
>GPUs shine and outperform FPGAs for reads. There is a very recent and
>interesting academic paper [1] on a high-performance GPU B-tree (vs. LSM)
>and the incredible performance it gets, but I 'think' it requires NVIDIA
>hardware (so no easy/super EPYC+GPU+HBM on-chip combo solution then ;) ).
>
>Won't both FPGA and GPU require changing the executor from pull to push
>to get real benefits from them? Isn't that something Andres is working on
>(pull to push)?
>

I think it very much depends on how the FPGA/GPU/... is used.

If we're only talking about FPGA I/O acceleration, essentially an FPGA
between the database and storage, it's likely possible to get that
working without any extensive executor changes. Essentially create an
FPGA-aware variant of SeqScan and you're done. Or an FPGA-aware
tuplesort, or something like that. Neither of these should require
significant planner/executor changes, except for costing.

>What is really exciting is UPMEM (little 500 MHz processors on the
>memory); the cost will be little more than the memory cost itself, and it
>shows up to 20x performance improvement on things like index search (from
>memory). It's a C library, and they claim it only needs a few hundred
>lines of code to integrate (from memory), but it's not clear to me what
>use cases it can be used for other than the ones they show benchmarks for.
>

Interesting, and perhaps interesting for in-memory databases.

>
>[1] https://escholarship.org/content/qt1ph2x5td/qt1ph2x5td.pdf?t=pkvkdm

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: FPGA optimization ...

From
Andres Freund
Date:
Hi,

On 2019-11-06 22:54:48 +0100, Tomas Vondra wrote:
> If we're only talking about FPGA I/O acceleration, essentially FPGA
> between the database and storage, it's likely possible to get that
> working without any extensive executor changes. Essentially create an
> FPGA-aware variant of SeqScan and you're done. Or an FPGA-aware
> tuplesort, or something like that. Neither of this should require
> significant planner/executor changes, except for costing.

I doubt that that is true.  For one, you need to teach the FPGA to
understand at least enough about the intricacies of the postgres storage
format to be able to make enough sense of visibility information to
know when it is safe to look at a tuple (you can't evaluate quals before
visibility information). It also needs to be fed a lot of information
about the layout of the table, the operators involved, etc.  And even if
you define those away somehow, you still need to make sure that the
on-disk state is coherent with the in-memory state - which definitely
requires reaching outside of just a replacement seqscan node.
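(To make the visibility point concrete, here is a heavily simplified toy sketch; the real rules live in HeapTupleSatisfiesMVCC and involve command IDs, hint bits, subtransactions, and more. All names below are invented for illustration.)

```python
# Toy illustration of why qual evaluation must follow visibility checks:
# a device that evaluates quals on raw tuples would return rows that are
# deleted, or not yet committed, for the current snapshot.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    xmin: int             # xact that inserted this row version
    xmax: Optional[int]   # xact that deleted it, if any
    value: int

def visible(t, snapshot_xid, in_progress):
    """Simplified: visible if inserted by a committed xact before our
    snapshot, and not deleted by a xact we can see."""
    if t.xmin >= snapshot_xid or t.xmin in in_progress:
        return False      # inserter not visible to us
    if t.xmax is not None and t.xmax < snapshot_xid and t.xmax not in in_progress:
        return False      # deleted by a xact we can see
    return True

heap = [
    HeapTuple(xmin=100, xmax=None, value=1),  # old committed row
    HeapTuple(xmin=100, xmax=150,  value=2),  # deleted before our snapshot
    HeapTuple(xmin=300, xmax=None, value=3),  # inserted after our snapshot
]

# The qual "value > 0" on raw tuples would match all three row versions;
# filtering by visibility first correctly returns only the first.
result = [t.value for t in heap
          if visible(t, snapshot_xid=200, in_progress=set()) and t.value > 0]
print(result)  # [1]
```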

I've a hard time believing that, even though some storage vendors are
pushing this model heavily, the approach of performing qual evaluation
on the storage level is actually useful for anything close to a general
purpose database, especially a row store.

It's more realistic to have a model where the FPGA is fed pre-processed
data, and it streams out the processed results. That way there are no
problems with coherency, one can transparently handle the parts of
reading the data that the FPGA can't, etc.
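(A rough sketch of that model, with the names and the capability set invented purely for illustration: the host streams pre-processed batches, offloads only predicates the engine supports, and falls back to the CPU for the rest, so coherency is never an issue because the accelerator only sees data the host handed it.)

```python
# Sketch of the "feed the accelerator pre-processed data" model with a
# transparent CPU fallback. OFFLOADABLE_OPS and offload_filter() are
# hypothetical stand-ins for a real device's capabilities.
OFFLOADABLE_OPS = {">", "<", "="}

def offload_filter(batch, op, const):
    """Stand-in for the accelerator: simple comparisons over a batch."""
    cmp = {">": lambda v: v > const,
           "<": lambda v: v < const,
           "=": lambda v: v == const}[op]
    return [v for v in batch if cmp(v)]

def cpu_filter(batch, pred):
    return [v for v in batch if pred(v)]

def scan(batches, op, const, pred):
    for batch in batches:              # host streams pre-processed batches
        if op in OFFLOADABLE_OPS:
            yield from offload_filter(batch, op, const)
        else:                          # transparent CPU fallback
            yield from cpu_filter(batch, pred)

batches = [[1, 5, 9], [12, 3, 7]]
print(list(scan(batches, ">", 4, lambda v: v > 4)))  # [5, 9, 12, 7]
```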


But I admit I'm sceptical that even the above model is relevant for
postgres. The potential market seems likely to stay small, and there's
so much more performance work that's applicable to everyone using PG,
even without access to special-purpose hardware.

Greetings,

Andres Freund



Re: FPGA optimization ...

From
Tomas Vondra
Date:
On Wed, Nov 06, 2019 at 03:15:53PM -0800, Andres Freund wrote:
>Hi,
>
>On 2019-11-06 22:54:48 +0100, Tomas Vondra wrote:
>> If we're only talking about FPGA I/O acceleration, essentially an FPGA
>> between the database and storage, it's likely possible to get that
>> working without any extensive executor changes. Essentially create an
>> FPGA-aware variant of SeqScan and you're done. Or an FPGA-aware
>> tuplesort, or something like that. Neither of these should require
>> significant planner/executor changes, except for costing.
>
>I doubt that that is true.  For one, you need to teach the FPGA to
>understand at least enough about the intricacies of the postgres storage
>format to be able to make enough sense of visibility information to
>know when it is safe to look at a tuple (you can't evaluate quals before
>visibility information). It also needs to be fed a lot of information
>about the layout of the table, the operators involved, etc.  And even if
>you define those away somehow, you still need to make sure that the
>on-disk state is coherent with the in-memory state - which definitely
>requires reaching outside of just a replacement seqscan node.
>

That's true, of course - the new node would have to know a lot of
details about the on-disk format, the meaning of operators, etc. Not
trivial, that's for sure. (I think PGStrom does this.)

What I had in mind were extensive changes to how the executor works in
general, because the OP mentioned changing the executor from pull to
push, or abandoning the iterative executor design. And I think that
would not be necessary ...

>I've a hard time believing that, even though some storage vendors are
>pushing this model heavily, the approach of performing qual evaluation
>on the storage level is actually useful for anything close to a general
>purpose database, especially a row store.
>

I agree with this too - it's unlikely to be a huge win for "regular"
workloads; it's usually aimed at (some) analytical workloads.

And yes, a row store is not the most efficient format for this type of
accelerator (I don't have much experience with FPGAs, but for GPUs it's
very inefficient).

>It's more realistic to have a model where the FPGA is fed pre-processed
>data, and it streams out the processed results. That way there are no
>problems with coherency, one can transparently handle the parts of
>reading the data that the FPGA can't, etc.
>

Well, the whole idea is that the FPGA does a lot of "simple" filtering
before the data even gets into RAM / the CPU, etc. So I don't think this
model would perform well - I assume the "processing" necessary could
easily be more expensive than the gains.

>
>But I admit I'm sceptical that even the above model is relevant for
>postgres. The potential market seems likely to stay small, and there's
>so much more performance work that's applicable to everyone using PG,
>even without access to special-purpose hardware.
>

Not sure. It certainly is irrelevant for everyone who does not have
access to systems with FPGAs, and useful only for some workloads. How
large the market is, I don't know.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services