# thinking-together
n
Warning: shower thoughts lie ahead. I've been struck with the nagging thought that maybe "FoC" general-purpose languages (targeting extreme accessibility) have to be designed in such a way that they can run on (modern) GPUs rather than just CPUs. Even integrated GPUs nowadays have a minimum of 200-400 general purpose cores (albeit with a slant towards certain operations).

We don't really know how to use those cores effectively for general purpose tasks because we're normally trying to program them using a C dialect (e.g. via CUDA, OpenCL, GLSL...). It's the same problem we have with multicore CPUs: writing massively-parallelizable code in an imperative language is (too) challenging, and we've known for decades that we'll need to solve the problem eventually, since parallelization is the only way to scale computation once we're building circuits out of individual atoms.

Only languages based on constructs that are implicitly parallelizable are going to be able to target GPUs effectively whilst remaining highly accessible. The alternative is to ask the user to explicitly divide their computation up into parallelizable work units (threads/actors), which is an immediate complexity trap. Programmer-led task division doesn't scale, and it's a deep rabbit hole that can require a PhD to be done effectively.

Some people might argue that parallelization is a performance optimization and that most end-user apps don't need it, but I think there are many occasions where the ceiling of what's possible is just too low to offer a bright future. There are always occasions where someone comes up with a need like "I want to process this entire spreadsheet / note collection / webpage" or "I want to make a picture" or "I want to do a simulation / animation of my idea", and they want that processing to be interactive (implying instantaneous), at which point most serial languages can't handle what's being asked for.

So, must a new generation of accessible programming languages be based on implicitly parallelizable constructs and 400 cores? The hardware APIs we need (Vulkan, WebGPU...) are finally becoming available. We just need to utilize them half-decently.
👍 4
💯 1
🤯 2
🤔 2
Non-parallelizable language constructs include call stacks/top-down recursion, and hierarchical data structures (anything based on unidirectional pointers and singular access paths).
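A rough illustration of why a single access path resists parallelization (a Python sketch; the functions are illustrative, not from any particular language proposal): a left fold chains every step on the previous one, whereas an associative combine can be regrouped into independent halves.

```python
from functools import reduce

# A left fold builds a strict dependency chain: ((((0 + x0) + x1) + x2) ...).
# Each step needs the previous accumulator, so no two steps can run at once.
def serial_sum(xs):
    return reduce(lambda acc, x: acc + x, xs, 0)

# Because + is associative, the same answer can be computed by regrouping the
# work into a balanced tree: the two halves (and their halves, and so on) are
# independent subproblems, so each level could be spread across many cores.
def tree_sum(xs):
    if len(xs) <= 1:
        return xs[0] if xs else 0
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])  # independent subproblems

assert serial_sum(range(1, 101)) == tree_sum(list(range(1, 101))) == 5050
```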
s
I agree. One problem we have is that our languages end up expressing a lot more than we need to. E.g. sequential imperative langs specify the order of execution even when it is not necessary (and then compilers get more complicated trying to unravel the data flow to optimize). Even with parallelizable constructs, I think we'd want to minimize the inter-'cell' communication for practical reasons.
👍 4
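A tiny sketch of that over-specification (the functions here are made-up stand-ins): the program text pins down an order between two calls even though neither depends on the other, so a compiler has to rediscover the data flow to know they could run in parallel.

```python
def compute_totals(sales):            # hypothetical stand-in for real work
    return sum(sales)

def normalize_labels(names):          # hypothetical stand-in for real work
    return [n.strip().lower() for n in names]

# The text fixes one call before the other...
totals = compute_totals([3, 5, 9])
labels = normalize_labels([" Alpha", "Beta "])
report = (totals, labels)

# ...but neither call reads anything the other writes, so the ordering is
# incidental: only the data flow into `report` matters. A dataflow-oriented
# language would leave the two free to run in either order, or at once.
print(report)
```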
n
Yeah, sequences of instructions are a non-starter. I'm skeptical of the concept of "cells" too, though I'm not sure what you're thinking of. Cells = objects = actors (executing independently) are a form of explicit parallelization, even if you can justify them as representing parts of the problem domain (as OOP has always tried to do).
w
I'll agree, but maybe for a funny reason. At the level of most FoC projects, the execution model of the hardware isn't of primary concern. Most of us aren't aiming for portable assembly, but rather something confluent with people's ways of thinking about problems. Many things we want to express are non-strict: you want the computer to help you figure something out, but you don't care what order it computes things in, so long as it remains responsive while working.
👍 1
n
My perspective is: you can't be confluent with someone's way of thinking if you're asking them to express their thinking in terms of 400 parallelizable units (in order for their idea to be feasible to execute).
w
On the other hand, there are domains where you are describing a step-by-step process, be it card games or an assembly line. However, in these cases, the domain-specific imperative steps probably should not line up with some sort of CPU-threaded execution mechanism.
@Nick Smith the kinds of things that I've done which parallelize to 400 units are all of the form "check all the combinations and tell me the best fit." And you don't even want the computer to check all the combinations, because there are way too many.
I mean to actually check them all.
s
@Nick Smith Why do you see the solution to this problem at the language design and not on the library level?
n
@Stefan What’s the difference between a library and a language? If the library is offering parallel computation, then the library will need to offer a language (API) to express the computation. Libraries don’t help solve the problem.
Also, libraries are usually designed to solve a specific problem (e.g. graphics, or matrix multiplication). If you want your arbitrary domain problems to be parallelised, you need to express them in terms of more general constructs. Those constructs are something a programming language is supposed to provide.
s
Hmm… not sure I can follow. How can you solve an arbitrary problem with parallelization? Wouldn’t you have to know enough about it so it becomes domain-specific? At least specific enough to know if it makes sense to run it on GPU cores such that the setup costs are amortized?
There are libraries today that do exactly that transparently based on context like the amount of data to be crunched, so you as the developer don’t have to make that distinction… so I guess my question is — slightly reworded — what benefits would a language offer over a library for an already existing language? It seems a lot of work to invent a new language if you can just import a library…
n
If you run a whole app on the GPU then setup costs are less of a problem (from what I know). Setup costs are a problem when you try and do some tasks on the GPU and some on the CPU, and they have to constantly communicate. Setup costs are also higher in outdated APIs like OpenGL, and very low in newer APIs since the GPU program is compiled at app startup rather than mid-frame.
s
Or at program compile time even. So do I understand you correctly that you want a programming language that does everything on the GPU?
👍 1
n
To have everyday computations be massively parallelizable they firstly need to be large enough that it matters. If your app is just a digital clock (incrementing a counter), then yeah, it can't be parallelized and it doesn't need to be. Once you have a computation large enough to be worth parallelizing, then automatic parallelization depends on having the problem expressed in a form that does not introduce artificial sequentiality (i.e. avoids instruction sequences, top-down recursion, and hierarchical data structures). What these constructs should be is an open problem. I have a hunch that relational/logic languages provide a good foundation and we need some novel ideas atop that foundation. That's what I've been focusing on for the last six months. I'm hoping to be able to share more at some point, if I find a promising path.
Yes, I want to consider a language that does everything on the GPU that is possible with today's hardware.
s
> What these constructs should be is an open problem.
Sounds like the distinction between applicative functors and monads might be relevant to your investigations.
n
I've been down the route of FP, and I know about those concepts, but I don't think they're applicable in any substantial way (pun intended).
s
I was more talking about math than FP here, but it seems that might just make it even less attractive to you… Regardless, there are (imperative) languages that take advantage of the programmer specifying something as applicative, i.e. "sequence of operations not important", to run them in parallel. And at the call site it looks just like your usual map over an array, just that it runs faster. On the other hand, I do think a functional language like Haskell is a good example to see what design questions are raised when a language restricts specifying operations where order is important in general, and only allows that with additional effort in the form of monads and syntactic sugar like do-notation. I'm genuinely interested in what you are thinking about and look forward to learning more about what you've been working on.
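For example, a sketch of that call-site shape in Python (ProcessPoolExecutor here stands in for the kind of runtime described above; the function and numbers are arbitrary): the parallel version reads like an ordinary map, and the independence of the elements is what licenses running it across cores.

```python
from concurrent.futures import ProcessPoolExecutor

def expensive(x):
    # Stand-in for per-element work that doesn't depend on any other element.
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    data = list(range(20_000, 20_100))

    # Ordinary map: nothing at the call site says the elements are independent.
    serial = list(map(expensive, data))

    # Same shape of call, but because each element is handled independently,
    # the runtime is free to fan the work out across cores.
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(expensive, data))

    assert serial == parallel
```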
n
It's definitely necessary to exploit associativity and commutativity of operations for parallelization, which I think covers what you're referring to.
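A small sketch of why associativity is the load-bearing property (plain Python, not tied to any particular language): partial results computed over independent chunks can only be recombined safely when regrouping doesn't change the answer.

```python
from functools import reduce
import operator

def chunked_reduce(op, xs, chunk=4):
    # Safe only when `op` is associative: regrouping must not change the result.
    chunks = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    partials = [reduce(op, c) for c in chunks]   # each chunk is independent work
    return reduce(op, partials)                  # combine the partial results

xs = list(range(1, 17))
assert chunked_reduce(operator.add, xs) == sum(xs)                    # associative
assert chunked_reduce(operator.mul, xs) == reduce(operator.mul, xs)   # associative
assert chunked_reduce(operator.sub, xs) != reduce(operator.sub, xs)   # not associative
```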
Haskell isn't really orderless. It's built atop a foundation of recursion and hierarchical data structures. Even if its operations aren't specified as a strict sequence (a total order), those constructs impose (inessential) order.
Thank you 🙂 I'll be reporting progress when I think I have a story to tell! (will try #C0120A3L30R at some point too).
👍 3
d
In Onex there are two forms of parallelism that can be exploited. There's the coarse-grained parallelism that you'd probably understand as the actors or agents or live objects. And there's the finer-grained term reduction familiar to FP folk, where you can reduce a tree by rewriting many branches at once. Neither of which the programmer needs to be conscious of! I mean, they will get the sense that the coarse objects seem to have their own animation, but that's what they'd expect, given that that's how reality also works!
So to return to your point, Onex programs can be parallelised without end user involvement, including on GPUs
Oh, and I use Vulkan but haven't got a clue what I'm doing.. 😄
n
Have you actually got it (the whole language) running on GPUs? GPUs don't handle tree-like data very well, from what I know.
d
Soz I meant I use Vulkan for the UI so in theory it would be easy to use the API for processing but that's a long way off..
n
Seems like a stretch to say that you can compute on GPUs then 🙂. Let me know if you make progress in that direction.
w
Friends, let's be clear about the old difference between parallelism (doing two things at once) and concurrency (coordinating activities that are happening at about the same time).
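A quick Python illustration of that distinction (a process pool for parallelism, asyncio for concurrency; the workloads are toy stand-ins):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # Parallelism: independent number-crunching genuinely running at the same
    # time on different CPU cores.
    return sum(i * i for i in range(n))

async def fetch(name, delay):
    # Concurrency: coordinating activities that overlap in time (here, waits),
    # even if only one of them is actually executing at any given instant.
    await asyncio.sleep(delay)
    return name

async def main():
    results = await asyncio.gather(fetch("a", 0.2), fetch("b", 0.1))
    print(results)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(crunch, [200_000, 300_000, 400_000])))
    asyncio.run(main())
```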
d
There's been a lot of activity over the past few years in defining high-level GPU languages that address these issues.
• co-dfns is a data-parallel and functional dialect of APL that runs on GPUs. What blew my mind is that the compiler itself is written in co-dfns, and so you can compile your co-dfns programs into GPU code using the GPU. Prior to seeing this, I did not think that a compiler was the kind of program suitable for parallel execution on a GPU. Turns out that the choice of data structures is very important.
• Taichi is a DSL for defining fixed-height hierarchical data structures (which behave like sparse arrays). Thousands of lines of CUDA that only one guy in your organization understands can be replaced by tens of lines of Taichi code. It's very instructive to look at the GPU-specific data types and compiler optimizations exposed by the Taichi language.
• TensorFlow is implemented as a library that you call from C++ or Python. Using this library, you construct what is essentially a parse tree for a program using APL-like data-parallel operations. The library compiles this parse tree into GPU code and executes it (see the sketch below).
Stefan asked why this is a language issue, not a library issue. Well, it's easier to write code directly in a language than to use a library interface that consumes source code written in another language and compiles it.
🍰 3
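Roughly what that user-facing TensorFlow flow looks like (a sketch assuming TensorFlow 2.x, where tf.function traces the calls into the graph that gets compiled and scheduled onto the GPU, rather than the older explicit graph-building API):

```python
import tensorflow as tf  # assumes TensorFlow 2.x is installed

# The computation is written as whole-array ("APL-like") operations.
# tf.function traces them into a graph, which TensorFlow can fuse and
# schedule onto whatever device is available (a GPU if one is present).
@tf.function
def step(a, b):
    c = tf.matmul(a, b)                     # data-parallel across all outputs
    return tf.reduce_sum(tf.nn.relu(c))     # data-parallel elementwise op + reduction

a = tf.random.normal([1024, 1024])
b = tf.random.normal([1024, 1024])
print(step(a, b).numpy())
```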
My Curv language already compiles into GPU code, but it's not general enough to do everything I want. Support for hierarchical data structures is the next big thing (along with the ability to automatically generate compute shader pipelines). The use case will involve hundreds of cores traversing the same hierarchical data structure in parallel.
🍰 1
s
@Doug Moen co-dfns looks super interesting — thanks for sharing that! I hadn’t heard about it, although Raph Levien’s A taste of GPU compute is sitting at the top spot of my to-watch list and it seems I would’ve picked it up there soon… fascinating. I’d love to dig deeper into the data structures part — why is it that every project you’re currently not working on looks more attractive than the one you are working on?
I also have a feeling that the C++ superset approach used for Apple’s Metal Shading Language isn’t the end of that story…
🤔 1
j
Database folks have been attacking sql-on-gpu for a long time but there are no real successes yet. The core problem at the moment is memory bandwidth and latency. Getting data in and out of gpus is slow, and gpus aren't very good at branchy workloads so you inevitably need to do some stuff on the cpu. Often the result is that actual query execution is somewhat faster but the speedup is dwarfed by the time spent fetching the data.
Looking at games gives a good idea of the division of work - game programmers are among the most experienced at writing gpu-friendly code but they still typically choose to put game logic, AI, pathing etc on the cpu. Also modern cpus are actually pretty wide if you write code that is friendly to out-of-order execution and memory pre-fetching, but modern high-level languages go almost out of their way to be hostile to both.
My bet is that designing a language to reduce false data dependencies and allow for more sequential memory access is a more viable target than designing a general purpose language for the gpu. This might change though in the next decade as new gpu designs offer to share general memory with the cpu, so switching back and forth becomes more practical.
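A back-of-envelope version of that transfer-vs-compute gap (the figures are assumed round numbers, not measurements: ~16 GB/s for a PCIe 3.0 x16 link, ~400 GB/s for discrete-GPU memory):

```python
data_gb = 2.0            # data the query needs to touch
pcie_gb_s = 16.0         # assumed host-to-GPU transfer bandwidth
gpu_mem_gb_s = 400.0     # assumed on-card memory bandwidth

transfer_ms = data_gb / pcie_gb_s * 1000      # ~125 ms just to copy the data over
one_pass_ms = data_gb / gpu_mem_gb_s * 1000   # ~5 ms for one streaming pass on-card

print(f"copy = {transfer_ms:.0f} ms, one GPU pass = {one_pass_ms:.0f} ms")
# The copy costs roughly 25x more than the single pass it enables, so a query
# that only scans the data once is usually better off staying on the CPU.
```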
s
One thing is that many programming models that are GPU-friendly also run significantly faster on CPUs
Game code runs on CPUs, sure, but if you look at what Unity is doing with DOTS, or at ISPC, you'll see that on the CPU you can get orders of magnitude better performance. Most game code doesn't run on the GPU because the GPU is budgeted for graphics work, but more and more graphics (culling, sorting) and graphics-adjacent (animation, VFX) work that traditionally happened on the CPU is happening on the GPU
g
the co-dfns author did a workshop on how he made trees parallel by construction that i will probably have to digest for a full year:

https://youtu.be/lc4IjR1iJTg

n
@Doug Moen Thanks for all those links! You seem like you know what's going on in the GPU space, I'll have to tap your brain 🙂. And wow, I've never looked into Tensorflow because my eyes glaze over when people start talking about machine learning. I didn't realise it was a more general platform. Time to pay attention!
@jamii Couldn't the fact that game programmers still do a lot of work on the CPU be explained by the fact that:
• It's hard to write massively parallel code in traditional languages
• GPUs are very hard to program, especially with pre-2016 APIs
• Not all work is slow enough to benefit from extensive parallelization
• Everyone thinks CPUs are "normal" and GPUs are "special"
None of those factors imply that GPUs would be a performance regression, they just imply that CPUs are the easy or "good enough" path. I mean... I've made some small video games before, and I never decided to use a CPU because a GPU wouldn't work. It's more that CPUs are the default choice.
On the other hand, I can see how an SQL database would be a problem on the GPU since you have to shuffle a ton of data around and disk/memory speed would limit any potential gains. I'm not personally interested in "big data" apps thankfully, so I'm happy to presume my language only works with <~2 GB workloads.
And yeah, I think "the future" is integration of CPU and GPU cores such that there is no perceivable communication overhead. It might be reasonable to take that as an assumption, if helpful.
d
@Nick Smith
> I think "the future" is integration of CPU and GPU cores such that there is no perceivable communication overhead.
So AMD has been pushing this idea for 8 years with their HSA architecture. If it is such a great idea, why are Intel and nVidia not doing it? An honest question, I know very little about HSA or about any engineering tradeoffs that might be involved. https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture
n
My first guess would be Nvidia doesn’t see itself as a CPU company and Intel doesn’t see itself as a GPU company. They’re focused on their own niches. AMD has an equal focus on both technologies.
Of course all companies have crossed into both technologies, but not deeply.
s
@Nick Smith see my response above; none of those things are true for the current state of the AAA game industry. Game engines are using (less) massively parallel code on the CPU by default, the main reason now is resource allocation, and once again many workloads that were CPU-only are moving to the GPU now
sure there is a lot of game code that is still single threaded
Also one thing about the memory issue is game consoles already almost universally have unified memory
and have had it for years
On consoles there often is very little communication overhead; consoles use AMD (or Nvidia with the Switch) SoCs. There are still many reasons why you can't or don't want the CPU and GPU operating on the exact same data, and you don't want the CPU waiting for the GPU or vice versa. The GPU and CPU don't share caches, the GPU prefers large pipelined workloads to hide memory latency and wants to saturate bandwidth, etc., but many of those same rules apply to any multiprocessor system and are valid if you want good performance with multithreaded CPU code
@Nick Smith Intel is working on a discrete GPU (that will actually ship, I think), so maybe that will change. I think Intel is more of a GPU company than Nvidia is a CPU company
There is also this idea that everyone has an integrated GPU sitting in their Intel CPU, and that thing is often idle if a discrete GPU is active
and sure, it's under-powered, but it still has some compute capability that's non-trivial
it's under-powered compared to high-end discrete GPUs
also integrated GPUs on PC are UMA
d
I haven't seen anyone mention Halide ... I don't know how to build a programming language where massive parallelism is easy, but Halide is the obvious starting point. https://halide-lang.org/
To a large extent I think the ball is in the hardware people's court. I remember seeing a proposal for a CPU architecture, can't remember where I saw it or what it is called... it was similar to SIMD but instead of the concept being "provide a bunch of instructions and hope developers use them" it was "run arbitrary C loops in parallel" - it was an architecture designed specifically to allow the vast majority of loops to "implicitly" run in parallel (meaning, the compiler would have to emit "vector" instructions as in standard SIMD, but the instructions themselves were more powerful than standard SIMD, enabling most loops to be automatically parallelized instead of the status quo where only a fraction of all loops can be automatically converted to SIMD form.) So, like, this needs to be a standard.
Meanwhile on the GPU side, there is a large physical distance and slow bus separating it from the CPU, as well as a separate memory pool... and GPUs are bad at running code that is serial in nature. If AMD/NVIDIA can come up with a hybrid architecture that is capable of running both parallel code and mostly-serial code efficiently, then it will become possible to "just compile your code for the GPU" (or in case of JIT languages, "just flip a switch and it runs on the GPU"), and then the GPU will be a more popular target.
d
This is a long thread I can't read through yet, so I risk repeating ideas, but here are my thoughts:
There are many projects here that consider using a DAG representation for code or logic. That's something that can be automatically parallelized (where the only blocks are dependencies). Except that each branch is probably not doing the same thing, so not SIMD.
Most loops can be replaced with map / reduce / filter / join / etc. (this is becoming a big thing in JavaScript, or LINQ in C#). If we keep heading that way, a lot of that stuff can be replaced with SIMD. Might not be possible wherever there are side effects, but some might be reducible to a formula?
For things that are sequential, maybe a model more like fields + particles, like Alan Kay has suggested before. Maybe that's actors all over again though? But maybe there's a way to make that easier to deal with than in traditional general purpose languages.
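A minimal example of that loop-to-map/filter/reduce rewrite (plain Python; the data is made up): the per-element work becomes order-free, and only the final associative combine remains to be scheduled.

```python
from functools import reduce

orders = [{"qty": 3, "price": 9.5}, {"qty": 1, "price": 120.0}, {"qty": 7, "price": 2.25}]

# Loop version: one accumulator threaded through every iteration, in order.
total = 0.0
for o in orders:
    if o["qty"] >= 2:
        total += o["qty"] * o["price"]

# map/filter/reduce version: each element's contribution is independent, and
# only the final (associative) sum imposes any structure, which is exactly the
# shape that SIMD units and GPU cores handle well.
qualifying = filter(lambda o: o["qty"] >= 2, orders)
subtotals = map(lambda o: o["qty"] * o["price"], qualifying)
total2 = reduce(lambda a, b: a + b, subtotals, 0.0)

assert total == total2
```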
n
@Scott Anderson I just stalked your LinkedIn so I believe you know what you're talking about 🙂. It's good to hear that AAA studios are taking GPU compute seriously.
d
@Dan Cook Well, Parallel LINQ (parallel map / reduce / filter / join / etc for C#) has been available for many years, but I don't generally feel like just using it all the time instead of standard serial LINQ. The problem is that often we're operating on a small collection, and the overhead of coordinating the behavior of even two threads is not worthwhile if the collection is smaller than, I don't know, 100 elements or something.
Throughout my career I've spent most of my time dealing with many small collections with less than 5 or 50 elements each, and when I'm dealing with medium-size collections of 100-1000 elements, it's often a hashtable, sorted list or array that I'm using as a lookup table in a bigger calculation, rather than something I'm directly using with map/filter/reduce. (I do also process larger lists, it's just that the fraction of code doing highly parallelizable work is small.)
Sometimes one can rearrange lots of small collections into big arrays to allow more parallelism, but sometimes I can't think of a way to parallelize a lot of this work, e.g. algorithms on tree structures like Loyc trees (code) don't seem parallelizable, generally, although I did design my LES language intentionally to have a "context-insensitive" syntax, so at least you can parse all your files in parallel.
For these kinds of workloads I really want hardware like I described, which can parallelize short loops efficiently.
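A sketch of that size-threshold idea in Python (PLINQ itself is C#; the helper name and the 10,000 cutoff here are made up, and the right cutoff depends on per-element cost and the machine): small collections never pay the coordination overhead, large ones fan out.

```python
from concurrent.futures import ProcessPoolExecutor

PARALLEL_THRESHOLD = 10_000   # assumed cutoff; tune per workload and machine

def smart_map(fn, items):
    """Map fn over items, but only pay parallel coordination costs when the
    collection is big enough for them to be worth it."""
    items = list(items)
    if len(items) < PARALLEL_THRESHOLD:
        return [fn(x) for x in items]                     # small: stay serial
    with ProcessPoolExecutor() as pool:                   # large: fan out across cores
        return list(pool.map(fn, items, chunksize=1_000))
```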