Shader programming - the puzzle piece that Cell is missing

mocad_tom

I'm currently browsing through a book that almost anticipates the AMD-ATI merger: "GPU Gems 2", edited by Matt Pharr, published in March 2005.
Amusingly, there is an nVidia logo on the front cover.
I'd like to quote a short passage (page 470):
We can be confident that CPU vendors will not stand still as GPUs incorporate more processing power and more capability onto their future chips. The ever-increasing number of transistors with each process generation may eventually lead to conflict between CPU and GPU manufacturers. Is the core of future systems the CPU, one that may eventually incorporate GPU or stream functionality on the CPU itself? Or will future systems contain a GPU at their heart with CPU functionality incorporated into the GPU? Such weighty questions will challenge the next generation of processor architects as we look toward an exciting future.

I also find the thoughts of Hiroshige Goto quite entertaining. He has been covering this topic really intensively (for several weeks it has dominated his column):

http://babelfish.altavista.com/babe...ch.impress.co.jp/docs/2006/0810/kaigai294.htm

Here is a quick look back at what Fred Weber said about Cell at the time:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2387&p=2
Cell does not have this symmetric luxury; instead, all of their cores are not equally capable and thus, in Weber's opinion, Cell requires that the software needs to know too much about its architecture to perform well. The move to a more general purpose, symmetric yet heterogeneous array of cores would require that each core on Cell must get bigger and more complex, which directly relates back to Weber (and our) first problem with Cell that it is too far ahead of its time from a manufacturing standpoint.

The hardest part of Cell is keeping the SPEs evenly loaded. You can't expect much support from the hardware; on the contrary, because of the local store you also have to take care of fetching the data yourself.
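To make the point a bit more concrete, here is a toy model in Python. It is not Cell SDK code; `run_on_cell`, `spe_run` and the tiny `LOCAL_STORE_ITEMS` capacity are names and values I made up for illustration. It only shows the two chores the programmer is left with: splitting the work across the SPEs by hand, and staging every chunk through the local store explicitly.

```python
# Toy model, NOT real Cell SDK code: the programmer splits the work across
# the SPEs and stages the data through a small local store by hand.

NUM_SPES = 8            # the Cell BE has eight SPEs
LOCAL_STORE_ITEMS = 4   # hypothetical tiny capacity, for illustration only


def spe_run(work, kernel):
    """Process one SPE's share, staging it through the local store in chunks."""
    out = []
    for start in range(0, len(work), LOCAL_STORE_ITEMS):
        local_store = list(work[start:start + LOCAL_STORE_ITEMS])  # explicit "DMA in"
        local_store = [kernel(x) for x in local_store]             # compute on local data only
        out.extend(local_store)                                    # explicit "DMA out"
    return out


def run_on_cell(data, kernel):
    """Static partitioning: each SPE gets one contiguous slice chosen up front."""
    per_spe = -(-len(data) // NUM_SPES)  # ceiling division
    result, loads = [], []
    for spe in range(NUM_SPES):
        share = data[spe * per_spe:(spe + 1) * per_spe]
        loads.append(len(share))         # how many items this SPE ended up with
        result.extend(spe_run(share, kernel))
    return result, loads


result, loads = run_on_cell(list(range(20)), lambda x: x + 1)
print(loads)
```

With 20 items the naive split leaves the last SPE with nothing to do and the one before it underloaded, and nothing in this model rebalances that for you.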

Here I find the shader-programming approach much more elegant. You are handed a framework and only have to supply the "loop body". Distributing the code across the shader units is handled by the combination of graphics driver + shader scheduler.
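As a rough sketch of what "only supply the loop body" means, here is a small Python analogy. It is not shader code and not any real graphics API: `run_kernel` is a hypothetical stand-in for driver + scheduler, and `brighten` is a made-up per-pixel function. The point is that the kernel is written for a single element and never sees how many workers exist or which one runs it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(kernel, inputs, num_units=4):
    """Hypothetical stand-in for driver + scheduler: applies `kernel` to every
    element and decides on its own which worker handles which element."""
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        # map() hands elements to free workers and returns results in input order
        return list(pool.map(kernel, inputs))


def brighten(pixel):
    """The "loop body": written for ONE element, with no knowledge of the workers."""
    r, g, b = pixel
    return (min(r + 10, 255), min(g + 10, 255), min(b + 10, 255))


image = [(0, 0, 0), (250, 100, 30), (12, 34, 56)]
print(run_kernel(brighten, image))
```

Changing `num_units` does not require touching `brighten` at all, which is the property I find attractive compared with hand-partitioning the work.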

Here are a few words about the "Ultra Threading Dispatch Processor":
http://www.digit-life.com/articles2/video/r520-part1.html
The magic box (Ultra Threading Dispatch Processor) directs the execution — it processes 512 quads simultaneously, each of them can be at a different shader execution stage. Each quad is stored together with its current status, current shader command, values of previously checked conditions (information on the current branch of a conditional jump). NVIDIA chips run quads in circle, one after another. Maximum they can do is to skip quads, which don't fall under the current branch of a condition. The R520 operates differently — our magic box constantly monitors free resources (be it texture or pixel units) and directs queued quads into free devices. If a quad fails a condition and should not be processed by this or that shader part, it will not hang about in circles, taking up room and time, together with the other quads, which need to be processed. It will just skip unnecessary commands and will not load a texture or pixel unit. If a quad waits for data from a texture unit — it will let other quads forward, which will load pixel units for this time. Thus, this approach kills two birds with one stone — it hides texture access latency and allows efficient usage of computing and texture resources when shaders with conditions and branches are executed. Efficiency of both issues depends directly on the number of quads that our magic box can process. 512 look like an imposing set (we can get textures for 4 quads and process 4 quads in pixel processors per cycle; thus up to 8 quads are processed each cycle, while the rest of the quads wait for their turn or wait for data from texture units).

This unit is undoubtedly complex and the dispatching logic for this quad set takes up a considerable part of the chip, probably comparable with texture and pixel processors. Especially as register arrays actually belong to this unit as well — there must be lots of them to store efficiently all preliminary calculations for the 512 quads in queue.
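The latency-hiding idea from the quote can be illustrated with a very small simulation. This is a toy model under my own assumptions, not a description of how the R520 actually works: there is a single issue port instead of separate texture and pixel units, `TEX_LATENCY` is an invented number, and `simulate` is a hypothetical helper. Each quad is a list of ops; with `dispatch=False` the unit stalls behind the oldest quad that still has ops to issue, with `dispatch=True` it issues any quad that is ready.

```python
TEX_LATENCY = 4  # hypothetical texture fetch latency in cycles; "alu" ops take 1


def simulate(quads, dispatch):
    """Return the cycle at which the last op of the last quad has finished.

    One op is issued per cycle at most. A quad cannot issue its next op
    before its previous op has completed.
    """
    pc = [0] * len(quads)        # index of the next op per quad
    ready_at = [0] * len(quads)  # cycle at which each quad may issue again
    cycle = 0
    while any(pc[i] < len(q) for i, q in enumerate(quads)):
        for i, q in enumerate(quads):
            unfinished = pc[i] < len(q)
            if unfinished and ready_at[i] <= cycle:
                latency = TEX_LATENCY if q[pc[i]] == "tex" else 1
                ready_at[i] = cycle + latency
                pc[i] += 1
                break            # issue port is used up for this cycle
            if unfinished and not dispatch:
                break            # in-order: stall behind the waiting quad
        cycle += 1
    return max(ready_at, default=0)


quads = [["tex", "alu"] for _ in range(4)]
print(simulate(quads, dispatch=False), simulate(quads, dispatch=True))
```

In this toy setup the four quads need 20 cycles when the unit stalls on every texture fetch and 8 cycles when waiting quads let others pass. The more quads are in flight, the more latency can be covered, which matches the quote's remark that efficiency depends on how many quads the dispatcher can track.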

I really take my hat off to the people in the driver and hardware divisions who come up with modules like this.

Regards,
Tom
 
The hardest part of Cell is keeping the SPEs evenly loaded. You can't expect much support from the hardware; on the contrary, because of the local store you also have to take care of fetching the data yourself.
Silvia Müller (responsible for the FPU of the SPEs) once said in a lecture: "With Cell, people will finally have to do real programming again."
That says it all about the developers' attitude.
They probably also think it's best to program everything in assembler. :]