[K42-discussion] RE: Possible benefit of running K42 on Cell CBE ...

Elvis John Dowson elvis_dowson at hotmail.com
Fri Dec 23 16:57:57 EST 2005


Hi Jimi,
          I'm going to try to answer some of your queries, but it's going to
be really tricky with all these indents :-) !! 

See my answers below .. 

> -----Original Message-----
> From: Jimi Xenidis [mailto:jimix at watson.ibm.com]
> Sent: Wednesday, December 21, 2005 6:59 PM
> To: Elvis John Dowson
> Cc: Andrew Baumann; IBM Research K42 Discussion Forum; Orran Krieger
> Subject: Re: Possible benefit of running K42 on Cell CBE ...
> 
> Ok, I, like Orran, do not want to turn away anyone working on k42,
> especially with Cell.
> Getting K42 to run on the PPE of a Cell processor should be a "cake
> walk".
> 
> However..
> 
> Using SPEs on k42 would require the _not_so_simple_ matter of
> forward-porting the K42-Linux module to a Linux that supports the BPA
> (BE) configuration --or-- back-porting the Linux BPA support to the
> Linux module K42 is currently running.  The latter being a ridiculous
> exercise. For the former, some K42 members are currently working to
> update the Linux module in K42; I'm not sure if that update includes
> the BPA changes or how hard it would be to continue the new Linux
> module to one that supports the BPA config. Being that the SPU
> management interface is /sysfs in Linux, there may be more work than
> one would initially think.  I would caution that this would be more
> than a "side project".
> 
> more below.
> 
> On Dec 20, 2005, at 8:08 PM, Elvis John Dowson wrote:
> >         The initial idea for taking a look at K42 for the Cell was
> > to see if it can be used to support dynamic management via addition
> > and removal of Cell Computational Nodes from a Cell Cluster, and to
> > see if data can be shared and accessed via the individual SPUs, using
> > a shared memory model.
> hmm, so you are looking to use K42 to share memory across Cell nodes?
> What makes you think that K42 is currently useful for sharing memory
> between cluster nodes?
> 

I'm new to this, so I was hoping I could get an answer from the K42 forum.
By sharing memory between cluster nodes, I mean pointer accessibility to
the data in the local stores of the individual SPUs, and being able to
access memory using a sort of global address space. I read references to
the keywords 'cache coherent', 'scalable', 'hot swappable', etc., when
reading through some of the articles and present work being done on K42,
and I got the impression that one of the features of K42 was that it's
modular and supports, or will support, hot swapping of nodes in and out of
the system. Do correct me if I am wrong in my assumptions, for I haven't
carefully read the K42 documents, and have so far only tried to compile the
toolchain and the sources from scratch, which I realize now is not
something one would normally attempt. I just thought that if it worked for
Cell, it should work for K42 in a similar manner. So do let me know if I'm
on the right track or not. :-)


> > I posted a query regarding support for this type of operation, in
> > the hardware, on the cell forum and they said that the Element
> > Interconnect Bus on the Cell processor can be configured for
> > operation in a coherent mode, but would require the additional
> > support of a coherent memory switch.
> > http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=102913&cat=46
> The EIB connects all the processing, IO(BIC) and memory(MIC) elements
> in the chip.
> 
> When 2 (this implementation only supports 2) CBEs are connected by
> the BE Interface Unit (BEI), the BEI can be configured to form a
> fully snooping 2 way SMP bus, or a 2 node "cluster" with hi-speed IO
> bus interconnect that is also cache coherent.
> 
> Re: the "coherent memory switch". I'm not sure where this comes from,
> but I think Dan is just talking about standard cluster interconnects
> and not HW specifically for the CBE. I'd be pleasantly surprised to
> learn that such a switch is being built by someone. But, we can all
> wait for his reply :)

If you have to build a cluster of, say, 64 nodes, wouldn't you require a
high speed hardware interconnect? Older versions of SGI supercomputers were
configured into a hypercube topology using what they called a CrayLink
interconnect fabric, which I think was a hardware component. 

Here are some excerpts from some links I got while searching for "SGI
Hypercube hardware"

http://sc.jpl.nasa.gov/hardware/origin2000/using/ 

The SGI Origin 2000 Hardware at JPL
------------------------------------

The SGI Origin 2000 supercomputers are based on the R12000 RISC
microprocessor running at 300MHz, which due to its pipelined and superscalar
architecture is capable of 600MFlops peak. Each processor has a 32KB,
two-way, set-associative, on-chip data cache, a 32KB, two-way,
set-associative, on-chip instruction cache, and an 8MB secondary cache. 

Two processors are grouped with 1GB of memory and a communications and
memory management circuit (called a "hub") to form a node. Four nodes (with
8 processors and 4GB of main memory) are connected to two Craylink routers,
four XIO channels, various I/O cards, and power supplies to form a module.
The 64 nodes of our system are interconnected by a CrayLink interconnect
fabric in a hierarchical-hypercube topology. This type of system is known in
the literature as a Cache-Coherent, Non-Uniform Memory Architecture
(CC-NUMA). Programs can use either shared-memory directives or
message-passing libraries (MPI or PVM) to access memory on any node. 

Here is a diagram : 

http://sc.jpl.nasa.gov/hardware/origin2000/using/topology.htm 

The Origin 2000 is really really old now, but the design still holds good, I
guess. 
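
To make the hypercube idea concrete, here is a minimal sketch in Python. It
models a plain binary hypercube, not the hierarchical variant the JPL page
describes and not SGI's actual routing, and the function names are made up
for illustration. In a d-dimensional hypercube each node is linked to the d
nodes whose IDs differ from its own in exactly one bit, so 64 nodes need
d = 6 and any two nodes are at most 6 hops apart.

```python
def hypercube_neighbors(node: int, dims: int) -> list:
    """Return the IDs directly linked to `node` in a dims-dimensional hypercube."""
    if not 0 <= node < (1 << dims):
        raise ValueError("node ID out of range")
    # Flipping one bit of the ID gives the neighbor along that dimension.
    return [node ^ (1 << bit) for bit in range(dims)]


def hop_count(src: int, dst: int) -> int:
    """Minimum number of links between two nodes (Hamming distance of the IDs)."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    dims = 6  # 2**6 = 64 nodes
    print(hypercube_neighbors(0, dims))  # [1, 2, 4, 8, 16, 32]
    print(hop_count(0, 63))              # 6
```

The point of the sketch is only that the number of links per node grows with
log2 of the node count, which is why such a fabric needs dedicated routers.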

On searching for "CrayLink Interconnect", I came across this site:
http://biology.ncsa.uiuc.edu/library/SGI_bookshelves/SGI_Admin/books/Or2000_Rack_OG/sgi_html/ch01.html 

CrayLink Interconnect
----------------------

The Origin2000 modules are connected by the CrayLink Interconnect (also
known as the interconnection fabric). The CrayLink Interconnect is a set of
switches, called routers, that are linked by cables in various
configurations, or topologies. Here are some key features that define the
Origin 2000 interconnection fabric:

The CrayLink Interconnect is a mesh of multiple point-to-point links
connected by the routing switches. These links and switches allow multiple
transactions to occur simultaneously.

The links permit extremely fast switching (a peak rate of 1600 MB/sec
bidirectionally, 800 MB/sec in each direction).

The CrayLink Interconnect does not require arbitration, nor is it limited by
contention.

More routers and links are added as nodes are added, increasing the CrayLink
Interconnect's bandwidth. 

The CrayLink Interconnect provides a minimum of two separate paths to every
pair of Origin2000 modules. This redundancy allows the system to bypass
failed routers or broken fabric links. Each fabric link is additionally
protected by a CRC code and a link-level protocol, which retry any corrupted
transmissions and provide fault tolerance for transient errors. 

I just automatically assumed that you guys were working on similar lines and
had some sort of similar solution to interconnect more than 2 nodes. I mean
logically some sort of hardware interconnect bus should exist, I guess.


> > Maybe, from a K42 perspective, there may not be much to do SPU
> > specific and the host PPU could co-ordinate all I/O communications,
> > but if an SPU were to reference a DMA memory operation (queue/
> > request a read or write) to the local storage of another SPU, in a
> > computational grid, I guess some level of O/S support should exist.
> It is the responsibility of the SW on the PPE to provide the
> translations for the DMA units (MFC).
> This support is in linux today, along with a library to drive it.
> 
> If that target memory is on another CBE then the PPE would have to
> make the appropriate IO Translations to make sure that the memory
> area is mapped.  If the latter mapping is static, the FW can create a
> "standard" mapping, which it does.
> 
> How a "coherent memory switch" is configured for another node would
> be additional SW.
> 
> ..Now to dissect the following...
> > The initial work on UML I plan to do is to target specific SPU
> > code generation from specific UML Model Elements that have been
> > specifically tagged and stereotyped for the SPU.
> Here you plan to create an SPU/UML model so that you can generate
> C/C++ code for the SPU compiler, correct?
> Will your results be self-contained or will they require more from
> the PPU side other than loading and setting up DMA addressability?
> If they are self contained there are ways (or there used to be) to
> use systemsim to just run the SPU part.
> 

Yes, I plan to generate some SPU specific C/C++ code from the UML models,
using UML Stereotypes and UML Stereotype Tags. These specific extensions
will be stored in what is called a UML Profile. So, in effect, if I draw a
class called 'DisplayProcessor' in a class diagram and stereotype it using
<<SPUThread>>, then when I generate code, it should automatically use the
SPE Thread Library to use the SPUs. 
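
As a rough illustration of the stereotype-driven generation described above,
here is a minimal sketch in Python. It is not the modeling tool's generator:
the BACKENDS table and generate_class_stub are hypothetical names, and the
emitted C++ is only a skeleton with a comment naming the threading backend,
since the exact SPE Thread Library calls are left out on purpose.

```python
# Hypothetical mapping from UML stereotype to the threading backend the
# generated C++ should use. The names are illustrative only.
BACKENDS = {
    "SPUThread": "SPE Thread Library",
    "PPUThread": "POSIX threads",
}


def generate_class_stub(class_name: str, stereotype: str) -> str:
    """Emit a C++ class skeleton whose start() is tagged with the backend to use."""
    try:
        backend = BACKENDS[stereotype]
    except KeyError:
        raise ValueError(f"unknown stereotype <<{stereotype}>>") from None
    lines = [
        f"// Generated from UML class '{class_name}' <<{stereotype}>>",
        f"class {class_name} {{",
        "public:",
        f"    void start();  // to be implemented with: {backend}",
        "};",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_class_stub("DisplayProcessor", "SPUThread"))
```

Keeping the stereotype-to-backend choice in one table is the design point:
the model stays the same and only the table changes per target.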

The question regarding the Memory Flow Controller (MFC) is quite
interesting. I have been toying with the idea of a <<MFCThread>> to
represent a conceptual thread of execution that maps to the ability of the
DMA to execute background transfers. I also think that <<MFCThread>>
objects would be used by both <<SPUThread>> and <<PPUThread>> objects. It's
just an initial thought; I have to create a small working prototype to see
how well it works and if it's conceptually correct from a modeling
perspective, etc. 

So, in essence, I'm attempting to make it easier to program for the CBE
using conceptual UML models, and using automatic code generation, ensure
that the implementation is in sync with the intended design.


> > I have already tested the UML automatic code generation framework
> > for the PPU and it works.
> By "works" you mean it generates code and the code compiles and links?
> Is there anything Cell specific about the generated code?
> Can't you just use a standard LinuxPPC system? Since your code is 32
> bit, your HW requirements are minimal.

By the UML automatic code generation framework running and working for the
PPU, yes, I mean that the code compiles, links and runs on the PPU target.
There was nothing Cell specific about the generated code. I just had to
check for compatibility of the framework libraries with the toolchain. The
first problem I ran into was that the C++ runtime libraries were not
available for me to compile the framework for the PPU target. But once I got
that fixed, I noticed some minor problems with the target O/S adapter file,
fixed them, and the sample ran fine on the PPU. So, there is nothing special
here that I did, that the UML modeling environment doesn't support
out-of-the-box. 

However, the existing framework libraries will require a lot of adaptation
to support execution on the SPU target. This is primarily due to the
constrained environment defined for the SPU. So, it's more a matter of
removing support for features of the UML object execution framework for the
SPU, like iostream, etc. In terms of enhancement, I will need to make some
changes, such as overriding the default POSIX thread creation routines used
for the PPU target and replacing them with appropriate calls to the SPE
Thread Library, etc. But this is only a preliminary list. I will be writing
a requirements document and documenting all the delta changes.
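
The thread-creation override mentioned above amounts to choosing an adapter
per build target. Here is a minimal Python sketch of that shape, under the
assumption that the framework funnels all thread creation through one
adapter object. The class and function names are hypothetical, and the SPU
adapter is only a stub that records requests; a real one would call the SPE
Thread Library at that point.

```python
import threading


class PosixLikeAdapter:
    """Stand-in for the default adapter: runs the task on an ordinary thread."""

    def start(self, task, *args):
        worker = threading.Thread(target=task, args=args)
        worker.start()
        return worker


class SpuAdapterStub:
    """Placeholder for an SPU adapter; a real one would call the SPE Thread
    Library here. This stub only records what it was asked to launch."""

    def __init__(self):
        self.launched = []

    def start(self, task, *args):
        self.launched.append((task.__name__, args))
        return None


def make_adapter(target: str):
    """Pick the adapter for a build target ('ppu' or 'spu' in this sketch)."""
    adapters = {"ppu": PosixLikeAdapter, "spu": SpuAdapterStub}
    return adapters[target]()


if __name__ == "__main__":
    results = []
    worker = make_adapter("ppu").start(results.append, "hello from the PPU side")
    worker.join()
    print(results)
```

The rest of the framework would only ever call start(), so swapping the
target does not touch the generated application code.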


> > I'm just waiting for simulator network support for the Cell
> > Simulator so that the PPU application can be debugged at the model
> > level,
> ...Assuming you _do_ have a Cell dependency...
> Oh, Dude! this smells _way_ wrong.  From what I understand, all you
> want is for an app to have a character driven communication channel
> to the machine running the simulator.  Correct?
> Yes, with HW network is best, but there are _far_ more efficient ways
> to do this with systemsim than simulating an entire network stack.
> 
> Anyway, I would strive to make the communications channel as
> transport independent as possible so you can switch the underlying
> transport for different environments (sim-channel, TCP/IP, Shared
> memory, etc)

The object execution framework libraries instrument the C++ code to
facilitate target level debugging at the model level. So, during design
level simulation, you will generate your code using the animation libraries.
You will initialize the library in the application's main, giving the host
name and the tcp/ip port number for communication. The host is the one which
will have the UML model present. When you launch the application on the
target, the target will communicate with the host using tcp/ip and start a
design debug session. You can then control and inspect the internal state of
the application at the model level. 

For a production build, you will generate code without using the animation
library and run it on the target machine. 

If I had access to a real cell-blade server then there wouldn't be much of
an issue. For the system simulator, apparently, they plan on releasing a
version of systemsim for the cell very soon. So, once that comes out, it
should be possible to animate on the target. I read somewhere about the
tcp/ip stack simulation and how it's far more efficient to bypass it or
something, but I hope the tcp/ip functionality won't be affected. 

So, in short, it's nothing new that I am building here. It's already
supported by the UML modeling environment's object execution framework, if
you generate the code to support target level debugging and animation. The
framework libraries currently support tcp/ip as a means for transferring
target execution data back to the host machine. 
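
Jimi's suggestion of a transport-independent channel could look roughly like
the following Python sketch. It does not describe the framework's real
animation library: Channel, LoopbackChannel, TcpChannel and report_state are
hypothetical names, and the "object:state" wire format is made up. The idea
is only that the debug layer talks to an abstract channel, so TCP/IP can be
swapped for a simulator channel or shared memory later.

```python
import socket
from abc import ABC, abstractmethod
from collections import deque


class Channel(ABC):
    """Minimal message channel the debug/animation layer would talk to."""

    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...


class LoopbackChannel(Channel):
    """In-process transport, handy for tests or a simulator-side shortcut."""

    def __init__(self):
        self._queue = deque()

    def send(self, data: bytes) -> None:
        self._queue.append(data)

    def recv(self) -> bytes:
        return self._queue.popleft() if self._queue else b""


class TcpChannel(Channel):
    """TCP transport: connects to the host that holds the UML model."""

    def __init__(self, host: str, port: int):
        self._sock = socket.create_connection((host, port))

    def send(self, data: bytes) -> None:
        self._sock.sendall(data)

    def recv(self) -> bytes:
        return self._sock.recv(4096)  # returns b"" once the peer closes


def report_state(channel: Channel, obj: str, state: str) -> None:
    """Send one 'object entered state' event; the wire format is made up."""
    channel.send(f"{obj}:{state}\n".encode("ascii"))
```

A test or simulator run would pass a LoopbackChannel to report_state, while
a run on real hardware would pass TcpChannel(host, port); the TCP variant
here does no message framing, which a real protocol would need.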

> 
> > for example,
> >
> > real-time animation of the UML sequence diagram,
> > to show external message communication between active objects in
> > the system,
> > the state of the threads running in the system,
> > dynamic UML state-charts that show the current state of the
> > individual objects in the system,
> > plus inspection of object states at the UML model level from within
> > the model browser and inspect the current state of the object
> > attributes at the UML model level, at run-time
> 
> Cool.
> > Note that GDB is not required for debugging and animating the UML
> > model; you are effectively debugging and animating the model at the
> > design level. If you find an execution trace that is not right or
> > an object stuck in a particular state, it would be possible to
> > try to launch a GDB session and debug the code at the code /
> > assembly level, concurrently with the UML model, being debugged at
> > the design level. I have been working in this manner for all the
> > projects that I have worked on, in my previous companies (British
> > Aerospace and Snecma Aerospace) for both real-time asynchronous
> > (UML) and real-time synchronous (SCADE) model driven development
> > environment for real-time embedded avionics safety critical systems
> > and desktop scientific engineering application development.
> >
> >
> >
> > Right now I'm attempting to define a CBE UML Profile for the Cell
> > Processor; the outline steps can be found in this post:
> > http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=102743&cat=46
> hmm, since we have no immediate plans to create a K42/Cell
> programming model, the first hack on K42 would simply allow the Linux/
> BPA SW stack to run on K42, in the same way that the Linux user app
> stack runs today.  So, at least to me, extending your favorite
> POSIX/C++/UML models to perform the PU-to-SPU transitions using the
> Cell SDK makes a whole lot more sense.
> >
> >
> > At some stage, after I finish this work, I would like to
> > investigate the K42 O/S to see if it can be used to support the
> > development of a dynamically reconfigurable and scalable
> > computational cluster, identify the specific portions that would
> > need to be adapted and estimate the effort for this adaptation.
> Now this would be cool.
> > Having a UML model representation of the K42 sources would have
> > helped, first in being able to understand the O/S, and second, in
> > being able to completely generate the code from the UML models and
> > debug it at the model level.
> Understand, IMNSHO, we are kernel developers using C++ to solve a lot
> of tedious programming issues.  We generally are not in favor of
> "generated" code and tho' there is use of C++ Templates their
> usefulness is arguable.

Sure, I understand this perfectly. :-) !
> 
> > The same techniques that I have outlined in the dW post about the
> > CBE Profile can be adapted to work for creating specific C/C++
> > emitted code templates using an automatic code-generator that will
> > parse a K42 specific UML Meta-Model. This meta-model, when used
> > along with the C/C++ ( or virtually any programming language,
> > including assembly ) code generation rules, can be applied to a UML
> > model created and code automatically generated for that UML model.
> >
> >
> Thanks for all this info.



