Distribution: All future
Hardware Environment: All must be of compatible architecture
Software Environment: Transparent alterations to process and thread management
Problem Description: I have a bunch of s****y computers
Steps to reproduce: Upgrade and get new machines over the years

What I wish to propose is a system called RAIP-U (Redundant Array of Interdependent Processing Units, "rape you"). It allows a fault-tolerant (semi-FT?) method of parallel processing using many boxes and their RAM. I will outline it lightly here; more detail may appear later, but I don't know what goes on inside process management. Alterations to the fundamental process and thread management are obviously needed: the same interface has to be able to check a process ID or thread handle, find where it lives, and find who holds the backups.

First is the easiest part: connectivity. The protocols should work over TCP/IP, over direct (crossover) connections with a special protocol (yes, with NICs, but a switch/hub can't be used), over a serial line (serial-serial, parallel ports, even USB-USB), or over anything else you can think of. The protocol should be tight, to reduce latency. This is the most important part.

Next is what is communicated and how things are handled. I will break this down. Each part should be as independent as possible (so RAM and hardware sharing can be enabled while process load sharing is off).

PROCESSES

The main thing is process transfers. A central Master sends jobs out to one of the Slaves, chosen by its stats and load. A job is a thread or a process. The job always carries the Slave/job IDs of its owners (related processes, the one that forked it, or the process the thread is part of) and related threads, as well as the Master/job ID of the host process (and the Master system).
When these Slaves make a new thread or fork a process, they make their own decisions, passing the job to another Slave or back to the Master, and they still carry the same data (Slaves never become Masters).

When processes have to send data, a buffering scheme should watch how far they get without a break in the sending. For example, if a job sends data and then executes N instructions, halts, has already put X bytes into the IPC buffer, or otherwise seems to need the data sent NOW, the data goes out. It goes to whatever machine holds the target process or thread. No buffering is done at all if the target job is on the same machine; IPC works EXACTLY the same if at all possible.

The Master is always the physical hardware and software that the job is on. Always. If a job asks about its hardware, or tries to write to hardware or a driver, the request is sent to the Master, which handles it and responds as if the job were running on the Master. If a job seems to need extensive communication with another job or with the Master, it is relocated to the machine that has that job, or to the Master. If it seems particularly... pointless... it may be relocated to a Slave, if and only if the Master is approaching 95%+ CPU usage.

If a job sends or receives data relatively infrequently, it may be subject to RAIP-U fault tolerance. If it appears safe and easy, a copy of the job in its current state can be placed on another Slave or back on the Master. Then, if that job blinks out, the related jobs and the Master trigger a switchover to a machine holding a fault-tolerance copy.

An API to all of this should be provided, allowing a process to disable auto-relocation, auto-fault-tolerance, and automatic buffering, so that jobs can handle these most efficiently themselves. For example, a music program may split off its mixing thread, place it on a Slave, give itself a large output buffer, and have the mixer make a fault-tolerance backup.
Then, if the machine with the mixer dies, the job that handles it is notified (via an API callback with a previously registered function pointer), the job can resend any data that wasn't processed and returned, and it can then tell the mixer job to make another RAIP-U FT backup.

As a final note, it may be possible to use other machines with similar but more advanced hardware to get around hardware incompatibility (e.g. an Athlon's 3DNow! when you have only a 386 as the Master).

RAM

Oh, there's more to this thing than just process sharing. The next module is inter-boxen RAM, an extension of virtual memory. Basically, no, you don't map RAM by job; you map RAM as... well... blocks of RAM. Just as a 30 MB partition on your HDD can be a virtual-memory block, so can a 200 MB segment of the 4 GB of RAM on Slave A. You could use a set of machines (even diskless ones) purely as RAM, to the point where a box holds just what the OS needs and the rest is shared RAIP-U RAM. You should be able to use this to indirectly access virtual memory, but only the Master manages it, lest the stupid Slaves make a loop ("give me some of your RAM as my virtual memory, which is in turn some of your virtual memory that you got from someone else's RAM...").

HARDWARE

This is the scariest part. Each Slave holds a definition of its own hardware, if it has drivers for it or can otherwise read from/write to it. This allows the Master to map out device nodes for the Slaves' devices (burners, HDDs, etc.) in /dev, and (somehow) hook the drivers to communicate over the RAIP-U connection to share, say, HARD DISKS! (YES!) Some 400 SCSI disks, maybe? 80 USB 2.0 ports?

Netware/RAIP-U

The kernel should be able to send a copy of itself to act as a dedicated RAIP-U server if it gets a Netware or RAIP-U network boot request (diskless machines using BOOTP?). This should work, provided the machines are of similar architecture.

Go think about it; I don't know if I missed anything. Bye.

--Bluefox