Building the Server and the Nodes
The first thing to build is the server, as you will need it up and running when you install the software on the nodes. To a large extent, follow the instructions of the motherboard manufacturer. If you are in a dry environment, get a wrist strap to ground yourself while you handle the components! The little static shocks you get when you grab a door handle can burn through the silicon in the chips like a laser.
First mount the motherboard to the chassis. There are a variety of screws and washers and numerous holes on the chassis that accommodate several types of motherboards. Try to use as many metal washers as possible, as they ground the board better.
Then install the RAM and the CPU. You will probably have to put the fan on the CPU. The fan kits vary and usually don't carry instructions, but try to assemble something that makes sense. Remember to plug the CPU-fan connector into the motherboard, and to select the appropriate CPU/bus ratio (the speed of the CPU relative to the bus). These are jumpers on the board, and the manual should say how to set them.
While you are doing this, have someone install the CD-ROM, hard drive, and floppy drive in the case. After that, mount the chassis in the case, install the NICs (you have two for the server) and the video card (can be PCI or AGP), and plug in the power supply cable, the cables for the peripherals, and the power button, reset button, LED, and speaker cables. Close the thing, and voilà, you are done.
Turn it on, and check that the memory test runs and that the boot order is right: it should look in the floppy first, and then in the hard drive. If it doesn't, change the boot order in the BIOS settings. Since nothing is installed yet, you should get a message that says 'No bootable media found' or something to that effect. That means you are OK at this point.
Building the nodes is pretty much the same as building the server, but easier because you won't have to install the CD-ROM. As you get done, fire them up to see if they work, and put them in a big pile until you are done with all of them. We found that building more than four without breaks makes you prone to mistakes, so take your time. If you have an army of people helping, no need to worry.
A Note on CPU fans
CPU fans are a nightmare. Although they cool the CPUs all right, the bearings on the cheap fans start rattling like crazy after a very short time. We had to change a few of them in the nodes, because the boxes sounded as if they had a V8 engine at full throttle inside. A bit of WD-40 seems to help, but it is only a temporary fix, and you don't really want to have oil and stuff like that near the CPU. Bottom line: buy a set of good CPU fans...
Wiring the Beast
There are two aspects concerning the wiring. One deals with the Ethernet cables that interconnect the nodes, and the other one with the AC power.
Instead of buying patch cables, which are pretty pricey and of predetermined lengths, we made our own. The switch connects to the NICs on the nodes using straight-through patches, which makes building the cables pretty straightforward. Although our network guru said that Ethernet cables have a particular arrangement for the conductors (which I believe has a reason), the short distances between the nodes and the switch do not pose a threat in our private network. We just made cables in which we had the same type/color of conductor in the same position on the RJ45 connectors at both ends of the cable. Use a good crimping tool: many times we had connections failing because of bad crimping. To decide on the lengths of the cables, just measure from each individual box to the switch (try to follow the legs of the rack so that you won't have a huge mass of cables right in your face all the time), and allow 10 inches of slack for connections, cutting, etc.
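If you want a cut list before you start chopping cable, the measuring rule above can be sketched as a few lines of shell. The three distances in the loop are made-up examples, and cut_length is just a helper name picked for this sketch; plug in your own measurements.

```shell
#!/bin/sh
# Cable cut list: measured run plus slack for connectors and trimming.
# The distances below are made-up examples; measure your own rack.

SLACK_INCHES=10   # slack suggested in the text

# cut_length MEASURED_INCHES: print the length to cut, in inches
cut_length() {
    echo $(( $1 + $SLACK_INCHES ))
}

total=0
for run in 24 36 48; do
    len=$(cut_length "$run")
    echo "measured ${run} in, cut ${len} in"
    total=$(( total + len ))
done
echo "total cable needed: ${total} in"
```

The total at the end is handy when deciding how big a spool of cable to buy.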
As for power, you'll need plenty. Each box has a 250 Watt power supply, and although the nodes won't draw that much power from it (you have pretty much no boards on them), be conservative. 17 power supplies at peak operation are 250 * 17 = 4250 Watts. If you add the switch and a couple of monitors (one for the server and a nomadic one for the nodes), you end up, roughly, with 5000 Watts. Depending on how good your electronics 101 course was, you probably figured out that you'll need a power line and a breaker for ~50 Amps (5000 Watts divided by a line voltage of about 110 Volts gives roughly 45 Amps). Don't just plug it into the wall: get your electrician to install an isolated power line/lines for the cluster. The power distribution on the rack/shelf is easy: one power strip per shelf should take care of it, plus a power strip for the server, its monitor, and other peripherals.
At this point, you can put all the boxes on the shelf and plug them into the switch. If you run into trouble during the installation of the software on the nodes, you can always use the public network and then go back to the private network.
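To redo the power estimate for a different number of boxes, here is a minimal sketch in shell. It uses the figures quoted above; the 750 Watts for the switch and monitors is simply what gets you from 4250 to roughly 5000, and the 110 Volt line voltage is an assumption, so change both to match your site. The function names are made up for this example.

```shell
#!/bin/sh
# Rough power budget for the cluster.
# LINE_VOLTS and the 750 W allowance are assumptions; adjust for your site.

# peak_watts BOXES WATTS_PER_SUPPLY EXTRA_WATTS: print the total peak load
peak_watts() {
    echo $(( $1 * $2 + $3 ))
}

# amps_needed WATTS VOLTS: print the current in amps, rounded up
amps_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

LINE_VOLTS=110
total=$(peak_watts 17 250 750)
echo "Peak load: ${total} W"
echo "Current at ${LINE_VOLTS} V: $(amps_needed "$total" "$LINE_VOLTS") A"
```

The result is a worst-case figure, which is what you want when talking to the electrician about the breaker.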
Installing Linux and Cloning the Nodes
Again, you should install the operating system on the server first, as the nodes will use the server for their install. In our case we used the RedHat 6.0 distribution, with the 2.2.5-15 kernel, but you can use others. The nice thing about the RedHat distribution CD is that when your machine has nothing, you can even boot from the CD-ROM: no need to make a boot disk.
If you have to make a boot disk, just follow the instructions from RedHat, and that will take care of it. Follow the installation instructions, and install all the packages. Even if you are not interested in editing pictures or setting up a web server, the complete installation takes only 1 GB, and it saves a lot of time not having to decide which packages to install.
An important part of this is giving your server its final IP address, and its local network address. Remember that you will have two NICs on the server: one will connect to the world, and the other one is used to connect to the private network. The configuration of this is pretty simple with PCI cards, and requires only a little more effort if you have ISA cards that require an IRQ and an I/O address. This is well described in the Linux Installation Manual.
Once your server is up and running and plugged into the network (either private or public), you can start installing the software on the nodes. You will need a network boot disk for this. Again, you can make this by going to the RedHat page. You will have to do either an NFS or an FTP installation, and for this you need the RedHat CD-ROM in the CD-ROM drive of the server, the CD-ROM mounted ('mount /mnt/cdrom'), and a user account on the server. You have to specify the IP address of the box you are installing the software on, and the IP of the server that will have the software. If you are using the local network, these numbers will come from the private ranges: 192.168.x.x, 10.x.x.x, etc. The server should have the first number (e.g., 192.168.0.1), and the nodes use consecutive numbers (192.168.0.2, 192.168.0.3, etc.). If you are using the public network, your sysadmin will have to give you a couple of IP addresses you can use temporarily.
The server has two 'names'. One corresponds to the world (e.g., chaingang.usip.edu), and one to the local network (e.g., node0.chaingang for the server). One name corresponds to one numeric IP address, and the other one to the other. Analogously, one name corresponds to one of the NICs, the other one to the other. For software (MPI) related reasons, it is better to assign the name on the local network as the hostname of the server.
Back to our original program. Follow the installation as you did for the server. Again, install the whole thing. It's a pain to try to figure out what is needed and what is not, and you will probably have more than 2 GB on the nodes. You now have to do this 16 times, each time changing the name of the node: node1.chaingang, node2.chaingang, etc. Get a big jar of coffee, as this takes a while...
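As a rough illustration of the private-side NIC, a RedHat-style interface file could look something like the fragment below. This is a sketch and not our actual file: the file name, the choice of eth1 as the private interface, and the 192.168.0.1 address are all assumptions, so check them against the installation manual and your own setup.

```shell
# Hypothetical contents of /etc/sysconfig/network-scripts/ifcfg-eth1
# (assumes eth1 is the NIC on the private network; yours may be eth0)
DEVICE=eth1
BOOTPROTO=none
IPADDR=192.168.0.1
NETMASK=255.255.255.0
NETWORK=192.168.0.0
BROADCAST=192.168.0.255
ONBOOT=yes
```

The public NIC gets its own file of the same shape, with the address your sysadmin gives you.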
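To keep the numbers and names straight, it may help to generate the whole table in one go. The sketch below assumes the 192.168.0.x range and the nodeN.chaingang names used here, with node0 as the server; hosts_table is a made-up helper name. The output has the usual hosts-file layout (address, then name), so it could be pasted into /etc/hosts if that is how you choose to resolve the names.

```shell
#!/bin/sh
# Print address/name pairs for the private network.
# Assumes 192.168.0.x addresses and nodeN.chaingang names; node0 is the server.

# hosts_table NODE_COUNT: print one line for the server plus one per node
hosts_table() {
    count=$1
    i=0
    while [ "$i" -le "$count" ]; do
        printf '192.168.0.%d\tnode%d.chaingang\n' $(( i + 1 )) "$i"
        i=$(( i + 1 ))
    done
}

hosts_table 16
```

With 16 nodes this prints 17 lines, from node0.chaingang at 192.168.0.1 up to node16.chaingang at 192.168.0.17.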
Another way of cloning the software on the nodes is to do a disk copy after having installed the software on the first node. Basically, plug a second hard drive into the open IDE port of the motherboard of the first node, boot the machine, log in as 'root', and at the prompt (or in a shell) type:
bash# dd if=/dev/hda of=/dev/hdb
(or /dev/hdc, depending on where the second hard drive ends up). This copies the contents of the first node's hard drive onto the second. If you have any NFS-mounted filesystems (like the CD-ROM from the server), remember to unmount them first ('umount -a'), otherwise you'll be copying these too into your disk image. One thing against this method is that you have to do a lot of reboots, you have to fiddle around with the cables, power the things up, power them down, power them up, power them down, etc. Furthermore, both hard drives have to be physically identical (not entirely true, but safer). Our advice: it is very good as a one-off, if you are installing one additional node.
Now you will have to go into the nodes one by one and reconfigure the Ethernet card (if you did the install using a public IP), and you will have to modify some files and permissions to allow the cluster to work properly. Instead of repeating stuff that is pretty much out there, follow the instructions laid out by the people at Xtreme Machines, or those from CACR at Caltech.
At this point, you should be able to plug everything together (if it is not plugged in already), and you are ready to go. You will probably need to install MPI or PVM to run the cluster, but this needs to be done only on the server and it's pretty straightforward.
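Since dd will happily overwrite whatever you point it at, it can be worth printing the commands and reading them before typing anything as root. The following is a minimal sketch of that idea: clone_plan is a hypothetical helper that only prints the unmount and copy commands described above and refuses obviously wrong arguments. It does not touch any disk.

```shell
#!/bin/sh
# Dry run for the disk copy: prints the commands, runs nothing.
# clone_plan is a hypothetical helper; review its output before acting on it.

# clone_plan SOURCE TARGET: print the unmount and dd commands
clone_plan() {
    src=$1
    dst=$2
    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: clone_plan SOURCE TARGET" >&2
        return 1
    fi
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same device: $src" >&2
        return 1
    fi
    echo "umount -a"
    echo "dd if=$src of=$dst"
}

clone_plan /dev/hda /dev/hdb
```

If the second drive shows up as /dev/hdc, call it as 'clone_plan /dev/hda /dev/hdc' instead.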
Final Details
In our case, we had to do one more thing. We had 3Com 905B 100/10 NICs, and RedHat 6.0 has no drivers for this card. Although they work with the 3c59x.o module for 3Com 59x series cards, there are reported problems. 3Com now has a driver prepared by Donald Becker at NASA that will work with the 3Com 905B and other newer 3Com cards. It can be found here, and the installation is pretty simple if you follow the instructions.
Keyboardless/Monitorless Booting
This was sort of a scare initially. The nodes don't have a keyboard, mouse, or monitor, and they should boot without them. We fired up one of the nodes and 'beep-beep-beep-beep...': trouble. The node would not boot. We plugged in a monitor and saw a message that said 'Keyboard not found. Press F1 to continue'. Who thought of this message? How in the world are you going to press F1 when you don't have anything to press? This is a BIOS problem, in our case AMIBIOS. Although the SuperMicro manual for the motherboard said that there was an option to bypass the keyboard detection, we could not find it. We did find that with the boot set to quickboot (a mode in which many of the detections and memory scans are bypassed), things went OK. The nodes still beep plenty due to the lack of monitors, but it does not affect the boot process or the performance.