This page describes in detail the components and operation of the compute cluster at ACCRE. For researchers requiring text for grant proposals and publications, we also provide a less technical Summary of the ACCRE Facility.
The compute cluster currently consists of more than 600 Linux systems with quad- or hex-core processors. Each node has at least 24 GB of memory and dual Gigabit copper ethernet ports.
- Details of the Cluster Design
- Detailed Node Configuration
- GPU cluster
- Cluster Filesystem – GPFS
- Cluster Storage and Backup
- Resource Allocation
- Installed Applications
The design reflects significant input from the investigators who use it in their research and education programs. Factors such as the number of nodes, the amount of memory per node, and the amount of disk space for storing user data are determined by demand. A schematic diagram of the system is shown below, followed by a glossary of terms used in the subsequent description of the cluster:
Bandwidth -- The rate at which data can be transferred over the network. Typically this is measured in Megabits/sec (Mbps) or Gigabits/sec (Gbps).
Bi-section Bandwidth -- The amount of data per second that can be transferred across an imaginary divider that splits the network topology in half. The theoretical maximum bi-section bandwidth is half the total bandwidth of all connected computers.
Brood -- The fundamental cluster building block. A group of 40 or 48 x86_64 compute nodes, plus a gateway and associated network switch.
Compute Node -- A node whose purpose is to run user applications. Users will not have direct access to these machines. Access will be granted solely through the job scheduler.
Disk Server -- A machine that is used for data storage. Users will not normally have access to these machines. For more information see the description of GPFS below.
Fast Ethernet -- Commodity networking used in most desktop computers, with a bandwidth of 100 Mbps and latency of up to 150 microseconds in Linux.
Gateway or Management Node -- Computer designed for interactive login use. Users log in to these machines for compiling, editing, and debugging of their programs and to submit jobs. There is one gateway/management node per brood.
Gigabit Ethernet -- Commodity networking typically found in servers, with a bandwidth of 1 Gbps and latency of up to 150 microseconds in Linux.
10 Gigabit Ethernet -- High-performance networking typically found in servers, with a bandwidth of 10 Gbps and latency of up to 150 microseconds in Linux.
Latency -- The amount of time required to “package” data for sending over the network.
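The glossary terms above combine in a simple way: the time to move a message across the network is roughly the latency plus the payload size divided by the bandwidth. A small sketch (the message size is an arbitrary example; the bandwidth and latency figures come from the definitions above):

```python
# Approximate one-way transfer time for a single message:
# total time = latency + (payload bits / bandwidth).

def transfer_time_s(payload_bytes, bandwidth_bps, latency_s):
    """Approximate transfer time in seconds for one message."""
    return latency_s + (payload_bytes * 8) / bandwidth_bps

one_mib = 1024 * 1024  # a 1 MiB example payload

# Fast Ethernet: 100 Mbps, up to ~150 microseconds of latency
fast = transfer_time_s(one_mib, 100e6, 150e-6)
# Gigabit Ethernet: 1 Gbps, similar latency
gige = transfer_time_s(one_mib, 1e9, 150e-6)

print(f"Fast Ethernet:    {fast * 1000:.2f} ms")
print(f"Gigabit Ethernet: {gige * 1000:.2f} ms")
```

Note that for large messages the bandwidth term dominates, while for very small messages the fixed latency dominates, which is why latency matters so much for tightly coupled parallel codes.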
The ACCRE cluster has a flexible network topology which can accommodate different users and their needs while leaving room for expansion. There are two separate functional networks: (1) external connectivity, and (2) management and application.
The gateways are connected to the Internet through the external network, which is composed of gigabit ethernet links to one of our core 10 gigabit switches. Vanderbilt University has a 10 gigabit connection to the Internet via ACCRE. The management network is used both for data traffic, such as GPFS, and for health monitoring of the nodes. As a result, the management network must be scalable and provide high bandwidth; gigabit ethernet is sufficient. The network design is a classical fat tree with all of the disk servers and management nodes connected at the top level, allowing their full bandwidth to be deployed to any brood. Only a constant incremental cost is incurred for each additional machine.
For the x86 broods the rest of the management network is local within a brood as can be seen from the diagram. Each brood has a local switch. This brood switch is connected to the top level switch using a 10 gigabit uplink. The gateway and compute nodes are connected to this same brood switch with gigabit ethernet.
Node installation, maintenance updates, and health monitoring
Initial OS installation is accomplished using the open source software SystemImager. We make an image of the configured and operational system, subsequently replicating this image across all nodes. Multiple images for different compute node configurations are stored on an infrastructure node. By using this technique it becomes possible to perform a wipe and re-install of the entire cluster in a short period of time. Similarly, SystemImager is also used to update a node, transferring only the data needed for the update instead of the entire image. Because of this intelligent update, most common maintenance operations, such as updating the kernel, take mere minutes.
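The intelligent update described above transfers only files that differ between the golden image and the node, which is why kernel updates take minutes rather than the time of a full reinstall. The following toy sketch illustrates the delta-update idea (SystemImager itself is rsync-based; the function and path names here are illustrative, not SystemImager's actual API):

```python
# Toy model of a delta update: compare each file in the golden image
# against the node's copy and transfer only the files that differ.
import hashlib
import os
import shutil

def file_digest(path):
    """Content hash used to decide whether a file needs transferring."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def delta_update(image_dir, node_dir):
    """Copy only the files in image_dir whose contents differ on the node.

    Returns the list of file names that were actually transferred.
    """
    transferred = []
    for name in os.listdir(image_dir):
        src = os.path.join(image_dir, name)
        dst = os.path.join(node_dir, name)
        if not os.path.exists(dst) or file_digest(src) != file_digest(dst):
            shutil.copy2(src, dst)
            transferred.append(name)
    return transferred
```

Running this twice in a row transfers nothing the second time, which is the property that keeps routine maintenance fast.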
Each x86 brood can be configured as a stand-alone mini-cluster. This mini-cluster can then be used to test software and hardware updates without interfering with the rest of the cluster.
The health of the compute nodes within a brood is monitored through the use of the open source package Nagios, which supports distributed monitoring. Distributed monitoring allows data from individual management nodes to be collected on a single machine for viewing, analysis, and problem notification.
The ACCRE Linux cluster is composed of servers with dual quad-core and hex-core Intel processors. The cluster has over 6,000 processor cores, and its theoretical peak performance is roughly 24.5 TFLOPS. As shown in the table below, there is a heterogeneous mix of memory configurations.
| # Processor Cores | Processor Speed (GHz) | Memory (GB)* | Processor Type |
| --- | --- | --- | --- |
| 2184 (273 dual quad) | 2.3 | 24.0-128.0 | Intel Xeon Westmere |
| 1056 (88 dual hex) | 2.4-3.0 | 48.0-64.0 | Intel Xeon Westmere |
| 2352 (196 dual hex) | 1.9 | 128.0-256.0 | Intel Xeon Sandy Bridge |
| 504 (42 dual hex) | 2.4 | 128.0 | Intel Xeon Haswell |
* For available memory, see this FAQ.
We currently have 32 GPU nodes available to all cluster users. Each node has 4 NVIDIA GTX 480 GPUs.
| # Compute Processors | Processor Speed (GHz) | Memory (GB) | Available Memory (GB) | Node Names | Node Features |
| --- | --- | --- | --- | --- | --- |
| 256 (32 dual quad) | 2.4 | 48.0 | 43.1 | vmp801-vmp848 | x86,westmere |
Users who are not in a GPU group may only run jobs of less than 4 hours (in some cases up to 24 hours; these limits will likely be shortened in the future) on these nodes. However, any user can be added to a GPU group by request. Users who want to reserve a node for code development for a limited period of time should also submit such a request via a helpdesk ticket.
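A minimal SLURM batch script for one of these GPU nodes might look like the sketch below. The time limit reflects the 4-hour cap for users outside a GPU group; the job name, output file, and application are placeholders, and the exact partition or account to use should be taken from ACCRE's SLURM documentation:

```bash
#!/bin/bash
# Illustrative GPU job script -- resource values are placeholders;
# consult ACCRE's SLURM documentation for the correct partition/account.
#SBATCH --job-name=gpu-test
#SBATCH --nodes=1
#SBATCH --gres=gpu:1            # request one of the node's four GPUs
#SBATCH --time=4:00:00          # limit for users not in a GPU group
#SBATCH --output=gpu-test-%j.out

nvidia-smi          # report which GPU(s) this job was allocated
./my_cuda_program   # placeholder for your application
```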
One of the fundamental challenges of building a Linux cluster is to make sure that data is available when needed to any CPU in any node in the cluster at any time. Having a cluster with the very latest CPUs from AMD, Intel, or IBM means very little if those CPUs spend much of their time idle waiting for data to process. There is even a joke which says, “A supercomputer is a device for turning a compute bound problem into an I/O bound problem!”
In small clusters data is typically made available to the nodes via a Network File System (NFS) server which stores all of the user data and exports it to the entire cluster. Applications can then access data as if it were local. However, there are two significant disadvantages to this approach. First, the NFS server is a single point of failure. If the NFS server is unavailable for any reason then the entire cluster is unavailable for use and jobs which may have been running for weeks may be killed. This single point of failure can be eliminated by clustering two (or more) servers, but this can quickly become very costly and, more importantly, does not solve the second significant disadvantage of NFS: poor performance. NFS typically does not scale to much over 100 MB/second bandwidth. Because of this ceiling, performance may be adequate for small clusters of up to approximately 150 nodes; scaling to a greater number of nodes causes the limitations of NFS to quickly become apparent.
Because the ACCRE cluster began as a small (120 nodes) cluster, ACCRE initially took the NFS server approach. However, by the fall of 2004 both of the limitations of the NFS server approach had become apparent and a search for a more robust, higher performance solution was initiated. After a lengthy evaluation process, ACCRE selected IBM’s GPFS (General Parallel Filesystem) to replace NFS. GPFS was placed into production in August of 2005.
In a GPFS cluster, two or more servers are set up as GPFS cluster and filesystem managers. In addition, two or more servers are set up as disk I/O servers called Network Shared Disk (NSD) servers. NSD servers are connected to one or more storage arrays in a redundant configuration, typically via a Storage Area Network (SAN). In a redundant SAN configuration, every NSD server and storage array is connected to two or more SAN switches, allowing all NSD servers to see the storage in all of the storage arrays. In the event of a SAN switch failure, GPFS uses a feature of the Linux operating system called multipathing to redirect I/O through a different SAN switch. The disk drives in the storage arrays are configured into RAID devices, which protect against the failure of a disk drive. Each RAID device is then “seen” by the NSD servers as a “disk” which may be used in a filesystem. When a GPFS filesystem is created, a primary NSD server and up to 7 backup servers are defined for each disk. In the event of a failure of the primary NSD server for a disk, the first available backup server takes over for it transparently. This means that, if configured properly, GPFS filesystems have no single point of failure.
In a typical configuration a GPFS filesystem will contain multiple disks and the data will therefore be striped across those disks. This means that during normal operations I/O requests will involve multiple NSD servers and multiple disks working in parallel, hence the name General Parallel Filesystem. In addition, GPFS offers the ability to separate data from metadata. This prevents someone who is doing an “ls” on a directory containing a very large number of files from impacting the performance of someone else who is attempting to read the data contained in a single large file. As many ACCRE users do have large numbers of small files we have implemented this feature.
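The striping described above can be pictured with a small round-robin model: a file's blocks are dealt out across the disks in turn, so a sequential read of a large file engages every disk (and its NSD server) in parallel. The block size and disk count below are tiny illustrative values, not GPFS's actual parameters:

```python
# Toy model of GPFS-style striping: blocks are distributed round-robin
# across disks, so large sequential I/O involves all disks in parallel.
BLOCK_SIZE = 4  # bytes; real GPFS blocks are far larger (e.g. megabytes)

def stripe(data, num_disks):
    """Split data into blocks and deal them round-robin onto the disks."""
    disks = [[] for _ in range(num_disks)]
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i, block in enumerate(blocks):
        disks[i % num_disks].append(block)
    return disks

def reassemble(disks):
    """Read the stripes back in round-robin order (a sequential read)."""
    blocks = []
    i = 0
    while True:
        disk = disks[i % len(disks)]
        idx = i // len(disks)
        if idx >= len(disk):
            break
        blocks.append(disk[idx])
        i += 1
    return "".join(blocks)
```

With 5 blocks striped over 3 disks, disk 0 holds blocks 0 and 3, disk 1 holds blocks 1 and 4, and disk 2 holds block 2; reassembling them in round-robin order recovers the original file.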
In a GPFS cluster, the client machines (cluster gateways and compute nodes) all run a GPFS daemon which allows them to access the GPFS filesystems as if they were local; from a user's perspective, the fact that the files are not local to the node they are using is irrelevant.
ACCRE is currently on its second generation of GPFS hardware (see diagram). Each of the 9 NSD servers has a 10 gigabit ethernet network card, so the current setup is capable of sustaining 9 gigabytes per second of I/O. For comparison, a standard DVD holds 4.7 gigabytes. The capacity of our GPFS filesystems can be increased if necessary by adding storage arrays, and their performance can be increased by adding NSD servers. Please note that both of these can be done without downtime.
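A quick back-of-the-envelope check on the aggregate figure: nine 10 Gbps NICs give 90 Gbps, or about 11.25 GB/s of raw wire bandwidth, and the ~9 GB/s sustained figure corresponds to roughly 1 GB/s of usable throughput per 10 GbE card once protocol overhead is accounted for (the per-card efficiency here is an assumed rule of thumb, not a measured ACCRE number):

```python
# Aggregate I/O bandwidth of the NSD tier, theoretical vs. sustained.
NSD_SERVERS = 9
NIC_GBPS = 10           # one 10 gigabit ethernet card per NSD server
USABLE_GBYTES_PER_NIC = 1.0  # assumed rule of thumb for 10 GbE in practice

theoretical_gbytes = NSD_SERVERS * NIC_GBPS / 8   # raw wire bandwidth in GB/s
sustained_gbytes = NSD_SERVERS * USABLE_GBYTES_PER_NIC

print(f"Theoretical: {theoretical_gbytes:.2f} GB/s")
print(f"Sustained:  ~{sustained_gbytes:.0f} GB/s")
```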
ACCRE also has implemented the standard cluster practice of having two filesystems, one for /home and one for /scratch. In addition, ACCRE has implemented a /data filesystem. The /data filesystem is stored on the same physical disk hardware as the /home filesystem. However, there are two differences between /home and /data: 1) additional quota can be purchased for /data but not for /home, and 2) tape backups are retained for a minimum of one month on /data and two months on /home. Files stored in /scratch are not backed up to tape.
Before implementing the second generation of GPFS hardware ACCRE staff reviewed the possible alternatives to GPFS (Lustre, Panasas, BlueArc) again. While GPFS is not perfect we concluded that none of the other options offered any benefits over GPFS. In addition, several of them are significantly more expensive than GPFS. GPFS is a mature, stable product that is supported by IBM and used by many of the top supercomputers in the world.
The True incremental Backup System (TiBS) is used to back up the ACCRE cluster home directories nightly. Currently the IBM TS3500 Tape Library is used for the cluster disk backup. A big advantage of TiBS is that it minimizes the time (and network resources) required for backups, even full backups. After the initial full backup, TiBS takes only incremental backups from the client. To create a new full backup, an incremental backup is taken from the client; then, on the server side, all incrementals since the last full backup are merged into the previous full backup. This takes the load off the client machine and the network, and the integrity of the previous full backup is verified in the process. Please see our disk quotas and backups policies for more information. (TiBS is available for all current operating systems, and apart from the cluster, ACCRE also offers backup services for data located remotely. This service is through special arrangement; if you are interested, please see our Tape Backup Services and contact ACCRE Administration for more details.)
A central issue in sharing a resource such as the cluster is making sure that each group receives its fair share if it is regularly submitting jobs, that groups do not interfere with the work of other groups, and that research at Vanderbilt University is optimized by not wasting compute cycles. Resource management, job scheduling, and usage tracking are handled by SLURM.
SLURM supplies user functionality to submit jobs and to check system status. It is also responsible for starting and stopping jobs, collecting job output, and returning output to the user. SLURM allows users to specify attributes about the nodes required to run a given job, for example eight- or twelve-core nodes.
SLURM is a flexible job scheduler designed to guarantee, on average, that each group or user has the use of the particular number of nodes they are entitled to. If there are competing jobs, processing time is allocated by calculating a priority based mainly on the “fairshare” mechanism of SLURM. On the other hand, if no jobs from other groups are in the queue it is possible for an individual user or group to use a significant portion of the cluster. This maximizes cluster usage while maintaining an equitable sharing. You can find more details about submitting jobs through SLURM in our Getting Started section or ACCRE’s own SLURM Documentation page.
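For reference, a minimal submission script looks like the sketch below; the job name, resource requests, and output file are placeholders, and ACCRE's Getting Started and SLURM pages give the options appropriate for your group:

```bash
#!/bin/bash
# Minimal illustrative SLURM submission script; all values are
# placeholders -- see ACCRE's Getting Started and SLURM documentation.
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=2:00:00
#SBATCH --output=example-%j.out

echo "Running on $(hostname)"
```

The script is submitted with `sbatch`, and pending or running jobs can be inspected with `squeue -u $USER`, which shows the priority-ordered queue that the fairshare mechanism produces.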
For specific details about ACCRE resource allocation, please see more about ACCRE job scheduler parameters.
The cluster offers the GCC and Intel compilers, with support for MPICH, OpenMPI, FFTW, LAPACK, BLAS, GSL, Dakota, R, MATLAB, and several other packages and libraries. Please browse our comprehensive list of research computing software products.