High Performance Computing - A Practical Guide

A comprehensive guide to High Performance Computing concepts, architectures, and best practices, with hands-on examples from the ETH HPC Lab

Introduction to HPC

High-Performance Computing (HPC) aggregates computational resources to deliver the computing power needed to solve complex problems. Modern HPC systems, often called supercomputers, can perform quadrillions of floating-point operations per second.

Key Concepts

  1. Parallel Computing: Multiple processors working simultaneously
  2. Distributed Computing: Computations spread across multiple machines
  3. Scalability: Ability to handle increased workload with additional resources
  4. Performance Metrics:
    • FLOPS (Floating Point Operations Per Second)
    • Memory bandwidth
    • Network latency and throughput
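
To make the FLOPS metric above concrete, the following sketch computes the theoretical peak performance of a single node from its specification. The node parameters are placeholder assumptions, not the figures of any particular system.

#include <stdio.h>

/* Theoretical peak FLOPS of one node:
 *   sockets x cores/socket x clock (Hz) x FLOPs per cycle per core
 * All values below are illustrative assumptions.                    */
int main(void) {
    double sockets          = 2;
    double cores_per_socket = 24;
    double clock_hz         = 2.5e9;    /* 2.5 GHz                         */
    double flops_per_cycle  = 32;       /* e.g. AVX-512: 2 FMA x 8 doubles */

    double peak = sockets * cores_per_socket * clock_hz * flops_per_cycle;
    printf("Theoretical peak: %.2f TFLOP/s\n", peak / 1e12);   /* 3.84 here */
    return 0;
}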

HPC Architecture

Hardware Components

  1. Compute Nodes
    • CPUs (multi-core processors)
    • GPUs (for accelerated computing)
    • Memory (RAM)
    • Local storage
  2. Network Infrastructure
    • High-speed interconnects (InfiniBand, OmniPath)
    • Network topology
    • Bandwidth and latency considerations (see the cost-model sketch after this list)
  3. Storage Systems
    • Parallel file systems (Lustre, GPFS)
    • Hierarchical storage management
    • Burst buffers
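
A useful rule of thumb for the bandwidth and latency considerations above is the linear cost model T ≈ latency + message_size / bandwidth. The sketch below evaluates it for a few message sizes; the interconnect figures are placeholder assumptions rather than measured values.

#include <stdio.h>

/* Linear communication cost model: T = alpha + bytes / beta,
 * where alpha is the per-message latency and beta the bandwidth.
 * The interconnect figures below are illustrative assumptions.   */
int main(void) {
    double alpha = 1.5e-6;    /* latency: 1.5 microseconds          */
    double beta  = 12.5e9;    /* bandwidth: 12.5 GB/s (~100 Gbit/s) */

    double sizes[] = { 8.0, 8.0e3, 8.0e6 };   /* 8 B, 8 kB, 8 MB    */
    for (int i = 0; i < 3; i++) {
        double t = alpha + sizes[i] / beta;
        printf("%10.0f bytes -> %8.1f us\n", sizes[i], t * 1e6);
    }
    return 0;
}

Small messages are dominated by latency, large messages by bandwidth, which is why aggregating many small messages into fewer large ones usually pays off.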

Memory Hierarchy

  1. Registers: Fastest, smallest capacity
  2. Cache Levels: L1, L2, L3
  3. Main Memory (RAM)
  4. Local Storage
  5. Network Storage
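
One way to observe this hierarchy in practice is to time the same reduction with unit stride (cache-friendly) and a large stride (roughly one cache miss per element). A minimal sketch follows; the array size and strides are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum n floats visiting them with a given stride; large strides defeat
 * caching and hardware prefetching.                                    */
static double strided_sum(const float* a, long n, long stride) {
    double sum = 0.0;
    for (long s = 0; s < stride; s++)
        for (long i = s; i < n; i += stride)
            sum += a[i];
    return sum;
}

int main(void) {
    const long N = 1L << 26;                  /* 64 Mi floats = 256 MB */
    float* a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0f;

    for (long stride = 1; stride <= 16; stride *= 16) {   /* stride 1 and 16 */
        clock_t t0 = clock();
        double s = strided_sum(a, N, stride);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2ld: %.3f s (sum = %.0f)\n", stride, dt, s);
    }
    free(a);
    return 0;
}

The same number of additions is performed in both cases; the difference in run time comes entirely from how the memory hierarchy is used.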

ETH HPC Environment

Accessing ETH Clusters

# To access the Euler cluster, make sure you are inside the ETH network (use the VPN when off campus)
ssh <user_name>@euler.ethz.ch

This connects you to a login node, which is meant for basic cluster administration tasks. On the login node you should only compile code and run small test programs; large jobs must be submitted to the compute nodes through the batch system.

Software Stack Management

ETH clusters provide two software stacks: the older environment-modules-based stack and the newer LMOD-based stack.

Switch to the new LMOD-based stack for the current shell session using:

env2lmod

Practical Job Management

Job Submission Examples

  1. Basic Job
    bsub -W 01:00 -n 1 ./my_program                  # 1 core, 1-hour wall-clock limit
    
  2. Single-node OpenMP (shared memory)
    export OMP_NUM_THREADS=8
    bsub -n 8 -R "span[ptile=8]" -W 01:00 ./omp_program   # 8 cores, all on one host
    
  3. MPI with Resource Requirements
    bsub -n 16 -R "rusage[mem=4096]" -W 02:00 mpirun ./mpi_program   # 16 ranks, 4096 MB per core
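
To have something to submit with the MPI example above, here is a minimal MPI program; it can be compiled with mpicc and launched through mpirun as shown. The name mpi_program in the bsub line is just a placeholder for a binary like this one.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI program: every rank reports its rank, the total number
 * of ranks, and the node it is running on.                           */
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char node_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node_name, &name_len);

    printf("Rank %d of %d running on %s\n", rank, size, node_name);

    MPI_Finalize();
    return 0;
}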
    

Job Monitoring and Control

# List all jobs
bjobs [options] [JobID]
    -l        # Long format
    -w        # Wide format
    -p        # Show pending jobs
    -r        # Show running jobs

# Detailed resource usage
bbjobs [options] [JobID]
    -l        # Long format
    -f        # Show CPU affinity

# Connect to running job
bjob_connect <JOBID>

# Kill jobs
bkill <jobID>    # Kill specific job
bkill 0          # Kill all your jobs

Matrix Multiplication Case Study

Memory Hierarchy Impact

Modern processors execute floating-point operations very efficiently: an arithmetic operation typically completes in a few cycles, while a fetch from main memory can cost on the order of hundreds of cycles.

This makes memory access optimization, rather than arithmetic, the critical factor for performance.
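
For contrast with the blocked version below, here is a sketch of the naive triple-loop implementation. It performs the same 2N^3 floating-point operations, but the innermost loop walks down a column of B with stride N, so for large N almost every access misses the cache and the loop becomes memory-bound.

// Naive row-major implementation: C = A * B for N x N matrices.
// The innermost loop accesses B column-wise (stride N), which causes
// poor cache reuse once the matrices no longer fit in cache.
void matrix_multiply_naive(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}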

Blocked Matrix Multiplication

// Cache-friendly blocked (tiled) implementation.
// Assumes row-major N x N matrices and that C has been initialized
// (e.g., to zero) before the call, since results are accumulated into it.
static inline int min_int(int a, int b) { return a < b ? a : b; }

void matrix_multiply_blocked(const float* A, const float* B, float* C,
                             int N, int BLOCK_SIZE) {
    for (int i = 0; i < N; i += BLOCK_SIZE) {
        for (int j = 0; j < N; j += BLOCK_SIZE) {
            for (int k = 0; k < N; k += BLOCK_SIZE) {
                // Multiply one BLOCK_SIZE x BLOCK_SIZE tile; the tiles of
                // A, B, and C are small enough to stay resident in cache.
                for (int ii = i; ii < min_int(i + BLOCK_SIZE, N); ii++) {
                    for (int jj = j; jj < min_int(j + BLOCK_SIZE, N); jj++) {
                        float sum = C[ii*N + jj];
                        for (int kk = k; kk < min_int(k + BLOCK_SIZE, N); kk++) {
                            sum += A[ii*N + kk] * B[kk*N + jj];
                        }
                        C[ii*N + jj] = sum;
                    }
                }
            }
        }
    }
}

Performance Analysis

Key findings from our matrix multiplication study:

  1. Cache utilization is critical for performance
  2. Blocked algorithms can significantly reduce memory access
  3. Proper block size selection depends on cache size (see the sketch after this list)
  4. BLAS libraries implement these optimizations internally
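
Following up on the block-size finding above: a common heuristic is to pick BLOCK_SIZE so that three tiles (one each of A, B, and C) fit in the cache level you are targeting. The sketch below uses an assumed 256 KB L2 cache; substitute the real cache size of your machine.

#include <math.h>
#include <stdio.h>

/* Choose a block size so that three BLOCK_SIZE x BLOCK_SIZE float tiles
 * fit in the target cache level. The cache size is an assumption; query
 * the actual value on your system (e.g. via lscpu).                     */
int main(void) {
    double cache_bytes = 256.0 * 1024.0;                 /* assumed L2: 256 KB  */
    double elems       = cache_bytes / sizeof(float);    /* floats that fit     */
    int    block       = (int)floor(sqrt(elems / 3.0));  /* 3 tiles of size B^2 */

    /* In practice, round down to a multiple of the SIMD width, e.g. 128. */
    printf("Suggested BLOCK_SIZE: about %d\n", block);    /* ~147 here */
    return 0;
}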

Performance Optimization

Memory Optimization

  1. Cache Optimization
    • Data alignment
    • Cache line utilization
    • Stride optimization
  2. Memory Access Patterns
    • Sequential access
    • Blocked algorithms
    • Vectorization
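
To illustrate the sequential-access and vectorization items above, here is a SAXPY-style loop written so that the compiler can vectorize it: the accesses are unit-stride and the restrict qualifiers promise that the arrays do not alias.

#include <stddef.h>

/* SAXPY: y = a*x + y. Unit-stride access and 'restrict' (no aliasing)
 * let the compiler vectorize this loop at higher optimization levels
 * (e.g. -O3); a strided or indirectly indexed version usually cannot
 * be vectorized as effectively.                                       */
void saxpy(size_t n, float a, const float* restrict x, float* restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}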

Parallel Optimization

  1. Load Balancing
    • Even distribution of work
    • Dynamic scheduling
    • Work stealing
  2. Communication Optimization
    • Minimize message passing
    • Overlap computation and communication
    • Use collective operations

Code Examples

For a cache-friendly example of the memory-access techniques above, see the blocked matrix multiplication in the case study section; a sketch of the load-balancing techniques follows.
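
The sketch below illustrates load balancing with OpenMP's dynamic schedule; the work function is a stand-in for any computation whose cost varies per iteration. Compile with OpenMP enabled (e.g. -fopenmp for GCC).

#include <math.h>
#include <omp.h>
#include <stdio.h>

/* Stand-in for an irregular task: cost grows with i, so a static
 * partition would leave the threads handling the last iterations
 * with most of the work.                                          */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += sin((double)k);
    return s;
}

int main(void) {
    const int n = 2000;
    double total = 0.0;

    /* schedule(dynamic, 16): idle threads grab chunks of 16 iterations
     * as they finish, evening out the imbalance at run time.           */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += work(i);
    }

    printf("total = %f (max threads: %d)\n", total, omp_get_max_threads());
    return 0;
}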

Cloud HPC

  1. Benefits
    • Scalability
    • Pay-per-use
    • Quick provisioning
  2. Challenges
    • Network performance
    • Cost management
    • Data security

AI and HPC Convergence

  1. AI Workloads
    • Deep learning training
    • Large language models
    • AI-assisted simulations
  2. Hardware Acceleration
    • GPUs
    • TPUs
    • FPGAs

Container Technologies

  1. Singularity/Apptainer
    • HPC-specific containers
    • Security features
    • MPI support
  2. Docker in HPC
    • Development workflows
    • CI/CD pipelines
    • Testing environments

Best Practices and Tools

Development Tools

  1. Compilers
    • GCC, Intel, NVIDIA
    • Optimization flags
    • Vectorization reports
  2. Debuggers
    • GDB, DDT, TotalView
    • Memory checkers
    • Thread analyzers
  3. Profilers
    • Intel VTune
    • NVIDIA NSight
    • TAU

Performance Analysis

  1. Metrics
    • Strong scaling
    • Weak scaling
    • Parallel efficiency (see the sketch after this list)
  2. Benchmarking
    • STREAM
    • LINPACK
    • Application-specific benchmarks
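
To make the strong-scaling and efficiency metrics above concrete, the sketch below computes speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p from a set of run times; the timings are invented purely for illustration.

#include <stdio.h>

/* Strong-scaling analysis: speedup S(p) = T(1)/T(p) and
 * parallel efficiency E(p) = S(p)/p.
 * The run times below are invented for illustration only.  */
int main(void) {
    int    procs[] = { 1, 2, 4, 8, 16 };
    double times[] = { 100.0, 52.0, 27.0, 15.0, 9.0 };     /* seconds */
    int n = (int)(sizeof procs / sizeof procs[0]);

    printf("  p    T(p)    S(p)    E(p)\n");
    for (int i = 0; i < n; i++) {
        double speedup    = times[0] / times[i];
        double efficiency = speedup / procs[i];
        printf("%3d  %6.1f  %6.2f  %5.1f%%\n",
               procs[i], times[i], speedup, 100.0 * efficiency);
    }
    return 0;
}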

Resource Management

  1. Module System
    module avail            # List available modules
    module load compiler    # Load a specific module
    module list             # Show loaded modules
    module purge            # Unload all modules
    
  2. Environment Setup
    • Compiler selection
    • Library paths
    • Runtime configurations

References

  1. Introduction to High Performance Computing for Scientists and Engineers
  2. Parallel Programming for Science and Engineering
  3. The Art of High Performance Computing
  4. TOP500 Supercomputer Sites
  5. ETH Zurich Scientific Computing Wiki
  6. ETH HPC Lab Course Materials