Research Computing at ILRI
Alan Orth
Sys Admin
ILRI, Kenya
March 5, 2014

Where we came from (2003)
- 32 dual-core compute nodes
- 32 * 2 != 64
- Writing MPI code is hard!
- Data storage over NFS to “master” node
- “Rocks” cluster distro
- Revolutionary at the time!

Where we came from (2010)
- Most of the original cluster removed
- Replaced with single Dell PowerEdge R910
- 64 cores, 8 TB storage, 128 GB RAM
- Threading is easier* than MPI!
- Data is local
- Easier to manage!

To infinity and beyond (2013)
- A little bit back to the “old” model
- Mixture of “thin” and “thick” nodes
- Networked storage
- Pure CentOS
- Supermicro boxen
- Pretty exciting!

Primary characteristics
Computational capacity
Data storage
Platform
- 152 compute cores
- 32* TB storage
- 700 GB RAM
- 10 GbE interconnects
- LTO-4 tape backups (LOL?)

Homogeneous computing environment
User IDs, applications, and data are available everywhere.

Scaling out storage with GlusterFS
- Developed by Red Hat
- Abstracts backend storage (file systems, technology, etc)
- Can do replicate, distribute, replicate+distribute, geo-replication (off site!), etc
- Scales “out”, not “up”

How we use GlusterFS
    [aorth@hpc: ~]$ df -h
    Filesystem      Size  Used  Avail  Use%  Mounted on
    ...
    wingu1:/homes   31T   9.5T  21T    32%   /home
    wingu0:/apps    31T   9.5T  21T    32%   /export/apps
    wingu1:/data    31T   9.5T  21T    32%   /export/data
- Persistent paths for homes, data, and applications across the cluster.
- These volumes are replicated, so essentially application-layer RAID1

GlusterFS <3 10GbE

SLURM
- Project from Lawrence Livermore National Laboratory (LLNL)
- Manages resources
- Users request CPU, memory, and node allocations
- Queues / prioritizes jobs, logs usage, etc
- More like an accountant than a bouncer

Topology

How we use SLURM
- Can submit “batch” jobs (long-running jobs, invoke program many times with different variables, etc)
- Can run “interactively” (something that needs keyboard interaction)

Make it easy for users to do the “right thing”:
    [aorth@hpc: ~]$ interactive -c 10
    salloc: Granted job allocation 1080
    [aorth@compute0: ~]$

Managing applications
- Environment modules: http://modules.sourceforge.net
- Dynamically load support for packages in a user’s environment
- Makes it easy to support multiple versions, complicated packages with $PERL5LIB, package dependencies, etc

Managing applications
Install once, use everywhere...
    [aorth@hpc: ~]$ module avail blast
    blast/2.2.25+  blast/2.2.26  blast/2.2.26+  blast/2.2.28+
    [aorth@hpc: ~]$ module load blast/2.2.28+
    [aorth@hpc: ~]$ which blastn
    /export/apps/blast/2.2.28+/bin/blastn
Works anywhere on the cluster!

Users and Groups
- Consistent UID/GIDs across systems
- LDAP + SSSD (also from Red Hat) is a great match
- 389 LDAP works great with CentOS
- SSSD is simpler than pam_ldap and does caching

More information and contact
a.orth@cgiar.org
http://hpc.ilri.cgiar.org/
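
Appendix: a batch job sketch
The “batch” side of “How we use SLURM” is not shown in the slides, so here is a minimal sketch of what such a job might look like. It is an illustration, not the cluster's actual job template: the helper name make_blast_job, the job name, the database name "nt", and the file names are all hypothetical. Only the blast/2.2.28+ module comes from the slides. The function prints a job script that requests cores on one node and passes the granted core count to blastn.

```shell
#!/usr/bin/env bash
# Sketch: print a SLURM batch script for a threaded blastn run.
# Hypothetical names: make_blast_job, "blastn-example", database "nt", file names.

make_blast_job() {
    local cpus="$1"    # number of cores to request on one node
    local query="$2"   # FASTA file containing the query sequences

    # Unquoted heredoc: ${cpus} and ${query} expand now,
    # \${SLURM_CPUS_PER_TASK} stays literal and is expanded inside the job.
    cat <<EOF
#!/usr/bin/env bash
#SBATCH --job-name=blastn-example
#SBATCH --cpus-per-task=${cpus}
#SBATCH --output=blastn-%j.log

# Same module as in the "Managing applications" slide
module load blast/2.2.28+

# Use exactly the cores SLURM granted to this job
blastn -query "${query}" -db nt -num_threads "\${SLURM_CPUS_PER_TASK}" -out "${query}.blast.txt"
EOF
}

# Usage on the cluster (assumes sbatch is on the PATH):
#   make_blast_job 10 reads.fa > blast.sbatch && sbatch blast.sbatch
```

Because homes, data, and applications are mounted at the same paths on every node, a script like this should not need to care which compute node the scheduler picks.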