RELEASE NOTES FOR SLURM VERSION 2.2
1 December 2010


IMPORTANT NOTE:
If using the slurmdbd (SLURM DataBase Daemon) you must update this first.
The 2.2 slurmdbd will work with SLURM daemons of version 2.1.3 and above.
You will not need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before updating
any other clusters making use of it. No real harm will come from updating
your systems before the slurmdbd, but they will not talk to each other
until you do. Also, at least the first time you run the slurmdbd, make sure
your my.cnf file has innodb_buffer_pool_size equal to at least 64M.
You can accomplish this by adding the line

innodb_buffer_pool_size=64M

under the [mysqld] section and restarting mysqld.
This is needed when converting large tables over to the new database schema.

SLURM can be upgraded from version 2.1 to version 2.2 without loss of jobs or
other state information.


HIGHLIGHTS
==========
* Slurmctld restart/reconfiguration operations have been altered.
  NOTE: There will be no change in behavior unless partition configuration
  or node Features/Weight are altered using the scontrol command to differ
  from the contents of the slurm.conf configuration file.
  Preserve current partition state information plus node Feature and Weight
  state information after slurmctld receives a SIGHUP signal or is restarted
  with the -R option. Recreate partition plus node information (except node
  State and Reason) from slurm.conf file after executing "scontrol reconfig"
  or restarting slurmctld *without* the -R option.
  OPERATION              ACTION
  slurmctld -R           Recover all job, node and partition state
  slurmctld              Recover job state, recreate node and partition state
  slurmctld -c           Recover no jobs, recreate node and partition state
  SIGHUP to slurmctld    Preserve all job, node and partition state
  scontrol reconfig      Preserve job state, recreate node and partition state
  Old logic preserved node Feature plus partition state after "slurmctld" or
  "scontrol reconfig" rather than recreating it from slurm.conf. Node Weight
  was formerly always recreated from slurm.conf.
* SLURM commands (squeue, sinfo, sview, etc.) can now operate between
  clusters. Jobs can also be submitted with sbatch to other cluster(s) with
  the job routed to the one cluster expected to initiate the job first.
* Accounting through the SlurmDBD with the MySQL plugin can now support a
  default account and wckey per cluster.


CONFIGURATION FILE CHANGES (see "man slurm.conf" for details)
=============================================================
* When a node registers with slurmctld, it sends a hash of its slurm.conf so
  that slurmctld can verify the file is the same as the one it is running.
  If not, an error message is displayed. To silence this message add
  NO_CONF_HASH to DebugFlags in your slurm.conf.
* Added VSizeFactor to enforce virtual memory limits for jobs and job steps as
  a percentage of their real memory allocation.
* Added new option for SelectTypeParameters of CR_ONE_TASK_PER_CORE. This
  option will allocate one task per core by default. Without this option, by
  default one task will be allocated per thread on nodes with more than one
  ThreadsPerCore configured (i.e. no change in behavior without this option).
* Added new configuration parameters GroupUpdateForce and GroupUpdateTime.
  These control when slurmctld updates its information of which users are in
  the groups allowed to use partitions. NOTE: There is no change in the
  default behavior.
* Added new configuration parameters SlurmSchedLogFile and SlurmSchedLogLevel
  to support writing scheduling events to a separate log file.
* Added new configuration parameter JobSubmitPlugins which provides a mechanism
  to set default job parameters or perform other site-configurable actions at
  job submit time. Site-specific job submission plugins may be written in
  either C or Lua.
* MaxJobCount changed from 16-bit to 32-bit field. The default MaxJobCount was
  changed from 5,000 to 10,000.
* Added support for a PropagatePrioProcess configuration parameter value of 2
  to restrict spawned task nice values to that of the slurmd daemon plus 1.
  This ensures that the slurmd daemon always has a higher scheduling priority
  than spawned tasks. Also added support in slurmctld, slurmd and slurmdbd for
  option of "-n " to reset the daemon's nice value.
* Support has been added for the allocation of generic resources (GRES). A new
  configuration parameter, GresPlugins, has been added along with a
  node-specific parameter, Gres. There is also a gres.conf file to be
  configured on each node. For more information, see the web page.
  Support for enforcement of these allocations using Linux CGroup will be
  provided in a later release.
* Added support for new partition states of DRAIN (run queued jobs, but accept
  no new jobs) and INACTIVE (do not accept or run any more jobs) and new
  partition option of "Alternate" (alternate partition to use for jobs
  submitted to partitions that are currently in a state of DRAIN or INACTIVE).
* Added the ability to configure PreemptMode on a per-partition or per-QOS
  basis.
* Modified the meaning of InactiveLimit slightly. It will now cancel the job
  allocation created using the salloc or srun command if those commands cease
  responding for the InactiveLimit regardless of any running job steps. This
  parameter will no longer affect jobs spawned using sbatch.
* Added SchedulerParameters option of bf_window to control how far into the
  future the backfill scheduler will look when considering jobs to start.
  The default value is one day.
* Added the ability to specify a range of ports in the SlurmctldPort parameter
  for better handling of high bursts of RPCs (e.g. "SlurmctldPort=1234-1237").


COMMAND CHANGES (see man pages for details)
===========================================
* sinfo -R now has the user and timestamp in separate fields from the reason.
* Job submission commands (salloc, sbatch and srun) have a new option,
  --time-min, that permits the job's time limit to be reduced to the extent
  required to start early through backfill scheduling with the minimum value
  as specified.
* scontrol now has the ability to change a job step's time limit.
* scontrol now has the ability to shrink a job's size. Use a command of
  "scontrol update JobId=# NumNodes=#" or "scontrol update JobId=# NodeList=".
  This command generates a script to be executed in order to reset SLURM
  environment variables for proper execution of subsequent job steps.
* We have given Operators, Administrators, and bank account Coordinators (as
  defined in the SLURM database) the ability to invoke commands that
  view/modify user jobs and reservations. Previously, one had to be root to
  invoke "scontrol update JobId" for example. In addition, Administrators
  have the ability to view/modify node and partition info without having to
  become root. For more details, see the AUTHORIZATION section of the man
  pages for the following commands: scontrol, scancel and sbcast.
* Users can hold and release their own jobs. Submit in held state using srun
  or sbatch --hold or -H options. Hold after submission using the command
  "scontrol hold ". Release with "scontrol release ". Users cannot release
  jobs held by a system administrator unless the administrator uses the
  command "scontrol uhold " ("uhold" for "user hold").
* Added support for slurmctld and slurmd option of "-n " to reset the daemon's
  nice value.
* srun's --core option has been removed. Use the SPANK "Core" plugin for
  continued support.
* Added salloc and sbatch option --wait-for-nodes. If set non-zero, job
  initiation will be delayed until all allocated nodes have booted. Salloc
  will log the delay with the messages "Waiting for nodes to boot" and "Nodes
  are ready for use".
* Added scontrol "wait_job " option to wait for nodes to boot as needed.
  Useful for batch jobs (in Prolog, PrologSlurmctld or the script) if powering
  down idle nodes.
* Modified sview to display database configuration and add/remove visible
  tabs.
* Modified sview to save default configuration in .slurm/sviewrc file.
  Default settings can be set by using the menus Options->Set Default Settings
  or typing Ctrl-S.
* Modified select/cons_res plugin so that if MaxMemPerCPU is configured and a
  job specifies its memory requirement, then more CPUs than requested will
  automatically be allocated to a job to honor the MaxMemPerCPU parameter.


BLUEGENE SPECIFIC CHANGES
=========================


OTHER CHANGES
=============
* Added support for a default account and wckey per cluster within accounting.
* Added support for several new trigger types: SlurmDBD failure/restart,
  Database failure/restart, Slurmctld failure/restart.
* Support has been added for TotalView to attach to a subset of launched tasks
  instead of requiring that all tasks be attached to. This is the default
  behavior unless an option of "--enable-partial-attach=no" is passed to the
  configure (build) script.
* A web application (chart_stats.cgi) has been added that invokes sreport to
  retrieve the job usage or machine utilization statistics a user requests
  from the accounting storage database and charts the results in a browser.
* Much functionality has been added to account_storage/pgsql. The plugin is
  still very much in a beta state.
* SLURM's PMI library (for MPICH2) has been modified to properly execute an
  executable program stand-alone (single MPI task launched without srun).
* The PMI was also modified to use more socket connections for better
  scalability and to clear state between job step invocations.
* Added support for spank_get_item() to get S_STEP_ALLOC_CORES and
  S_STEP_ALLOC_MEM. Support will remain for S_JOB_ALLOC_CORES and
  S_JOB_ALLOC_MEM.
* Changed error message from "Requested time limit exceeds partition limit" to
  "Requested time limit is invalid (exceeds some limit)". The error can be
  triggered by a time limit exceeding the user/bank limit or the time-min
  exceeding the job or partition's time limit.
* Added proctrack/cgroup plugin which uses Linux control groups (aka cgroup)
  to track processes on Linux systems with this feature (kernel >= 2.6.24).
* Added the derived_ec (exit code) member to job_info_t. exit_code captures
  the exit code of the job script (or salloc) while derived_ec contains the
  highest exit code of all the job steps.
* Added the derived exit code and derived exit string fields to the database's
  job record. Both can be modified by the user after the job completes.
  See job_exit_code.html


API CHANGES
===========

Changed members of the following structs
========================================
job_info_t
        num_procs -> num_cpus
        job_min_cpus -> pn_min_cpus
        job_min_memory -> pn_min_memory
        job_min_tmp_disk -> pn_min_tmp_disk
        min_sockets -> sockets_per_node
        min_cores -> cores_per_socket
        min_threads -> threads_per_core

job_desc_msg_t
        num_procs -> min_cpus
        job_min_cpus -> pn_min_cpus
        job_min_memory -> pn_min_memory
        job_min_tmp_disk -> pn_min_tmp_disk
        min_sockets -> sockets_per_node
        min_cores -> cores_per_socket
        min_threads -> threads_per_core

partition_info_t
        state_up (new states added PARTITION_DRAIN and PARTITION_INACTIVE)
        default_part -> flags (as PART_FLAG_DEFAULT flag)
        disable_root_jobs -> flags (as PART_FLAG_NO_ROOT flag)
        hidden -> flags (as PART_FLAG_HIDDEN flag)
        root_only -> flags (as PART_FLAG_ROOT_ONLY flag)

slurm_step_ctx_params_t
        node_count -> min_nodes

slurm_ctl_conf_t
        cache_groups -> group_info (as GROUP_CACHE flag)

Added the following struct definitions
======================================
block_info_t (BlueGene-specific information)
        reason

job_info_t
        derived_ec
        gres
        max_cpus
        resize_time
        show_flags
        time_min

job_desc_msg_t
        gres
        max_cpus
        time_min
        wait_all_nodes

job_step_info_t
        gres

node_info_t
        boot_time
        gres
        reason_time
        reason_uid
        slurmd_start_time

partition_info_t
        alternate
        flags
        preempt_mode

slurm_ctl_conf_t
        gres_plugins
        group_info
        hash_val
        job_submit_plugins
        sched_logfile
        sched_log_level
        slurmctld_port_count
        vsize_factor

slurm_step_ctx_params_t
        features
        gres
        max_nodes

update_node_msg_t
        gres
        preempt_mode
        reason_uid

Changed the following enums
===========================
job_state_reason
        FAIL_BANK_ACCOUNT -> FAIL_ACCOUNT
        FAIL_QOS             /* invalid QOS */
        WAIT_QOS_THRES       /* required QOS threshold has been breached */

select_jobdata_type
        SELECT_JOBDATA_PTR   /* data-> select_jobinfo_t *jobinfo */

select_nodedata_type
        SELECT_NODEDATA_PTR  /* data-> select_nodeinfo_t *nodeinfo */

select_type_plugin_info is no longer an enum and its contents are now
mostly #defines

Added the following APIs
========================
slurm_checkpoint_requeue()
slurm_init_update_step_msg()
slurm_job_step_get_pids()
slurm_job_step_pids_free()
slurm_job_step_pids_response_msg_free()
slurm_job_step_stat()
slurm_job_step_stat_free()
slurm_job_step_stat_response_msg_free()
slurm_list_append()
slurm_list_count()
slurm_list_create()
slurm_list_destroy()
slurm_list_find()
slurm_list_is_empty()
slurm_list_iterator_create()
slurm_list_iterator_reset()
slurm_list_iterator_destroy()
slurm_list_next()
slurm_list_sort()
slurm_set_schedlog_level()
slurm_step_launch_fwd_wake()
slurm_update_step()

Changed the following APIs
==========================
slurm_load_block_info(): Added show_flag parameter
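

Porting aid (illustrative sketch, not part of SLURM): sites with tools built
against the 2.1 API may find it useful to keep the struct member renames listed
under "Changed members of the following structs" in a lookup table. Note that
num_procs maps to a different name in job_info_t than in job_desc_msg_t, so the
lookup has to be keyed by struct. The names MEMBER_RENAMES and
new_member_name() below are hypothetical helpers invented for this example.
Members that moved into a flags field (partition_info_t, slurm_ctl_conf_t) are
not simple renames and are left out.

```python
# Renames transcribed from the "Changed members of the following structs"
# list in these notes. All names defined here are hypothetical helpers.

# Renames shared by job_info_t and job_desc_msg_t.
_PER_NODE_RENAMES = {
    "job_min_cpus": "pn_min_cpus",
    "job_min_memory": "pn_min_memory",
    "job_min_tmp_disk": "pn_min_tmp_disk",
    "min_sockets": "sockets_per_node",
    "min_cores": "cores_per_socket",
    "min_threads": "threads_per_core",
}

# struct name -> {2.1 member name: 2.2 member name}
MEMBER_RENAMES = {
    "job_info_t": {"num_procs": "num_cpus", **_PER_NODE_RENAMES},
    "job_desc_msg_t": {"num_procs": "min_cpus", **_PER_NODE_RENAMES},
    "slurm_step_ctx_params_t": {"node_count": "min_nodes"},
}


def new_member_name(struct_name, old_member):
    """Return the 2.2 member name, or old_member if it was not renamed."""
    return MEMBER_RENAMES.get(struct_name, {}).get(old_member, old_member)


if __name__ == "__main__":
    # Print every rename, one per line, for use as a porting checklist.
    for struct_name, renames in MEMBER_RENAMES.items():
        for old, new in renames.items():
            print(f"{struct_name}.{old} -> {struct_name}.{new}")
```

Members that are not in the table (for example the newly added gres member)
are returned unchanged, so the helper can be applied to any member name.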