DISTRIBUTED QUEUEING
SYSTEM - 3.3.x
INSTALLATION AND MAINTENANCE MANUAL
December 12, 2000
The Distributed Queuing System (DQS)
Completion Of The Installation
The Distributed Queuing System (DQS) is an experimental batch queuing system which has been under development at the Supercomputer Computations Research Institute (CSIT) at Florida State University for the past 9 years. The first years of this activity were funded by Department of Energy Contract DE-FC0585ER250000. DQS is freely distributed to all parties with the understanding that it continues to be an evolving development system, and no warranties should be implied by this distribution.
DQS is intended to provide a mechanism for the management of requests for execution of batch jobs on one or more members of a homogeneous or heterogeneous network of computers. Facilities for load balancing, prioritization and expediting of a wide variety of computational jobs are included to assist each site in tailoring the behavior of the system to their particular environment.
CSIT will make every effort, within its resources, to assure that DQS is suitable for operation as a batch queuing system in as many site situations as possible. CSIT staff will respond to requests for assistance from those who are utilizing DQS, as well as investigating bugs, incorporating repairs and updating documentation. However, it is not possible at this time to make a formal commitment to the long-term support and enhancement of this system. Any user or organization that decides to adopt DQS will be assuming all risks from that undertaking.
DQS and future enhancements can be obtained by Internet ftp from "ftp.csit.fsu.edu".
Announcements of new releases and improvements will be emailed to anyone who contacts CSIT to add their name to the announcement list dqs-announce. This is accomplished by filling out the online form at URL:
http://mailer.csit.fsu.edu/mailman/listinfo/dqs-announce
A name can be removed from the announcement list by visiting the same online form and using the "Edit Options" selection with your email address.
Bug reports should be sent to: dqs@csit.fsu.edu
DQS user information exchange is provided by Rensselaer Polytechnic Institute. To add your name and email address to this list:
Send email to dqs-l@vm.its.rpi.edu
Leave the "subj:" line blank
Send a one line message: SUBSCRIBE dqs-l 1stname Lastname
To remove name and email address:
Send email to dqs-l@vm.its.rpi.edu
Leave the "subj:" line blank
Send a one line message: UNSUBSCRIBE dqs-l 1stname Lastname
Where 1stname is the user's first name and Lastname is the user's last name.
The release of DQS 3.0 was a major departure in the DQS evolution. It was based on several years' experience with DQS 2.1 in a variety of computing environments. Although it retained many features of the 2.1 version, DQS 3.0 was a major restructuring and re-coding of the basic system, with a major focus on supporting parallel (clustered) computation on two or more UNIX based hardware platforms. The newly emerging Message Passing Interface (MPI) standard was considered throughout the DQS 3.0 implementation.
In early 1995 DQS 3.0-3.1 was subjected to extensive testing and the contributions of numerous users were incorporated to produce DQS 3.1.2, which was released in March and augmented over a period of six months to become DQS 3.1.2.4. With the exception of some minor "improvements" this system has been fairly stable and in operational use for nine months.
Operational experience at CSIT and other large production sites revealed several features which needed to be added or adapted to make the system easier to use or to manage. Several sites have provided the DQS development team with valuable insight, advice, and code which has been incorporated into this new release. Although the user interfaces have not been changed (albeit "enhanced"), the internals of this system have undergone considerable change, hence the naming of this release as 3.3.1 instead of 3.1.2.5. We took this opportunity to restructure the documentation (one more time!) in response to numerous requests to make it easier to access. In addition to numerous bug-fixes for DQS 3.1.2.4 provided by several very helpful sites (see "acknowledgments"), a number of new features have been added to the system.
The "new" features of DQS 3.3.2 tend to be somewhat invisible to the DQS user. The bulk of this effort has been focused on further "bulletproofing" the system to minimize, if not eliminate, the unreported termination of daemons, utilities and jobs. With this in mind we list here the major changes that appear in DQS 3.3.2:
qconf -Mq filename option added
A new option was added to qconf to specify a file for use in modifying the configuration of a specific queue. The qname and qhostname entries in the file must match those of the queue to be modified.
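As a sketch of this option (the file name, the modified value, and the exact keyword spellings are illustrative; they follow the description above rather than a verified file format), an administrator might prepare and apply a modification file like this:

```shell
# Build a queue modification file; qname and qhostname must match the
# existing queue exactly (the names and new priority here are illustrative).
cat > /tmp/ibm11.mod <<'EOF'
qname ibm11
qhostname ibm11.csit.fsu.edu
priority 5
EOF
# The modification would then be applied with:
#   qconf332 -Mq /tmp/ibm11.mod
grep -c '' /tmp/ibm11.mod    # counts the lines written: 3
```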
Logical OR support added to resource specification for qsub
Logical OR can now be used in the resource specification to qsub to define what resources are needed by the job.
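A sketch of such a request follows; the resource names and the OR operator spelling are illustrative assumptions, so check the qsub reference pages for the exact syntax accepted by your release:

```shell
# Request a job slot on either of two architectures; the operator spelling
# below is illustrative, not authoritative.
REQUEST='arch.eq.sun4 || arch.eq.linux'
# The submission would then be:
#   qsub332 -l "$REQUEST" myjob.sh
echo "resource request: $REQUEST"
```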
More Memory Leak Fixes
More memory leaks were fixed in the qmaster and dqs_execd. These leaks were in library code that also is used by the ancillary programs.
Port to FreeBSD
DQS 3.3.2 now runs on FreeBSD.
User Job Limits fixed to search for correct Process Group ID
On some systems, a job hitting the cpu_job_limit would cause the dqs_execd to signal itself to terminate. User job limits are now calculated for the Process Group ID of the job's shepherd process. All processes associated with the job are terminated when the sum of their CPU usage exceeds the cpu_job_limits specified for the queue. It should be noted that jobs running under MPI or PVM do not run under a single Process Group ID; only those processes running on the same host as the shepherd are terminated.
CPU Job Limits
New hard and soft CPU Job Limits provide for limiting the resource utilization of all processes that make up a job.
Year 2000 Compliance
DQS is now Year 2000 ready.
System Log Interface
Critical errors are now logged to both the DQS log files and the System Log.
Decimal Notation For Queue Configuration
The queue configuration parameters are now entered in decimal notation instead of hexadecimal when you execute the qconf332 program. They are also displayed in decimal with qstat332 -ext.
Many Memory Leaks Fixed
There were a number of memory leaks in the qmaster and dqs_execd that were fixed. These leaks were in library code that also is used by the ancillary programs.
Protocol Modified
The internal protocol was modified. This is a major departure from 3.2.7 because the 3.2.7 protocol is incompatible with the 3.3.x protocol. Backward compatibility was built in so that the 3.3.x daemons can start up on 3.2.7 configuration files; the files will be immediately converted to the new format and will then be incompatible with the 3.2.7 daemons.
Environment Overrides for log_file and err_file
The log_file and err_file names can now be overridden by specifying the new value as an environment variable. The new variable names are LOG_FILE and ERR_FILE.
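For example (the paths below are illustrative choices, not DQS defaults), a daemon could be pointed at alternate files before start-up:

```shell
# Redirect DQS logging for this shell and its children; both paths are
# illustrative.
LOG_FILE=/var/tmp/dqs_log;  export LOG_FILE
ERR_FILE=/var/tmp/dqs_err;  export ERR_FILE
# A daemon started now would inherit the overrides, e.g.:
#   /usr/local/DQS/bin/dqs_execd332
echo "$LOG_FILE $ERR_FILE"
```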
New make clean target
A new target, clean, was added to the makefiles.
The DQS 3.3.1 documentation was reorganized... again. The POSIX specification has been extracted from the document body and is now an appendix. The reference manual pertains only to the DQS 3.3.x implementation, and all confusing references to "standard" and "non-standard" options have been removed.
The documentation consists of three principal chapters and three appendices. The Installation and Maintenance Manual is primarily aimed at the DQS system administrator. The User Guide is obviously targeted at the DQS user community. Both users and administrators will use the Reference Manual. Appendix A contains a catalog of all DQS error messages with information on methods for dealing with each error. Appendix B contains the POSIX specification on which DQS 3.3.x is based. Appendix C contains several miscellaneous sections, including installation variants and system tuning guidelines.
The documentation is supplied in several forms:
DQS is installable on almost every existing UNIX platform. This process thus must cope with many differences and idiosyncrasies of the varied hardware configurations and Operating Systems. DQS 3.3.2 attempts to detect and resolve these differences to minimize the need for operator actions, but with even the simplest installation there will be a need for some input from the DQS administrator.
DQS 3.3.2 can be obtained by ftp download from ftp.csit.fsu.edu/pub/dqs. The "README.332" file in that directory will indicate which version should be downloaded. To reduce download bandwidth, improvements and bug-fixes will be distributed on a file-by-file replacement basis rather than requiring a complete download of the DQS 3.3.2 system. For this reason we do not envision distributing systems such as DQS 3.3.2.1 in the future. (But you never know.)
DQS 3.3.2 is distributed as a compressed TAR file. After this file is uncompressed it is recommended that the DQS system be extracted (with TAR) into a directory which is accessible by all operating systems for which DQS will be built. The DQS installation process will create a separate directory in the sub-directory .../DQS/ARCS for each different architecture/operating system.
The installation scripts produce a list of defaults that will be used for the installation. The user is asked to review this list to ensure that it meets their requirements. The default cell and default initial queue names are derived from the host-name of the machine on which the installation process is being executed. If the installation is being executed as "root" the system will be setup to use "reserved" ports for communication, otherwise "non-reserved" ports will be utilized. If the installation is being run as non-root then the user doing the "make install" will be automatically added as a manager.
The installation proceeds in stages, as follows:
An optional approach is available to the knowledgeable DQS administrator that omits all interactions. This requires the editing of three DQS files used during the make process.
The X-window based DQS graphical interface is installed as a separate step. Change directory to DQS/XSRC and follow the instructions provided in the file named INSTALL. The X-Window interface is being restructured and will be integrated fully in future DQS releases.
The installation process creates a series of directories and subdirectories and two crucial files, the "conf_file" (configuration file) and the "resolve_file". If the system installation was completed correctly, the conf_file will contain information which will be read by every DQS binary file when it is started. This includes the DQS daemons, qmaster and dqs_execd, and the DQS interface "utilities" qsub, qdel, qmod, qconf, qstat, qrls, qhold and qmove. It is best that these two files be accessible through a cross-mounted NFS/AFS/DFS file system. If that is not possible, then the administrator must ensure that identical copies of these files are present on each host.
Once the binaries have been moved to their execution directory (we will use the path "/usr/local/DQS/bin" for all future examples), the qmaster can be started. If during the installation process the administrator chose "FALSE (NO)" when asked the question "Reserved ports?", then the /etc/services file will have been updated (by a root user) with the three entries suggested by the config process (or a rational alternative). The conf_file will contain the names of these entries along with the DEFAULT_CELL name, which must match the first entry on the first (non-commented) line in the resolve file. The administrator should make a visual check of these three crucial files, conf_file, resolve_file and /etc/services, to make sure that they conform to these requirements.
QMASTER
<The qmaster manages all resources for a single DQS cell.>
Once satisfied that all is well, the qmaster can be started by typing "/usr/local/DQS/bin/qmaster332". On this first occasion, it would be useful to check that the process has actually started by viewing the UNIX process status (ps). If the qmaster name does not appear in the host's process list, the administrator should check the "err_file" in the qmaster spool directory (chosen during the DQS config stage; default: "/usr/local/DQS/common/conf").
If the qmaster appears to be operating, it can be tested by executing the command "/usr/local/DQS/bin/qstat332 -f" on the same host where the qmaster332 is running. A normal response to this command would be one or more lines of output describing the status of the current queues. For brand new installations this will be simply a header with no other lines. Error messages may appear if things are not quite "in harmony"; refer to "DQS Error Messages" and "Solving Installation Problems" for assistance in this case.
DQS_EXECD
<The dqs_execd is a DQS daemon which resides on each host which has at least one queue and will be executing DQS managed jobs.>
If the "qstat332" command succeeds, it is time to start a dqs_execd, which actually manages a particular queue. For this test, on the same host where the qmaster "dwelleth", type the command "/usr/local/DQS/bin/dqs_execd332". Again the UNIX process status should be examined (ps). If the dqs_execd is not executing, refer to the err_file for significant error messages. Consult "DQS Error Messages" and "Solving Installation Problems" for assistance.
Executing the command "qconf -aq" (queue configuration, add queue) will produce an edit session with the default editor on that host. If the "qconf" command yields an error message and shuts down, consult "Solving Installation Problems". A queue "template" will be displayed which can be modified using the editor commands. For this test the queue name and queue host name should be changed to match the name of the host on which the dqs_execd is executing. We will deal with the remaining entries later (see "The Queue Configuration").
Q_name            ibm11
hostname          ibm11.csit.fsu.edu
seq_no            0
load_masg         1
load_alarm        175
priority          0
type              batch
rerun             FALSE
quantity          1
tmpdir            /tmp
shell             /bin/csh
klog              /usr/local/bin/klog
reauth_time       6000
last_user_delay   0
max_user_jobs     4
notify            60
owner_list        NONE
user_acl          NONE
xuser_acl         NONE
subordinate_list  NONE
complex_list      NONE
consumables       NONE
s_rt              2147483647
h_rt              2147483647
s_cpu_job         2147483647
h_cpu_job         2147483647
s_cpu             2147483647
h_cpu             2147483647
s_fsize           2147483647
h_fsize           2147483647
s_data            2147483647
h_data            2147483647
s_stack           2147483647
h_stack           2147483647
s_core            2147483647
h_core            2147483647
s_rss             2147483647
h_rss             2147483647
When the queue name and queue host name have been modified, exit the editor in the normal manner (ESC-ZZ for vi or CTRL-X CTRL-C for emacs). This will trigger the qconf utility to parse the submitted definition and, if no syntactical errors are discovered, will create the requested queue.
Queue Name   Queue Type   Quan   Load   State
----------   ----------   ----   ----   ------
ibms30       batch        0/1    0.10   dr DISABLED
Note that the status entry in the right column of the qstat output will display the word "DISABLED". All new queues are initiated in "DISABLED" state. To enable the queue we need to invoke another DQS command "/usr/local/DQS/bin/qmod332 -e <queue name>" (modify queue, enable the queue name given here as <queue name>).
Again execute the "/usr/local/DQS/bin/qstat332 -f" command:
Queue Name   Queue Type   Quan   Load   State
----------   ----------   ----   ----   ------
ibms30       batch        0/1    0.10   dr UP
TEST SCRIPT
Once the qmaster and at least one dqs_execd are running, a simple test can be performed. In the .../DQS/tests directory is a collection of sample scripts. The entire contents of this directory should be copied to a user (non-root) directory owned by the administrator. As a first test, change directory to this non-root directory and type "/usr/local/DQS/bin/qsub332 dqs.sh". This will submit the simple script to DQS:
#!/bin/ksh
#$ -l qty.eq.1
#$ -N UTESTJOB
#$ -A dummy_account
#$ -cwd
echo 'we are now doing something else'
printenv
sleep 30
echo 'end of script'
A message should appear in response to the qsub332 command:
"your job 1 has been submitted".
After 30 seconds the job should complete and in the directory where the job was submitted two output files should appear:
UTESTJOB.e1.25674 and UTESTJOB.o1.25674
The title UTESTJOB was established by the DQS directive "#$ -N UTESTJOB". The next field (either e1 or o1) contains the job number preceded by the type of file. The stderr file for the job will have an "e" in that position and the stdout file will have an "o". The number at the end of the file name represents the PID the job had when it executed. The UTESTJOB.e1.25674 file should be zero length. If not, examine its contents for the cause of any error. The stdout file should begin with the line 'we are now doing something else', followed by a display of the user's environment, and ending with the line 'end of script'.
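The naming scheme can be sketched directly from the values in this example:

```shell
# Compose the two output file names exactly as described above.
JOBNAME=UTESTJOB    # from the "#$ -N UTESTJOB" directive
JOBID=1             # job number reported by qsub332
PID=25674           # PID of the executing job
echo "${JOBNAME}.e${JOBID}.${PID}"    # stderr file: UTESTJOB.e1.25674
echo "${JOBNAME}.o${JOBID}.${PID}"    # stdout file: UTESTJOB.o1.25674
```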
If the test script completes correctly, hosts can be added and additional queues created and more complex job tests can be submitted. If the "Quick Install" method was chosen the time has probably arrived to plan an operational cell organization and setup resource files and queues. In order to layout an effective system it is important to understand how DQS is constructed, the capabilities of its components and how they may be tailored for a specific site.
A basic DQS system consists of at least one computer host which is running the qmaster program and at least one instantiation of the dqs_execd daemon which manages the actual execution of jobs on the host which they `inhabit'. All of the resources managed and monitored by a qmaster are considered to be a "cell".
Within a cell there are three classes of programs operating. These are the qmaster daemon, the dqs_execd daemon, and the DQS utilities, which include qsub, qstat, qmod, qconf, qdel, qhold and qrls.
The second test compares the user's request for site-defined resources against those actually present in the system at the moment. Unless the submitted job carries the DQS directive "-F" (force the acceptance of the resource request), the job will be rejected if one or more of the requested resources do not exist. (Please note that this test verifies that a resource is present in the system, not whether or not it is in use by another job!)
In the previous section a diagram of the elements constituting a "cell" was displayed. A DQS332 site may have several independent cells, or they may be aggregated into a common operating environment:
This example displays three cells, A, B and C, each managed by its own qmaster: QM-A, QM-B or QM-C. The hosts are labeled A1 and A2 for Cell-A, B1, B2 and B3 for Cell-B, and C1 and C2 for Cell-C. For this discussion we will assign the qmasters to a separate host in each cell. QM-A will thus be on host A0, QM-B on host B0 and QM-C on host C0.
Communications among the various hosts in a cell and between cells is structured by the inclusion of a host within a qmaster's host list. In the above example qmaster QM-A has four hosts in its table, A0 (its own host), A1, A2 and B0 (the qmaster host for cell B). Instead of a completely symmetrical inter-cell arrangement here we have chosen to not have QM-A linked with QM-C. Thus neither of these qmasters will have the other cell's qmaster host in its own hosts table.
An option, which is less secure, is to permit a host from one cell to contact the qmaster in another cell (as shown by path [c]). In this case host B3 could execute utilities and perhaps launch jobs in Cell-C as well as Cell-B. Even without this "sneak path", hosts in cells A and C can interrogate the status of queues in Cell-B, if the user permissions allow such an activity.
Note, once again, that a host in a cell may have no queues assigned to it for execution, or it may have one or more queues assigned to it. It is also quite common to have a dqs_execd running on the same host as the qmaster daemon. The DQS332 utilities can be executed on any host in a cell, regardless of whether that host is running a dqs_execd daemon.
The first level of security within DQS is then a "trust" relationship among a cell's hosts and between each cell's qmasters. The next level of security is the level of permissions established by a qmaster's "manager" and "administrator" lists. The third level of security is defined by the specific user permissions or exclusions defined for each queue. Certain activities are permitted to a DQS administrator or manager which a queue owner may not invoke. Among them are deleting the queue itself or changing its configuration. A queue owner and the DQS managers may perform activities, such as queue suspension, which of course the average user is prohibited from doing.
To manage system security, queues, jobs and user access, a number of directories are created during the startup process. The DQS administrator will normally not have to deal with these directories nor their contents. However when all DQS files cannot or should not be cross-mounted it is important that the function of these elements are understood so that they can be placed correctly in the system.
Shared & Local
As indicated in the installation instructions, the easiest method for managing a DQS is to have all the system files and directories mounted by NFS/AFS or DFS on all hosts. The one exception is the directories containing the binaries for the DQS executables, which, of course, should only be shared by hosts with identical architecture and operating system configurations. A knowledgeable administrator may wish to make changes directly to the contents of one of these directories; where appropriate a hint or two are provided to assist the system manager. A typical directory tree will look somewhat like the following (names ending in "/" are directories, all others are files):

usr/local/DQS/
    bin/
    man/
    common/
        doc/
        conf/
            conf_file
            err_file
            key_file
            log_file
            resolve_file
        dqs_execd/
            host-A1/
                pid_file
                exec_dir/
                    complex_file
                    script_file
                    consumables_file
                job_dir/
                    host_file
                    job1
                    job2
                rusage_dir/
                    current_usage
                    job1
                    job2
                tid_dir/
                    tid_#xxxx
                    tid_#xxxx
            host_An/
                pid_file
                exec_dir/
                    complex_file
                    script_file
                    consumables_file
                job_dir/
                    host_file
                    job3
                    job4
                rusage_dir/
                    current_usage
                    job3
                    job4
                tid_dir/
                    tid_#xxxx
                    tid_#xxxx
        qmaster/
            qmaster_hostname/
                pid_file
                stat_file
                common_dir/
                    generic_queue
                    host_file
                    man_file
                    op_file
                    seq_num_file
                job_dir/
                queue_dir/
                    queue_A1
                    core
                    queue_A2
                    queue_a3
                tid_dir/
                    tid_#xxxx
                    tid_#xxxx
Four system files are classed as "should be shared by all hosts, if at all possible". They are:
conf_file
--- This file is created during the DQS332 "config" step of the installation or system update. It contains the system-wide configuration which is read by the qmaster, the dqs_execd and all DQS utilities when they start up. If it is necessary to make changes to this file, the qmaster and all dqs_execd's should be shut down and restarted after the changes are complete, so that they will possess the latest configuration. Failure to observe this step may result in bizarre and unexplained behavior of the system, if not an outright collapse.

If this file cannot be cross-mounted by all hosts, then an IDENTICAL COPY of this file needs to be distributed to all hosts before restarting the qmaster or dqs_execd daemons or any of the command utilities.

The location from which this file is read is "hard-wired" into the compiled DQS code by the #define CONF_FILE statement in the dqs.h file, which is also created by the DQS "config" step. It is important to understand that the default installation setup places the conf_file in the "/usr/local/DQS/common/conf" directory, which is also used as the default location for the qmaster and dqs_execd spool directories. While those directories can be relocated by changing the conf_file and restarting the daemons, the location of the resolve_file and conf_file can only be changed by modifying "dqs.h" with an editor or by re-executing the "config" program.
The following are the initial entries in the conf_file, with a description of each line's effect on the system.
QMASTER_SPOOL_DIR /usr/local/DQS/common/conf
This parameter points to the starting directory from which the qmaster's sub-directories are created. While at some sites with several cells the resulting tree can be shared by multiple qmasters, it is only necessary that the qmaster have access to the sub-directories for itself. This tree appears above as "qmaster/QM-A".
EXECD_SPOOL_DIR /usr/local/DQS/common/conf
This parameter defines the starting directory from which all of the dqs_execd's in the cell will find their individual queue management directories. In the default DQS setup all dqs_execd's in a cell use this same directory tree, terminating in their own specific set of sub-directories. This is illustrated in the preceding diagram by "../dqs_execd/host-A1".
DEFAULT_CELL user-network
The system-wide, unique name for a given cell. This can be any arbitrary ASCII string and is defaulted to the qmaster's host domain name during the installation process. If this name is changed the corresponding string in the "resolve_file" must be changed accordingly and vice-versa.
RESERVED_PORT TRUE
This parameter indicates that all daemons and utilities in a cell will be using UNIX reserved ports for socket communications. UNIX system port numbers from 0 to 1023 are designated as "reserved". If this parameter is set to TRUE then all of the DQS332 programs MUST execute with root ownership. If this parameter is set to FALSE then the /etc/services port numbers for DQS332 services must be greater than 1024.
DQS_EXECD_SERVICE dqs33_execd
Any arbitrary ASCII string can be used to identify the tcp port number to be used when the qmaster or the DQS utility "dsh" is communicating with the dqs_execd. The only requirement is that this name must be unique among all names in the /etc/services file.
QMASTER_SERVICE dqs33_qmaster
Any arbitrary ASCII string can be used to identify the tcp port number to be used when the dqs_execd or DQS utilities are communicating with the qmaster. The only requirement is that this name must be unique among all names in the /etc/services file.
INTERCELL_SERVICE dqs33_intercell
Any arbitrary ASCII string can be used to identify the tcp port number to be used when one qmaster is communicating with another qmaster. The only requirement is that this name must be unique among all names in the /etc/services file.
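Taken together, a non-reserved-port installation might carry /etc/services entries such as the following; the port numbers here are illustrative assumptions, and any unused TCP ports above 1024 will serve:

```
dqs33_execd      3608/tcp
dqs33_qmaster    3609/tcp
dqs33_intercell  3610/tcp
```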
KLOG /usr/local/bin/klog
The re-authentication process in AFS systems will use the klog program. This entry is only used when AFS support was selected during DQS installation.
REAUTH_TIME 60
If AFS has been selected, all daemons and executing jobs will re-authenticate at intervals of this many seconds.
MAILER /bin/mail
All jobs can select options to send brief "job startup", "job end" and "job abort" messages to one or more designated users. In addition the DQS332 system will send mail messages to the administrator in the event of extraordinary system events.
DQS_BIN /usr/local/DQS/bin
The qmaster, dqs_execd and all ancillary programs locate their binaries in the BIN_DIR established during the "config" step of installation. This entry is set by that step, and acts as a "place-holder" for that target directory. This parameter is used, however, by the parallel queue management system. If the administrator wishes, this parameter can be changed to point to a different directory where PVM, P4, TCGMSG and MPI support programs may reside. Doing so will not affect the continued use of the BIN_DIR for the remaining DQS executables.
ADMINISTRATOR admin@host_machine
On startup of the qmaster this entry is used to identify the primary DQS administrator for this cell. This also forms the email address used to send system error messages.
DEFAULT_ACCOUNT GENERAL
Any arbitrary ASCII string (without separator characters such as blanks, periods, commas) can be used as an account identifier. Each job submission can provide its own account identifier, which overrides this default string. No validation is performed on this or the user submitted account name string. When a job terminates a record is created from hardware and software usage data. The "account string" is appended and the record is appended to the qmaster's "act_file".
LOGMAIL FALSE
By default none of the mail generated by DQS, either to users or the system's managers, is logged. Setting this parameter to TRUE will cause the qmaster to create a mail log file, where all system emails are recorded and time-stamped.
DEFAULT_RERUN FALSE
It is our sincere hope to have the rerun feature of DQS implemented in future versions. In DQS332 this parameter is ignored.
DEFAULT_SORT_SEQ_NO FALSE
During the qmaster's scheduling process two major steps occur. First, the jobs themselves are sorted according to their submitted priorities and internal policy criteria. Second, all of the available queues are scanned to find one which suits the needs of the first job to be scheduled. The ordering of this queue scanning process can be changed by this parameter. When this parameter is FALSE all of the queue entries are sorted in increasing order of their host's usage data (as reported by the dqs_execd). Thus the first queue examined will be the least "busy" queue, in an effort to spread the workload across the system.
If this parameter is set to TRUE the queues are examined in the order of the sequence number assigned by the administrator in each queue configuration. Many sites use this method to ensure that their most powerful hosts are scanned first, by assigning those hosts very low sequence numbers to the corresponding queues.
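The two orderings can be illustrated with a toy table of queues (the queue names, loads and sequence numbers below are invented):

```shell
# Columns: queue name, host load average, administrator-assigned seq_no.
cat > /tmp/queues.txt <<'EOF'
fast01 0.75 1
slow02 0.10 5
med03 0.40 3
EOF
# DEFAULT_SORT_SEQ_NO FALSE: scan the least-loaded host first
sort -n -k2 /tmp/queues.txt | head -1
# DEFAULT_SORT_SEQ_NO TRUE: scan the lowest seq_no first
sort -n -k3 /tmp/queues.txt | head -1
```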
SYNC_IO FALSE
In multi-host systems utilizing NFS mounted files it is possible for I/O actions to become disordered in their results. The ordering of lines of output sent to stdout or stderr can become totally confused. DQS332 is supposed to have a feature in its "process shepherd" to ensure that all stdout and stderr output is properly time sequenced, even when multiple SLAVE processes are involved. In the initial DQS332 release this feature is not active.
USER_ACCESS ACCESS_FREE
This feature for differentiating levels of access for users or classes of users is not implemented in DQS332.
LOGFACILITY LOG_VIA_COMBO
Many system messages are generated to aid in the maintenance and diagnosis of DQS operation. Three files are used for this activity: the "err_file", the "log_file" and the "syslog_file". Depending on the level of attention required, messages are directed to one of these files. All messages with levels of ERR, CRIT, or WARNING are always sent to the err_file. Messages with levels of INFO, WARNING or NOTICE can be sent to the system log or the normal activity log file. The normal mode is to use both the system log and the normal log file.
LOGLEVEL LOG_INFO
Information is logged depending on the level assigned within the DQS. In increasing order of severity these are LOG_INFO, LOG_NOTICE, LOG_WARNING, LOG_ERR, LOG_CRIT, LOG_ALERT, LOG_EMERG. Setting the LOGLEVEL parameter establishes the minimum level of messages to be recorded. A setting of LOG_INFO ensures that all messages will appear in the "log_file".
MIN_UID 10
MIN_GID 10
For security reasons it is desirable to establish a minimum user and group identifier (uid or gid) which will be permitted to execute any of the DQS utilities. The qmaster and dqs_execd, of course, normally operate at root level. The recommended setting is "10" for these parameter values, as most UNIX critical processes run with uid and gid values below "10". It is strongly recommended that these default values be retained.
Attempts to run DQS utilities such as qsub, qalter, qstat, etc. from a uid or gid below these values will fail, which is the "correct", albeit confusing (to new system managers), behavior of DQS.
MAXUJOBS 10
There are a number of DQS "system policy" parameters available to the DQS332 administrator. One of these is a system-wide limit on the total number of jobs a user may have considered for scheduling at any one time. This is not a limit on the total number of jobs a user can have queued up in the system, but it does instruct the qmaster not to consider more than MAXUJOBS for a user during a scheduling pass. The effect of this limit can become quite subtle. For example, if a limit of 10 is established and the user submits 100 jobs, they will be ordered in sequence of their priority and submission time. If the first ten of these jobs require system resources not currently available, they cannot be scheduled; neither will any following jobs, even those that need only resources which are actually available. An additional user limit can be found in each queue configuration.
OUTPUT_HANDLING LEAVE_OUTPUT_FILES
When a job is started by the qmaster it may produce large stdout or stderr files. The writing of these files to a remote, NFS-mounted file system can have negative impacts on system performance. In some cases, retaining these files on a host's local filesystems can prevent network congestion and minimize I/O delays for the running job. DQS332 provides three options for handling these output files. The default, LEAVE_OUTPUT_FILES, causes the stdout and stderr files to be left in the working directory established by the user's "qsub" script.
This parameter can be changed to LINK_OUTPUT_FILES. In this case the administrator must create a special file in one or all the dqs_execd spool directories. The name of this file is defaulted to "netpath" during the DQS "config" step. This default name may be changed by the administrator in the dqs.h file if they are prepared to recompile the entire DQS332 system. The "netpath" file should contain one ASCII line defining the fully qualified network path of the target directory into which the stdout and stderr files are to actually be placed.
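As a sketch, creating a "netpath" file amounts to writing that single line. The spool path and target directory below are hypothetical stand-ins; substitute your site's actual dqs_execd spool directory and NFS target.

```shell
# Sketch only: SPOOL_DIR stands in for a dqs_execd spool directory, and the
# target path below is a hypothetical NFS directory -- substitute your own.
SPOOL_DIR=$(mktemp -d)
echo "/net/fileserver/scratch/dqs_output" > "$SPOOL_DIR/netpath"
# The file must contain exactly one ASCII line: the fully qualified directory
# into which the stdout and stderr files are to be placed.
cat "$SPOOL_DIR/netpath"
```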
If the parameter is set to COPY_OUTPUT_FILES the DQS332 process "shepherd" creates temporary standard output and standard error files local to the host executing the job. A special "copy" process is started which wakes up periodically (set by the hard-wired COPY_FILE_DELAY in the dqs.h file), and copies the current contents of those files to their actual destination.
ADDON_SCRIPT NONE
At the conclusion of a user's job, and in the working space of that job, it is sometimes necessary to conduct system cleanup tasks. This is particularly true of parallel processing tasks that might leave "orphan" daemons running, in the event of unplanned process termination. A system script maintained within the DQS can be created and invoked at the conclusion of EVERY user job. This parameter must then contain the fully qualified path-name to this script file.
ADDON_INFO NONE
When OUTPUT_HANDLING is set to anything other than LEAVE_OUTPUT_FILES, the system administrator may wish to maintain a diagnostic awareness of the "process shepherd" handling of the copying or linking of a user's stdout and stderr files. If this parameter is set to something other than NONE, the parameter string should be a fully qualified path to a file containing an ASCII string to be appended to the "stdout" file along with other job information.
LOAD_LOG_TIME 30
Upon startup the dqs_execd uses this parameter (specified in seconds) as the minimum period at which it delivers system usage statistics to the qmaster.
STAT_LOG_TIME 600
Various system statistics, beyond the host usage provided by the dqs_execd daemons, are gathered periodically, based on the value of this parameter (specified in seconds).
SCHEDULE_TIME 60
The qmaster scans the cell's job queue after every new job is submitted to the system or upon termination of a running job. Absent these occurrences the qmaster will trigger a scheduling pass of the jobs based on this parameter (specified in seconds).
MAX_UNHEARD 90
The qmaster does not poll other daemons for their status. Instead it updates the queue status for each dqs_execd which reports in. If a dqs_execd fails to report to the qmaster within this threshold (seconds) the qmaster will mark all queues managed by that dqs_execd as "status UNKNOWN". This status is updated every interval, and will be changed from UNKNOWN back to UP once the dqs_execd succeeds in updating the qmaster.
ALARMS 3
ALARMM 4
ALARML 5
The admonition to "avoid changing these parameters" in the installation instructions is well founded. These parameters control the amount of time permitted before the UNIX system interrupts an attempt at inter-host communication. The "ALARMS" value is the time in seconds before a DQS utility such as qsub or qmod is interrupted. The message "Alarm Clock Shutdown" will appear for the user, indicating that the utility could not contact the qmaster within "ALARMS" seconds. The "ALARMM" parameter sets a similar limit on dqs_execd<->qmaster communication attempts. "ALARML" is the longest period established for inter-process interchange attempts, and is used to control qmaster<->qmaster communications.
In systems where the qmaster host is also running other jobs, or where the network interconnect can become congested, it is possible for one or more communication attempts to fail due to an "ALARM" time-out. If the err_file contains frequent "ALARM CLOCK Shutdown" warnings, or utility execution fails often with similar error messages, the three "ALARM" parameters should be increased. These values should nevertheless be kept as small as practical, to prevent a failing DQS element from tying up the host's tcp/ip interface.
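For example, a site seeing frequent time-outs might raise the values modestly in the conf_file. The numbers below are illustrative, not recommendations:

```
ALARMS 6
ALARMM 8
ALARML 10
```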
resolve_file --- This file is also created during the DQS "config" process. It is the equivalent of a combination of the UNIX "resolv.conf" and "hosts.equiv" files for managing network security. The default resolve_file is:
# NOTE! blank lines NOT permitted
#
# NOTE! fields must be separated by one(1) AND ONLY one space
#
# 1st field = cell_name
# 2nd field = primary qmaster
# 3rd field = primary qmaster alias
# 4th field = secondary qmaster
# 5th field = secondary qmaster alias
user-network QM-A0 QM-A0.user.com NONE NONE
The comment lines direct the DQS manager as to the format of new entries or entry changes. Some aspects of this file need further explanation.
err_file --- The qmaster, dqs_execd and all DQS utilities may originate error messages which are directed to a hard-wired filename "err_file". This name is created during the DQS "config" step and implanted in the "dqs.h" include file in the ../DQS/SRC directory. The installation process assumes that all DQS332 programs will have write access to the path name which appears as QMASTER_SPOOL_DIR in the conf_file. If this path name is inappropriate for ALL DQS programs the administrator may choose to change the definition of ERR_FILE in the include file "dqs.h". This will require recompilation of the entire DQS332 system.
As an alternative, the administrator may choose to let each program write to its own "err_file" and gather and collate all the files when it is necessary to examine error information. In this case, however the path-name accessible by each host must be identical to the QMASTER_SPOOL_DIR name.
log_file --- The qmaster, dqs_execd and all DQS utilities may originate log messages which are directed to a hard-wired filename "log_file". This name is created during the DQS "config" step and implanted in the "dqs.h" include file in the .../DQS/SRC directory. The installation process assumes that all DQS332 programs will have write access to the path name which appears as QMASTER_SPOOL_DIR in the conf_file. If this path name is inappropriate for ALL DQS programs the administrator may choose to change the definition of LOG_FILE in the include file "dqs.h". This will require recompilation of the entire DQS332 system.
As an alternative, the administrator may choose to let each program write to its own "log_file" and gather and collate all the files when it is necessary to examine log information. In this case, however, the path-name accessible by each host must be identical to the QMASTER_SPOOL_DIR name.
Qmaster
The qmaster directory contains a major sub-directory for each qmaster registered in this cell. Each qmaster's directory contains four sub-directories whose contents change constantly during DQS332 operation, and hence must permit write operations on all files. There are also two files created by the qmaster, the pid_file and stat_file. An additional, unwelcome file may appear here as well: in the event of a qmaster crash, its core file will be placed in this directory.
common_dir
This directory contains files common to the scheduling and dispatching of jobs by the qmaster.
complex_file -- This file contains all of the definitions of complexes created by the add complex command (qconf -ac).
consumables_file -- This file contains all of the definitions of consumable resources created by the add consumable resource command (qconf -acons).
generic_queue -- This file is read by the qmaster each time the create queue command (qconf -aq) is performed and no name is provided as a parameter following the "-aq" option flag. The contents of this file form the starting template presented in the editor for modification by the administrator.
host_file -- The host_file is read at startup of the qmaster and contains a list of all the hosts known to the qmaster, occasionally called "trusted hosts". Any program attempting to contact the qmaster must have its host's name in this list or be rejected. On the initial startup of the qmaster this file will not be present. The qmaster will post a warning in the err_file and create the host_file.
man_file -- This file contains the login names of all individuals identified as cell "managers". A cell "manager" is given permission to access all DQS332 system files and to execute every option of every DQS332 utility.
op_file -- This file contains the login names of all individuals identified as cell "operators". A cell "operator" is given permission to perform a number of system operations normally reserved to the system manager, and prohibited to the standard system user. The functions qdel, qmod, qmove, and qrls are permitted to operators. Functions such as creating or deleting queues or adding and deleting managers and operators are, of course, limited to cell managers.
seq_num_file -- Jobs are assigned an internal sequence number. The next number to be assigned by the qmaster appears as a single binary value in this file. It is thus not possible to manually reset sequence numbers, other than to delete this file, forcing the numbering sequence to begin over with "1".
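A sketch of that sole manual "reset", with a hypothetical stand-in for the qmaster's common directory; the real reset must of course be performed with the qmaster stopped.

```shell
# Illustrative sketch only: COMMON_DIR stands in for the qmaster common_dir.
COMMON_DIR=$(mktemp -d)
printf 'x' > "$COMMON_DIR/seq_num_file"   # pretend an existing binary counter
rm -f "$COMMON_DIR/seq_num_file"          # numbering restarts at "1" on the
                                          # next qmaster startup
ls "$COMMON_DIR"
```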
acl_file -- This file contains all of the access control list "acl" names for all queues. This is actually a list of lists. An "acl" is a list of names to be given access to one or more queues. A queue definition can include these individuals by naming the corresponding "acl" in its "user_acl" parameter.
job_dir
This directory contains a file for each job currently in the queuing system. Each file contains the submitted script file along with tables and lists created by the qsub operation and used to manage the job while it is in the queue awaiting assignment to a host, as well as during actual job execution.
queue_dir
This directory contains a file for each queue. The file name is, in fact, the name assigned to that queue. Each file contains the queue configuration, encoded in binary form, along with various tables the queue manager utilizes to manage the queues.
tid_dir
To maintain internal coherency during system operation in the face of multiple hosts executing multiple processes a unique identifier label is generated by the qmaster and dqs_execd for every inter-host communication. This label is called a "task identifier" or "tid". An empty file for each generated "tid" is created. An acknowledgment by the receiving host for a transaction causes the corresponding tid file to be deleted from this directory.
In the event of aberrant behavior of a hardware or DQS332 software element some "orphan tid's" may be found in this directory, however the administrator is cautioned to NOT clear out tid files manually without careful analysis. This scheme was created to ensure inter-host synchronization despite multiple restarting of the qmaster or the dqs_execd.
pid_file -- This file contains the process id of the running qmaster. This is a "canonical" location where site procedures may find this pid for system management actions.
stat_file -- Based on the period defined as "STAT_LOG_TIME", the qmaster records summary information about all the queues it manages. This data is time-stamped so DQS managers might determine when queue status changes occur inadvertently.
dqs_execd
The dqs_execd directory contains a major sub-directory for each dqs_execd operating in this cell. Each dqs_execd directory contains four sub-directories plus one file, the "pid_file", which contains the process id of the dqs_execd. Of course there is also the possibility of a core file being placed here in the event of a dqs_execd crash.
exec_dir
The exec_dir contains the actual job file for the executing job. When the dqs_execd launches a job, the script file is copied here and executed.
job_dir
The job_dir contains a file for each job which the dqs_execd is managing (usually only one). In addition to the job's DQS script this file contains all the tables and information necessary for the qmaster and the dqs_execd to manage this job.
rusage_dir
Upon job termination usage data is collected and formatted into a "termination record" to be sent to the qmaster. This record is written to this directory and retained until the qmaster has received and recorded this information. The procedure is used to prevent vital data from being lost, particularly from long-running jobs, in the event of an interruption of dqs_execd or qmaster service.
tid_dir
In order to maintain internal coherency during system operation in the face of multiple hosts executing multiple processes a unique identifier for each communication is generated by qmaster and dqs_execd. This label is called a "task identifier" or "tid". An empty file for each generated "tid" is created. An acknowledgment by the receiving host for a transaction causes the corresponding tid file to be deleted from this directory.
Temporary Files
The dqs_execd creates and deletes a number of temporary files in the "/tmp" directory of its host. These are deleted after use, but if the dqs_execd has been shut down during job launching and execution, these files may be left in the "/tmp" directory inadvertently. Since they are given unique names for each job execution they will remain until removed by the system manager.
The queue configuration was introduced during the discussion of setting up an initial DQS332 cell and queue. The queue configuration is the primary means of tailoring a DQS system to a particular site's requirements. The queue configuration can be changed dynamically by the DQS cell manager without requiring a shutdown and restart of either the qmaster or dqs_execd, unlike the more static "conf_file". Changing the queue configuration will not affect any jobs already in execution. The modified configuration will be considered during the next scheduling pass of the qmaster after the change has been completed. A description of each element follows:
Q_name QA1
Any ASCII string of numbers and letters may be used in the queue name. It must be a unique queue name in a given cell.
hostname QA1_host
The hostname entered here may be any form of the host's name which is used by the network members. DQS will convert the entered name to the fully qualified host name and insert that into the registered queue configuration.
seq_no 0
The seq_no is an arbitrary sequence number assigned by the DQS administrator. It is ignored if the conf_file parameter "DEFAULT_SORT_SEQ_NO" is set to FALSE. If "DEFAULT_SORT_SEQ_NO" is set to TRUE the qmaster will scan the queue list in the order of the sequence numbers, starting with zero "0".
The DQS administrator may choose one of several strategies for assigning sequence numbers. At CSIT the lowest sequence number is assigned to the most powerful computing engines, with less powerful machines being assigned higher sequence numbers.
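Under that strategy a set of queue configurations might carry sequence numbers like the following. The queue names and values are purely illustrative:

```
Q_BIGIRON  seq_no 0    (fastest compute server, considered first)
Q_MIDSIZE  seq_no 10
Q_OLDWS    seq_no 20   (slowest workstation, considered last)
```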
load_masg 1
Each dqs_execd collects information about the state of its host's overall computational and I/O load as reported by the UNIX system through the "rusage" structure. A "total system load" is provided as an integer value representing a fractional percentage of the system usage. A value of 1 represents a load of 0.01, a value of 10 represents a load of 0.10, and a value of 100 represents a load of 1.0.
When DEFAULT_SORT_SEQ_NO is set FALSE the qmaster attempts to assign jobs to the least loaded queues offering the resources requested by the job. The queues are sorted into increasing order of load average, weighted by multiplying the reported load average by the "massage factor" (the load_masg value). The load_masg factor thus permits the administrator to adjust the system-wide relationships between different hosts that may be necessitated by variations in usage measurements or background task activity.
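The weighting itself is simple multiplication. A small sketch of the resulting comparison, with hypothetical queue names, reported loads (integer form, 100 = load 1.00) and load_masg values:

```shell
# Compute weighted load = reported load * load_masg, then order the queues
# as the qmaster would: smallest weighted load considered first.
weighted=$(printf '%s\n' 'QA1 80 1' 'QB1 60 2' 'QC1 50 3' |
  awk '{ printf "%s %d\n", $1, $2 * $3 }' |
  sort -k2 -n)
echo "$weighted"
```

Here QA1, despite the highest raw load, sorts first because its massage factor is smallest.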
load_alarm 175
A threshold value can be set beyond which a queue will not be considered for scheduling by the qmaster. When a host reports a load average greater than this threshold the queue is in an "ALARM" state, and this flag is displayed in qstat output. The default load_alarm represents a load average of 1.75.
priority 0
This field may be confusing at this point because jobs also possess a submission priority. The difference is that the job priority determines only how a job is ordered among other jobs in competition for system resources. The job submission priority has no influence on the UNIX priority with which that job is executed.
The queue priority field here IS the UNIX priority assigned to any job executed in this queue and thus may range from -19 (low) to +19 (high).
type batch
DQS was designed to support the scheduling and management of batch and interactive jobs. DQS332 supports only batch queues; this parameter is ignored.
rerun FALSE
Automatic job rerun is not enabled in DQS332. This field is ignored.
quantity 1
A DQS332 queue can manage more than one job in execution at a time, though this is usually not a practical way to operate a single cpu host.
tmpdir /tmp
During job startup and execution several temporary files are created. This parameter should be the fully qualified path name to the host's temporary directory.
shell /bin/csh
The default shell for executing jobs in this queue. This default can be overridden by commands in the job script.
klog /usr/local/bin/klog
The path name to the AFS klog executable.
reauth_time 6000
The time period in milliseconds for performing an AFS re-authentication of the executing job.
last_user_delay 0
To prevent a single user from dominating the utilization of a queue the administrator can set this time-out value (seconds) during which a user's job will not be considered for scheduling following termination of a previous job for that user.
max_user_jobs 4
This is the second system parameter available for implementing scheduling policies for DQS332 at a site. The MAXUJOBS parameter in the conf_file limits the total number of jobs a user can have considered for scheduling across the entire system. The queue configuration "max_user_jobs" establishes a limit on the number of jobs a user can have queued which will be considered for scheduling for this queue. See "SCHEDULING" for a more complete discussion of this topic.
notify 60
A user job may invoke the "-notify" option to instruct the system to send the job a SIGUSR1 or SIGUSR2 signal as a warning in advance of a SIGSTOP or SIGTERM signal. This "notify" parameter in the queue configuration establishes the number of seconds between sending the warning signal and the SIGTERM or SIGSTOP.
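The effect on the job's side can be sketched in plain sh; in this self-contained sketch the job sends the warning signal to itself, whereas under DQS the dqs_execd would deliver it "notify" seconds before the SIGTERM or SIGSTOP. All names are hypothetical.

```shell
# A job script that catches the warning signal and checkpoints before the
# real termination signal would arrive.
JOB=$(mktemp)
cat > "$JOB" <<'EOF'
#!/bin/sh
trap 'echo "warning received, checkpointing"; exit 0' USR1
kill -USR1 $$     # stand-in for the signal DQS would deliver
sleep 5           # would be the real work
EOF
OUT=$(sh "$JOB")
echo "$OUT"
```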
owner_list NONE
In addition to the DQS manager and DQS operator an individual can be designated a queue owner. A queue owner can perform many system management tasks permitted to the managers and operators, but limited to this queue. Job deletion, queue suspension, enabling, and disabling are among those actions. One or more login names can be entered for this parameter.
user_acl NONE
The administrator can create one or more access lists using the "qconf -au" command. This command adds one or more users to a named list. (This named list will be created if it doesn't exist.) These named lists (of names) can be used to include or exclude groups of users in access to a specific queue. This queue configuration parameter "user_acl" can contain a list of one or more acl_list names which will be permitted to use the queue. (That is, the parameter can itself be a list of names of lists of names - confused?)
xuser_acl NONE
The administrator can create one or more access lists using the "qconf -au" command. That command adds one or more users to a named list. (This named list will be created if it doesn't exist.) These named lists (of names) can be used to include or exclude groups of users in access to a specific queue. This queue configuration parameter "xuser_acl" can contain a list of one or more acl_list names which will be excluded from access to the queue.
subordinate_list NONE
One or more DQS332 queues can be subordinated to another queue. The queue specifying a list of subordinates with this parameter is called the "superior queue". A "superior queue" can NOT itself be a subordinate of another queue. A queue can only be subordinated to one other queue. The "subordinate_list" parameter can contain a list of one or more queue names in the same cell as the queue defining this parameter.
Superior queues are analyzed for scheduling in the same manner as all queues. If a job is assigned to a superior queue, the qmaster will suspend the execution of jobs in all of the queues in the superior queue's subordinate list.
complex_list NONE
This parameter can contain one or more names of complexes defined by the "add complex" function of the qconf command (qconf -ac). See "Complexes and Consumables". Any complex name can be preceded by the DQS reserved word "REQUIRED" (must be all caps). This indicates that no job will be scheduled for this queue UNLESS it requests a resource described in that complex.
consumables NONE
This parameter can contain one or more names of consumable resources defined by the "add consumable " function of the qconf command (qconf -acons). See "Complexes and Consumables". Any consumable name can be preceded by the DQS reserved word "REQUIRED" (must be all caps). This indicates that no job will be scheduled for this queue UNLESS it requests a resource described in that consumable.
s_rt 2147483647
h_rt 2147483647
s_cpu 2147483647
h_cpu 2147483647
s_fsize 2147483647
h_fsize 2147483647
s_data 2147483647
h_data 2147483647
s_stack 2147483647
h_stack 2147483647
s_core 2147483647
h_core 2147483647
s_rss 2147483647
h_rss 2147483647
These parameters establish the "hard" and "soft" limitations on a host's resource utilization of a job executing under control of this queue. The "hard" limits are transferred to the job's execution environment in the hopes that the host operating system provides support for these limits. Note, however, that if a host does support these limits they apply only on a process-by-process basis!! If a job script contains multiple invocations of processes, as in a FORTRAN compilation and execution, the limits apply to each individual step in the job.
DQS332 does check the "soft" and "hard" real-time limits (s_rt & h_rt) and will terminate jobs based on the values of those parameters. A job exceeding the "soft" real-time limit is sent a SIGTERM signal that can be intercepted by the job using the "-notify" option in the job script. If the job exceeds the "hard" real-time limits it is sent a SIGKILL signal which cannot be caught by the user job.
The most valuable aspect of DQS, and easily its most confusing property, is the ability to define and utilize a variety of system "resources" which can then be requested in a user's DQS job script. These resource requests are used to differentiate and assign jobs to the variety of system capabilities found in today's heterogeneous computing environments. Let us look at an example of how and why resource definitions are created at a site. The diagram below shows five DQS hosts with different capabilities.
Many users will have created an application compiled for one machine architecture, say AIX. In the pictured environment the user could run their application on one of the AIX machines by specifying the queue name, say QN1. The negative aspect of this simple approach is that the job may be kept waiting for QN1 because of a previous job on that machine while either QN3 or QN5 might be available.
The solution for this situation is to create a resource definition for all AIX machines in the cell and name it "AIX1". Then the user can submit a job using the qsub command with the "-l" option. What are the steps needed to accomplish this:
This simple example illustrates two key points.
Let us expand the example slightly and create a new complex that cuts across machine architecture features, but shares a different attribute:
So far the sample resource definitions have been a single string such as "OUR_AIX" or "OUR_PVM". We could have used an alternative form for describing alternatives as we did with AIX versus HPUX. This form would replace the string we entered in the complex files: arch=OUR_AIX and arch=OUR_HPUX. The string "arch" is one created by the administrator and could be any arbitrary name. A resource request would then have the form "-l arch=OUR_AIX", or "-l arch=OUR_HPUX".
Resource definitions can contain numeric values, and the corresponding resource requests can perform numeric comparisons on these values to satisfy a criterion. A complex called BigMemory could be defined containing the line "mem=128". For our example let QN1 and QN2 both be operating on hosts which have 128 megabytes each. The complex BigMemory would be added to QN1 and QN2. A request for an AIX machine with at least 64 megabytes of memory might be stated as "-l OUR_AIX.and.mem.ge.64".
Resource definitions can possess more than the single line examples in each named complex. A complex definition named "BIG_HUMMER" might look like:
AIX414
mem=1028
Horsepower=10
IO_bandwidth=250
A resource request which needs a BIG_HUMMER host would, in this case, look like:
"-l AIX414.and.mem.ge.1028.and.Horsepower.ge.10.and.IO_bandwidth.ge.250"
There is one type of resource we have singled out for special handling in DQS332. These are resources that are not static during the operation of a DQS cell. While machine horsepower, memory size, operating systems and compilers remain static for long periods of time (on the order of days or weeks), shared memory multiprocessor CPUs will have varying amounts of shared memory available to them as different jobs are executed on others of their CPUs. An increasingly common resource situation is "licensed software" such as compilers and database management systems. In many cases there are fewer licenses available within a system than there are hosts to execute the software.
This type of resource is called a "consumable" in DQS332. The definition of a consumable resource is somewhat different than a DQS "complex", in that the administrator will describe the total number of a resource which is available in a system, and the number of that resource consumed by a satisfied resource request. In the case of a FORTRAN compiler license, a site usually purchases a number of licenses for their system that are managed by a "license server". The consumable resource manager in DQS332 does not supplant a license server nor can it effectively mimic such a server. Instead it provides a mechanism parallel to the license server which attempts to keep track of how many licenses are in use by DQS clients. That is, DQS does not query the license manager for a count of available licenses; it keeps its own count of how many licenses (a consumable resource) are in use by DQS jobs.
The administrator defines a consumable resource by executing the command "qconf -acons FORTRAN" (using the compiler as an example). The default editor will open a window with the following template:
Consumable xlf
Available = <the amount of resources available>
Consume_by = <quantum by which resource is reduced by a request>
Current = <currently available resources>
The field for Available should be filled in with the number of FORTRAN licenses authorized to this system. The Consume_by will be 1 for software such as compilers. The "Current" field will usually be equal to the "Available" field, unless there are several licenses in use at the time this Consumable is being defined. The Current field is also used to reset the DQS332 consumable counter when DQS332 gets out of sync with the actual license manager.
Queues that must manage this consumable resource should then have the consumable name added to the "consumables" parameter list in the queue configuration. The user need not be aware of the distinction between standard complexes and consumables. Their resource requests are stated in the same way: "-l our_AIX.and.mem.ge.64.and.xlf". The qmaster will determine if an xlf license is available by examining its internal counters (which may NOT match the license server's). If the license and other resources are available the job will be launched. At the time the job is started the consumable count for the FORTRAN resource will be decremented.
Upon job termination this resource count will be incremented. Obviously this is not a satisfactory situation for a user who wishes to submit a job which does a quick FORTRAN compile which produces an executable which is then to run a week long job. The consumable count would remain decremented for the duration of the job while the license manager will have had the license "token" returned at the conclusion of the compilation.
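The bookkeeping reduces to a simple counter. A minimal sketch with hypothetical values; the real counters live inside the qmaster:

```shell
available=3          # Available: licenses purchased for the system
consume_by=1         # Consume_by: quantum removed per satisfied request
current=$available   # Current: licenses DQS believes are free
current=$((current - consume_by))   # a job requesting the license starts
echo "free after job start: $current"
current=$((current + consume_by))   # the job terminates (or runs "qalter -rc")
echo "free after return: $current"
```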
For this situation the cooperation of the user is required, to avoid breaking up jobs into separate compile-only and compute-only jobs. The "qalter" command has been modified to permit any user to execute it, but only with the "-rc" (return consumable) option. The user job would then have a script file that might look like:
#!/bin/csh
#$ -l xlf.and.our_AIX
xlf myprogram
# Regrettably, "qalter -rc" not implemented - hen 990806
qalter -rc xlf 1
myprogram mydata
The qalter command here specifies the name of the resource being returned followed by the quantity being returned. When resources such as high performance disk or shared memory are being defined as a consumable resource, often a "quantum" of the resource is granted and recovered. An example might be that a UNIX page is the minimum quantum, or an integral number of pages could be the quantum. Where licenses are normally doled out one at a time, memory might be allocated 1 MB at a time. Hence the Consume_by field in the consumable definition.
REQUIRED Complexes and Consumables
A job submission may contain one or more resource requests (the "-l" option). A job with no specific resource requests is thus a candidate for assignment to any available queue. In many installations some queues are best utilized by very specific job configurations. An example might be a site that possesses a heterogeneous collection of cpus with very wide differences in computing capacity. The more robust computers should not be assigned to "tiny" but persistent jobs in some cases. DQS 3.3.2 provides a special keyword "REQUIRED" which can precede any complex or consumable which a user MUST request in order for that job to be considered for scheduling on that queue.
The crux of any resource allocation and management system is its ability to provide resources in an "efficient" and "fair" manner. "Efficiency" is usually measured in terms of maximizing job throughput and effective utilization of the available resources. "Efficiency" can be quantified in ways usually referred to the hardware hosts in a system. "Fairness" is less easily described, is often measured by perceptions, and is most often referred to the human users of a system. Further, priorities for efficiency and fairness and their relative values can vary widely from site to site. The burden of meeting these objectives falls upon the system's job scheduling mechanism.
Forty years of experience with attempts at creating comprehensive job scheduling algorithms have demonstrated several points:
DQS332 therefore attempts to provide only a minimal amount of job scheduling technology. Hopefully small sites will be able to achieve a good level of balance in host usage and perceived "fairness" with the system as it is delivered. As a site develops experience with batch job management the staff will experiment with the few parameters provided in DQS. At some point the administrator will want to probe the module dqs_schedule.c , adding or subtracting from its capabilities. To that end we will describe the basic features of DQS scheduling and try to illuminate the routines most likely to be modified.
A user job passes through two screening processes before being considered by the qmaster for scheduling:
At the time of job submission a user job is checked to see if it meets two system criteria:
If a job fails these tests it is rejected at the time of submission and an error message is returned to the user submitting the job. (In the event that a job is submitted in anticipation of resources being added to the system, such as a new host architecture, the user can choose to override the first test by using the "force" option ("-F") of the qsub command.)
The qmaster conducts an examination (or "pass") over the job list:
The scanning process consists of sorting the jobs according to their submitted priority (the "-p" option), then by an internally generated "subpriority" and finally by the job sequence number (establishing its submission order). After the jobs are sorted they are examined in order, testing each available queue (each ordered by load average or sequence number) looking for the first one which matches the resources requested by that job. If a match is found the job is dispatched and the next job is examined.
Manipulation of a job's subpriority before the sorting step is the easiest way to affect the basic scheduling algorithm. In DQS332 this simply consists of increasing the subpriority field of a job based on the number of previously submitted jobs (at the same priority level) for that user. Thus two or more users with several jobs queued at the same priority and for the same system resource will have their jobs interleaved, so that no one user can dominate a resource by submitting a large quantity of jobs.
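The sorting and subpriority steps described above can be sketched in "C". This is a minimal sketch, not the actual DQS code: the struct job fields and helper names below are invented for illustration, and the real logic lives in dqs_schedule.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified job record; the field names are
 * illustrative and do not match the actual DQS structures. */
struct job {
    int priority;     /* user-submitted priority ("-p" option) */
    int subpriority;  /* internally generated, see below       */
    int seq;          /* job sequence number (submission order)*/
    char user[16];
};

/* Sort key: priority (higher first), then subpriority (lower
 * first), then sequence number (older first). */
static int job_cmp(const void *a, const void *b)
{
    const struct job *x = a, *y = b;
    if (x->priority != y->priority)
        return y->priority - x->priority;
    if (x->subpriority != y->subpriority)
        return x->subpriority - y->subpriority;
    return x->seq - y->seq;
}

/* Subpriority: the number of previously submitted jobs by the
 * same user at the same priority level, so that jobs from
 * different users interleave in the sorted order. */
static void assign_subpriorities(struct job *jobs, int n)
{
    for (int i = 0; i < n; i++) {
        jobs[i].subpriority = 0;
        for (int j = 0; j < n; j++)
            if (jobs[j].seq < jobs[i].seq &&
                jobs[j].priority == jobs[i].priority &&
                strcmp(jobs[j].user, jobs[i].user) == 0)
                jobs[i].subpriority++;
    }
}
```

With four equal-priority jobs from two users, the subpriority step interleaves them (first job of each user, then the second of each) rather than draining one user's backlog first.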
The system administrator will probably experiment with this subpriority computation as a first step in customizing DQS. Tampering with the resource matching is considered a riskier affair, as the side effects of such changes are harder to predict or detect.
DQS332 provides a minimal AFS support capability. The introduction of the "process shepherd" has made the job re-authentication in DQS conform to AFS security requirements. The output file disposition feature addresses the `cross platform' security problems of dealing with stdout and stderr.
A limited multi-cell operation capability is provided in DQS332. Jobs may be moved from cell to cell if they are not yet in execution, and users authenticated in one cell can view the status of the queues in another cell.
Site accounting methods vary as widely as any aspect of a batch processing system. DQS332 records as much information as possible about a job's scheduling and execution in a single ASCII line of text. These entries are preceded by an ASCII string of the standard UNIX GMT time of the entry.
Extraction of the accounting information simply requires using a structure definition for the act_file entries in one's "c" extraction program. An example of this technique may be found in the program acte.c which can be found in the .../DQS/tools directory. Included in the tools directory is a script "dostats" which employs acte to create a series of system summary files for the administrator.
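A sketch of such an extraction program follows. The act_file field layout used here is an assumption for illustration only: the authoritative layout is the structure definition used by acte.c in the .../DQS/tools directory, and the leading GMT time is treated here as a single token, whereas the real entries begin with an ASCII time string.

```c
#include <stdio.h>

/* Hypothetical accounting record -- see acte.c for the real
 * structure definition; these field names are illustrative. */
struct acct_rec {
    char timestamp[32];  /* leading GMT time string (assumed one token) */
    char user[32];
    char queue[32];
    long cpu_seconds;
};

/* Parse one ASCII act_file line of the assumed form:
 *   "<gmt-time> <user> <queue> <cpu-seconds>"
 * Returns 1 on success, 0 on a malformed line. */
static int parse_acct_line(const char *line, struct acct_rec *r)
{
    return sscanf(line, "%31s %31s %31s %ld",
                  r->timestamp, r->user, r->queue,
                  &r->cpu_seconds) == 4;
}
```

A per-user or per-queue summary, like the one the "dostats" script produces with acte, would then simply accumulate the parsed records.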
The process of DQS system management first consists of laying out the physical and logical structure of a cell. The physical organization is described by adding hosts and assigning them to queues. The logical organization consists of defining resource "complexes" and consumable resources and assigning these to their appropriate queue hosts. Finally, setting system parameters in the conf_file and in each queue configuration establishes the operating environment for DQS operation.
The ongoing management steps should include:
The majority of the DQS332 utility set and its options are provided for the system management function. While users may employ the qalter command, for example, to change the characteristics of a submitted job, the administrator will use this function more often. A not-uncommon occurrence is for the administrator to increase the submission priority of a job to move it ahead of other jobs in the scheduling.
One utility should be highlighted here, the "qidle" function. Many DQS hosts may actually reside on someone's desk and serve as their personal workstation. At the same time these machines are utilized for their computational capabilities in a cell. To serve both functions, it must be possible for the workstation user to have priority access to their machine and not suffer keyboard and mouse response deficiencies because the host is being shared with DQS. A first step is to make the "owner" of the workstation also an "owner" of all queues assigned to that host. Then, when the workstation owner wishes to have exclusive use of the machine, they will have DQS permission to suspend any queues on that machine.
Enter the "qidle" utility. This is an X-Window based program, since we presume that workstation users will be operating with X-Window. It can be started at any workstation and performs the following functions on behalf of the workstation "owner" whom the administrator has also designated a queue "owner" in the queue configuration.
What happens in the case where more than one user has access to a workstation? The "system console" is an example where many users may be permitted to operate the keyboard and mouse. Making all such users "owners" of that station's queues could result in an unmanageable list and is a potential security problem, since a queue owner has privileges beyond queue suspension actions.
The qidle in DQS332 has thus been modified from its DQS 3.1.2.4 form. It is now a member of the DQS332 utilities group and communicates directly with the qmaster rather than indirectly through the qmod utility. qidle can be started on any workstation by any user who has permission to login to that workstation. Once started it performs the same functions described above.
Solving Installation Problems
Most installation difficulties can be divided into three categories (in order of probability):
Identify the symptoms of the installation failure and refer to one of the following sections:
INSTALL fails during the make process of the "config" program.
The GNU configure program uses the "Makefile.in" template in the DQS/CONFIG directory to produce the Makefile for the DQS config utility. It is possible that a new configuration of compilers or linkers can cause the GNU facility to create an erroneous Makefile. Visually check the Makefile for correctness.
Although DQS332 installation has been tested on many platforms, variants of the compiler or operating systems can create WARNING messages during the compilation which we have not made provision for. Even different versions of GNU "C" yield different warning messages. If the error is fatal to the compilation please contact the DQS332 support team for assistance.
INSTALL fails during the execution of the DQS config program.
During the config process the system attempts to create a number of directories and sub-directories. The default starting point for this process is the current working directory of the user if running as non-root, or /usr/local/DQS if running as root. If any of the directories already exist, an error message is displayed on stdout, but the config program continues. If the user discovers that erroneous directory names were specified, config can be interrupted with CTRL-C. This will unwind many aspects of the configuration process; however, NO DIRECTORIES will be removed. The administrator will have to clean up any relevant directories manually. After reviewing the "directory already exists" messages the administrator can choose to ignore those which are expected because the directories were previously created.
INSTALL fails during the "make" process
During the DQS config step, all of the target directories are created except for the ones associated with the compiled output objects (`.o' files) and the interim executables (qmaster, dqs_execd). If a previous installation occurred under a "root" user and the current "make" is being done as non-root, the attempt to create the ARCS sub-directories will fail for lack of permissions. The solution is to perform the "make" as root or change the owner of the ARCS sub-directories to the user doing the installation of DQS332.
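The ownership fix can be applied with something like the following. The function name and the default path are illustrative only; substitute the actual DQS installation directory, and run the command as root.

```shell
# fix_arcs_perms: give the current installing user ownership of the
# ARCS sub-directories so a non-root "make" can write there.
# The default path below is an assumption -- pass the real one.
fix_arcs_perms() {
    dqs_root=${1:-/usr/local/DQS}
    chown -R "$(id -un)" "$dqs_root/ARCS"
}
```

For example, as root: `su theuser -c 'make'` after `fix_arcs_perms /usr/local/DQS` has handed the tree to theuser.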
The GNU CC compiler is chosen as the default compiler for the "make" process if it is available. Some sites may experience a large number of "gcc" warning messages if there have been local modifications to the gnu include files. If this occurs, or if the site prefers to use the native "C" compilers, then the following steps should be taken:
If only "warning" messages appear in the stdout results, you can feel reasonably secure with the installation. However we will try to eliminate these in future releases and would appreciate receiving information on these occurrences. If an error fatal to the compilation occurs please contact the DQS support staff.
INSTALL fails during the "make installbin" phase
Once the make process has created the temporary executables in the ARCS directory they should be moved to their "final resting place" as chosen during the DQS config step. For operational installations this step should be performed as root. If the INSTALL script was started as non-root and the target directory requires root permissions the INSTALL process will fail at this point.
If this occurs the administrator should switch to "root", change directory to ./DQS and type "make installbin".
Since the DQS config process attempts to create the BIN target directory, this phase may generate several "directory already exists" warning messages. Ignore these warnings. If, however, the message is "error, permission denied", the process should be repeated in "root" mode.
To prevent confusion between DQS332 binaries and previously installed versions we have appended the string "332" during the installbin process. The usual next step is to provide soft-links in /usr/local/bin to these binaries, something of the form:
ln -s /usr/local/DQS/bin/qmaster332 /usr/local/bin/qmaster
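Rather than linking each utility by hand, a loop like the following can create all the links at once. The function name is invented for illustration, and both default directories are assumptions; adjust them to the BIN target directory chosen during the DQS config step.

```shell
# link_dqs_bins: create soft-links (without the "332" suffix) in a
# destination directory for every "332"-suffixed DQS binary.
# Default paths are illustrative only.
link_dqs_bins() {
    src=${1:-/usr/local/DQS/bin}
    dst=${2:-/usr/local/bin}
    for f in "$src"/*332; do
        [ -e "$f" ] || continue            # no matches: skip
        base=$(basename "$f" 332)          # e.g. qmaster332 -> qmaster
        ln -sf "$f" "$dst/$base"
    done
}
```

Running `link_dqs_bins` as root after "make installbin" then leaves commands such as qmaster and qconf on the default PATH.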
INSTALL fails during the "make installconf" phase
After the binaries have been installed in their directory, the "resolve_file" and "conf_file" will be moved to their target directory (a possible default might be "/usr/local/DQS/common/conf"). In our "quick install example" this process should proceed automatically. If a non-root user invokes the INSTALL script and the destination directory is restricted to the root user, this step will fail with a "permission denied" error message. Note that when a series of different platform types are being aggregated into a single cell, only one conf_file and resolve_file need be moved to the common/conf directory. If this has already been done then this step can be skipped.
Startup of the qmaster fails.
The principal reason for the qmaster not executing during initial testing is the absence of the /etc/services entries directed by the installation process. The err_file should be examined. Warning messages about absent hosts, acl and complex files should be ignored. Look for an entry "Bad Service", which points to the /etc/services file.
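The missing entries look something like the following. The service names and port numbers here are placeholders, not the actual defaults; use the exact names and ports directed by the installation process.

```
dqs_qmaster    PORT/tcp
dqs_execd      PORT/tcp
```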
An obvious error, but one that occurs often, is trying to start the qmaster in user-mode while RESERVED_PORTS TRUE appears in the conf_file.
If attempts at starting the qmaster fail after checking root-mode and the /etc/services file, the administrator should set the environment variable DEBUG to 1 and then restart the qmaster as follows: "qmaster332 >&debug.out &" (assuming a C shell environment). After the qmaster crashes, send the file "debug.out" to the DQS support staff.
Startup of the dqs_execd fails
The principal reason for the dqs_execd not executing during initial testing is the absence of the /etc/services entries on its host, as directed by the installation process. The err_file should be examined. Warning messages should be ignored. Look for an entry "Bad Service", which points to the /etc/services file.
An obvious error, but one that occurs often, is trying to start the dqs_execd in user-mode while RESERVED_PORTS TRUE appears in the conf_file.
If the dqs_execd is not able to check in with the qmaster during dqs_execd startup, the daemon will shut down (once executing, the dqs_execd will not shut down if the qmaster is absent). Make sure the qmaster is running before attempting to start the dqs_execd.
If attempts at starting the dqs_execd fail after checking root-mode and the /etc/services file, the administrator should set the environment variable DEBUG to 1 and then restart the dqs_execd as follows: "dqs_execd332 >&debug.out &" (assuming a C shell environment). After the dqs_execd crashes, send the file "debug.out" to the DQS support staff.
Startup of qconf fails
If the first attempt at using qconf produces error messages and the qconf terminates there are several possible causes:
qstat display shows queue status as UNKNOWN
During the initial test phase, the manager will have created one queue using qconf. After it has been created, execution of qstat should show the presence of a queue and a status of DISABLED. An UNKNOWN status indicates a failure of the dqs_execd to contact the qmaster in the time prescribed as MAX_UNHEARD in the conf_file. Check the err_file for messages relating to the dqs_execd being unable to contact the qmaster. Since the dqs_execd would not even start if it could not check in with the qmaster, some new problem must have developed. Check to see if the dqs_execd is still running.
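The timeout is a conf_file entry of the same KEYWORD VALUE form as RESERVED_PORTS TRUE shown earlier; the value below is illustrative only:

```
MAX_UNHEARD  300
```

Raising this value gives a slow or heavily loaded dqs_execd host more time to report in before its queues are marked UNKNOWN.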
qsub fails to submit test job
The test script should be accepted by the DQS system at this point with no problem, since utility<->qmaster interaction has been operating successfully in the previous steps. The most likely reason for a failure of this qsub test is a message of the form "ALARM CLOCK shutdown". This is due to the qmaster or the network interfaces being overburdened. Often the host on which the qmaster is running may be executing some non-DQS managed computational hog. If the ALARM message occurs, try increasing the ALARM values in the conf_file and re-executing the qsub command. (Note that in this experiment the dqs_execd and qmaster need not be restarted after changing the conf_file, as the qsub is the only one complaining. However, if the new values of the ALARM parameters prove satisfactory, the daemons should be restarted as soon as practicable.)
Test job ends with no output
If the permissions for the user submitting the test script are not sufficient for the target host, the job launching process will be terminated and a message sent to the err_file. An accounting record will also be sent to the DQS act_file. Check these files for information.
Test script produces a non-zero length stderr file
The test script should create two output files, one containing the stdout information and the other the stderr output. If the stderr output is not zero length then some "very unlikely" event occurred during the job execution. Examine this stderr file and the err_file to determine the cause.
Operational errors
Once the system has succeeded in running the test script, the administrator will configure hosts, queues and resources for its operational settings. A myriad of situations can then occur which may appear to be, or in fact are, DQS system errors. For this reason DQS produces a large number of informational, warning and error messages which are posted to the system err_file.
In the event that an operational aberration is detected, the err_file should be examined closely. If no explanation is obvious, the DQS support staff should be contacted and sent a relevant extraction from the err_file and act_file.