QBATCH: Job Description

Job Description

qbatch.pbs describes the PBS job that will be submitted. It contains PBS directives (lines starting with #PBS) that set the job name and queue name and control PBS output and error file generation, PBS notification, etc.
It also sets environment variables that specify the name of the user-provided qcsh script (QCSH_SCRIPT), the qrb file the job will read (QRB_FILE), the topology (TOPOLOGY), the maximum idle time before a job is declared hung (MAX_IDLE_TIME), how often to check whether the job is idle (CHECK_IDLE_TIME), etc.
qbatch.pbs is the file that is actually submitted with qsub.
Users should copy this file into their working directory and customize it.

qstart.csh is the script that starts the PBS job. It starts the qdaemon and qcsh processes with the appropriate arguments. It also powercycles the machine partition if needed and restarts the job. The user should not need to copy or modify this script.

prerun.qcsh is executed under the qcsh shell. It starts up the machine partition by running qinit, qpartition_connect, etc. It also provides recovery mechanisms for machine errors and hung jobs. The user should not need to copy or modify this script.

user.qcsh is a user provided qcsh script that will run on the machine partition. The example $QBATCH_HOME/user.qcsh script is a simple "Hello World" example. The exact name of the user provided script is defined with the QCSH_SCRIPT variable in qbatch.pbs.

Job Recovery

We have currently implemented the following procedure to recover jobs that produce errors or hang. During the machine startup phase (prerun.qcsh script) we execute the following sequence of qcsh commands:
  • qinit $QMACHINE
  • qpartition_connect -p 0
  • qset_reset_boot -R $QRB_FILE (only if the user has defined the $QRB_FILE variable)
  • qreset_sys
  • qreset_boot
  • qdiscover
  • qpartition_remap $TOPOLOGY (only if the user has defined the $TOPOLOGY variable)
In the above startup sequence we check for errors only during the qreset_sys, qreset_boot and qdiscover commands. If an error occurs, we go back and rerun the sequence starting at qreset_sys. If an error occurs again, we exit the startup phase, power cycle the machine, restart qdaemon and qcsh, and then go through the startup procedure again. This cycle is repeated up to $MAX_PCOUNT times or until the startup phase ends without an error.
$MAX_PCOUNT is the maximum number of powercycles allowed during a job run before we give up and declare the machine partition broken and in need of technical attention. If the maximum number of powercycles has been reached, all queued jobs are put on Hold, so they will not start running until they are released (using: qrls JobId) by the user who owns the job or by a user with PBS administrative privileges.
If a job passes the startup phase with no errors, it will run the user-provided qcsh script in the background:
 source $QCSH_SCRIPT &
If the user job exits normally we execute:
  • qhangupcheck
  • qhdwcheck
  • qdetach
The output of qhdwcheck is emailed to the user.
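
The retry and powercycle logic described above can be sketched in Python. This is only an illustration of the control flow, not the actual prerun.qcsh or qstart.csh code: the function names, the `run` and `powercycle` callbacks, and the reading that a job gives up once $MAX_PCOUNT powercycles have been used are all assumptions made for the example.

```python
from typing import Callable, Optional

# Commands whose errors are checked during startup, in the order given above.
CHECKED = ("qreset_sys", "qreset_boot", "qdiscover")


def run_checked(run: Callable[[str], bool]) -> bool:
    """Run the checked commands in order, stopping at the first failure."""
    return all(run(cmd) for cmd in CHECKED)


def startup(run: Callable[[str], bool],
            powercycle: Callable[[], None],
            max_pcount: int,
            qrb_file: Optional[str] = None,
            topology: Optional[str] = None) -> bool:
    """Return True if startup succeeds, False if the partition is given up on.

    `run` is a hypothetical callback that executes one qcsh command by name
    and returns True on success; `powercycle` is a hypothetical callback.
    """
    pcount = 0
    while True:
        run("qinit")
        run("qpartition_connect")
        if qrb_file is not None:
            run("qset_reset_boot")
        # First pass, then a single retry starting again at qreset_sys.
        if run_checked(run) or run_checked(run):
            if topology is not None:
                run("qpartition_remap")
            return True
        if pcount >= max_pcount:
            # Partition is declared broken; queued jobs would be put on Hold.
            return False
        # Stands in for: power cycle, then restart qdaemon and qcsh.
        powercycle()
        pcount += 1
```

With a `run` callback that always fails at qdiscover and `max_pcount=2`, this sketch powercycles twice and then returns False.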

Checking for Idle Jobs

For as long as the user provided script executes in the background, we examine the modification time of file $IDLEFILE every $CHECK_IDLE_TIME seconds. If $IDLEFILE has not been modified for more than $MAX_IDLE_TIME seconds we declare the job as idle (hung) and a recovery process starts by first executing a series of qcsh commands:
  • qkill
  • qhangupcheck
  • qhdwcheck
We then exit qcsh, kill any remaining qdaemon processes for the user partition, powercycle, restart qdaemon and qcsh, and then rerun the startup phase.

The "watch" file ($IDLEFILE) remains the same throughout the PBS job run. A user who wishes to switch to a different "watch" file during the run should append the name of the new file to $IDLEFILENAMES:

   echo my_new_idlefile >> $IDLEFILENAMES
The PBS scripts read the last line of $IDLEFILENAMES before each check of the modification time (thus every $CHECK_IDLE_TIME seconds) and use it as the new $IDLEFILE.
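
The idle check described in this section can be illustrated with a short Python sketch. It is not the code used by the batch scripts: the function names are hypothetical, and the fallback to a default watch file when $IDLEFILENAMES is missing or empty is an assumption.

```python
import os
import time
from typing import Optional


def current_idlefile(idlefilenames: str, default: str) -> str:
    """Return the last non-empty line of the names file, or the default."""
    try:
        with open(idlefilenames) as fh:
            lines = [ln.strip() for ln in fh if ln.strip()]
    except FileNotFoundError:
        return default
    return lines[-1] if lines else default


def is_idle(idlefile: str, max_idle_time: float,
            now: Optional[float] = None) -> bool:
    """True if idlefile was last modified more than max_idle_time seconds ago.

    Raises OSError if idlefile does not exist; how the real scripts treat a
    missing watch file is not covered here.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(idlefile) > max_idle_time
```

A monitoring loop would call `current_idlefile` and then `is_idle` once every $CHECK_IDLE_TIME seconds, and start the recovery process when `is_idle` returns True.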

Re-running jobs

A variety of errors can occur during a job run (SCU, DRAM, EDRAM). To determine whether the results of a job can be trusted, we have defined error limits for each type of error and for each partition. Error counters in a running job can be examined with the qhdwcheck qcsh command. If the error counts are greater than the error limits, the job should be re-run. Error limits for each machine partition are specified in the file:
      /qcdoc/machines/status/$QMACHINE/error_limits.dat
The user can override the default error limits file by defining the environment variable ERRORLIMITSFILE (it is commented-out in qbatch.pbs, version v2).

A qhdwcheck wrapper gets the error counts by running the standard qhdwcheck qcsh command and compares them to the error limits. If an error counter is greater than the corresponding error limit, the wrapper returns an exit code of 1, meaning that the job should be re-run. More details in qhdwcheck wrapper.
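
The comparison performed by the wrapper can be sketched as follows. This is a minimal illustration only: the dictionary representation, the function names, and the limit values in the usage note are hypothetical, and the actual format of error_limits.dat is not reproduced here. Counters with no configured limit are ignored in this sketch, which is also an assumption.

```python
from typing import Dict, List


def exceeded(counts: Dict[str, int], limits: Dict[str, int]) -> List[str]:
    """Names of counters that are strictly greater than their limit."""
    return [name for name, n in counts.items()
            if name in limits and n > limits[name]]


def wrapper_exit_code(counts: Dict[str, int], limits: Dict[str, int]) -> int:
    """1 if any counter exceeds its limit (job should re-run), else 0."""
    return 1 if exceeded(counts, limits) else 0
```

For example, with made-up limits `{"SCU": 0, "DRAM": 2, "EDRAM": 2}`, the counts `{"SCU": 0, "DRAM": 3, "EDRAM": 1}` give an exit code of 1 because the DRAM counter is above its limit.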

CheckSum Errors

Each Serial Communication Unit (SCU) on a QCDOC node keeps separate running checksums of all the data it has sent and all the data it has received. The sent and received checksums on the two ends of each wire should agree. Any disagreement indicates a miscommunication between the nodes that was not caught by the packet-level data integrity check.
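
The idea of comparing the checksums at the two ends of a wire can be illustrated with a toy Python model. This is not the QOS implementation: the `WireEnd` class, the 32-bit additive checksum, and the treatment of a wire as carrying traffic in both directions are assumptions made only to show the comparison.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumed 32-bit checksum register


@dataclass
class WireEnd:
    """Toy model of one end of a wire with separate running checksums."""
    sent: int = 0
    received: int = 0

    def send(self, word: int) -> None:
        self.sent = (self.sent + word) & MASK

    def receive(self, word: int) -> None:
        self.received = (self.received + word) & MASK


def wire_ok(a: WireEnd, b: WireEnd) -> bool:
    """What one end sent must match what the other end received."""
    return a.sent == b.received and b.sent == a.received
```

If end `a` sends the word 7 but end `b` receives 5, `wire_ok(a, b)` becomes False, which corresponds to the kind of disagreement the checksum test reports.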

QOS has a routine (ScuChecksum::CsumSwap(void)) that reads the checksum registers and compares them. The checksum wrapper (csum_test.qcsh) is a qcsh script located at $QBATCH_HOME/ that invokes the above routine via the program csum_test.x (written by Chulwoo) and examines the exit code. If an error is detected, it emails the output to the user and the admins. An error usually means that the result of the calculation may not be trusted and the job should be re-run.

csum_test.qcsh is invoked by prerun.qcsh after the user-provided script ($QCSH_SCRIPT) has run successfully.

To turn off checksum tests, users should leave the environment variable $CHECKSUM undefined or set it to 0 in qbatch.pbs. For very large partitions (8K nodes) the checksum test may take a long time.

Error Notification

The user is notified by email when:
  • trying to run a job on a partition that is not allocated properly
  • the user-provided script $QCSH_SCRIPT does not exist
  • an error occurs during the startup phase (qreset_sys, qreset_boot, qdiscover)
  • the job appears to be idle (Idle time of $IDLEFILE is greater than $MAX_IDLE_TIME)
  • the machine is being powercycled
  • the maximum number of powercycles has been reached ($MAX_PCOUNT)
  • queued jobs are being put on a Hold state as the maximum number of powercycles has been reached and errors still occur.
  • Checksum errors were detected.
In addition, PBS notifies the user when a job starts, ends or aborts. The PBS notification is specified with the #PBS -m bea directive in qbatch.pbs.

Diagnostics

The output of every qcsh command is redirected to a file. All job output files are stored in the $JOBID/ sub-directory and have the following format:

Command_Name.Job_Name.Job_ID.out.Number_of_Powercycles.Number_of_Retries
  • Command_Name: is the qcsh command, like qinit, qreset_sys, qreset_boot etc.
  • Job_Name: is the Job Name as specified by the user in the qbatch.pbs file (#PBS -N Job_Name)
  • Job_ID: is the Job ID assigned by PBS
  • Number_of_Powercycles: is the powercycle counter
  • Number_of_Retries: is the number of times we try to reset the machine by executing qreset_sys before powercycling
For example: qreset_boot.qbatch.1112.out.2.1 contains the output of the qreset_boot command after we have powercycled twice and retried qreset_sys once.
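
When sorting through many output files it can help to split the file names back into their fields. The following Python sketch does this for the naming format above; `OutputName`, `format_name` and `parse_name` are hypothetical helpers, not part of QBATCH, and the parser assumes that the command name and the Job ID contain no dots.

```python
from typing import NamedTuple


class OutputName(NamedTuple):
    command: str
    job_name: str
    job_id: str
    powercycles: int
    retries: int


def format_name(n: OutputName) -> str:
    """Build a file name in the documented format."""
    return (f"{n.command}.{n.job_name}.{n.job_id}"
            f".out.{n.powercycles}.{n.retries}")


def parse_name(filename: str) -> OutputName:
    """Split an output file name into its fields."""
    # Split from the right so that dots in the job name do not
    # interfere with the two trailing counters.
    head, marker, powercycles, retries = filename.rsplit(".", 3)
    if marker != "out":
        raise ValueError(f"not a job output file name: {filename}")
    command, rest = head.split(".", 1)
    job_name, job_id = rest.rsplit(".", 1)
    return OutputName(command, job_name, job_id,
                      int(powercycles), int(retries))
```

Applied to the example above, `parse_name("qreset_boot.qbatch.1112.out.2.1")` yields the command qreset_boot, job name qbatch, Job ID 1112, a powercycle count of 2 and a retry count of 1.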
