QBATCH: qhdwcheck wrapper

Errors (SCU, DRAM, EDRAM) can occur during a job run. A large number of errors may indicate that the job results cannot be trusted. The values of the error counters are reported by the qcsh command qhdwcheck and are reset with qreset_sys.

To determine whether the job results can be trusted, we have defined error limits for each type of error and for each partition. If an error count exceeds its limit, the job should be re-run and the new results compared with the previous ones.

The qhdwcheck wrapper executes the qcsh command qhdwcheck to get the error counts and compares them with the error limits. The error limits are read from the file:

     /qcdoc/machines/status/$QMACHINE/error_limits.dat
The user can override the default location of the error limits file by setting the environment variable ERRORLIMITSFILE, for example:
    export ERRORLIMITSFILE=/home/user/workdir/errors.dat
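The lookup order described above can be sketched in Python. This is only an illustration, not the wrapper's own code: the function name `error_limits_path` is hypothetical, and the sketch assumes that an empty ERRORLIMITSFILE is treated the same as an unset one.

```python
import os

# Illustrative sketch only; not part of the qhdwcheck wrapper.
# The function name is hypothetical. The default path and the
# variable names ERRORLIMITSFILE and QMACHINE come from the text above.

def error_limits_path(env=None):
    """Return the error limits file that would be used for this environment."""
    env = os.environ if env is None else env
    override = env.get("ERRORLIMITSFILE")
    if override:
        # Assumption: an empty value counts as "not set".
        return override
    # Raises KeyError if QMACHINE is not defined.
    return "/qcdoc/machines/status/{}/error_limits.dat".format(env["QMACHINE"])
```

Passing a plain dictionary instead of the real environment makes it easy to check both branches without changing any shell settings.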
The wrapper can be invoked as:
     /qcdoc/local/batch/v2/qhdwcheck.qcsh [output_file]
The optional argument output_file specifies the output file of the qhdwcheck command. If no argument is passed, the output file defaults to qhdwcheck.out.

If an error count is greater than the corresponding limit, the wrapper returns an exit code of 1, indicating that the job should be re-run. Otherwise it returns 0. For example:

qcsh> source /qcdoc/local/batch/v2/qhdwcheck.qcsh outputfile
qcsh> if ( $? != 0) then
qcsh>   re-run-job
qcsh> else
qcsh>   run-next-job
qcsh> endif
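The comparison behind that exit code can be sketched in Python. This is a minimal illustration of the rule stated above, not the wrapper's implementation: the function name `exit_code_for`, the dictionary layout, and the sample limit values are assumptions, and the format of error_limits.dat is not modeled.

```python
# Illustrative sketch only; not the actual qhdwcheck wrapper.
# The error types (SCU, DRAM, EDRAM) come from the text above;
# the function name, data layout and numbers are hypothetical.

def exit_code_for(counts, limits):
    """Return 1 if any error count is greater than its limit, else 0."""
    for error_type, count in counts.items():
        # Assumption: an error type with no limit listed is not checked.
        limit = limits.get(error_type)
        if limit is not None and count > limit:
            return 1  # the job should be re-run
    return 0  # all checked counts are within their limits


limits = {"SCU": 10, "DRAM": 0, "EDRAM": 0}  # made-up limits
print(exit_code_for({"SCU": 3, "DRAM": 0, "EDRAM": 0}, limits))   # within limits
print(exit_code_for({"SCU": 11, "DRAM": 0, "EDRAM": 0}, limits))  # SCU over its limit
```

A count equal to its limit does not trigger a re-run here, since the text says the count must be greater than the limit.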
The wrapper stores all error counts, whether or not they exceed the error limits, in a database. The database has a web front end at: http://www3.bnl.gov/qcdoc/qhdwerrors/
