The channel check handler (CCH) helps the I/O supervisor (DMKIOS)
recover from channel errors. CCH provides the device-dependent error
recovery programs (ERPs) with the information needed to retry a channel
operation that has failed.
This support is standard and model-independent at the external level:
from the user's point of view, there are no model dependencies to
consider at system generation time.

SYSTEM INITIALIZATION FOR RMS

DMKCPI calls DMKIOEFL to initialize the error recording at cold start
and warm start. DMKIOEFL gives control to DMKIOG to initialize the MCH
area. A Store CPU ID (STIDP) instruction is performed to determine if
VM/370 is running in a virtual machine environment, or running
standalone on the real machine. If VM/370 is running in a virtual
machine, the version code is set to X'FF' by DMKPRV. If the version
code returned is X'FF', the RMS functions are not initialized beyond
setting the wait bit on in the machine check new PSW (virtual). This
occurs because machine check interruptions are not reflected to any
virtual machine. VM/370, running on the real machine, determines
whether the virtual machine should be terminated.
If the version code is not X'FF', DMKIOG determines which channels are
online by performing a Store Channel ID (STIDC) instruction and saves
the channel type for each channel that is online. The maximum machine
check extended logout length (MCEL) indicated by the Store CPU ID
(STIDP) instruction is added to the lengths of the MCH record header,
the fixed logout area, and the damage assessment data field. DMKIOG
then calls DMKFRE to obtain the storage needed for the MCH record area
(MCRECORD), the CP execution block (CPEXBLOK), MCHAREA, and MCEL. The
address of MCHAREA is put in the PSA (AMCHAREA). Pointers to MCRECORD
and the CPEXBLOK are put in MCHAREA. DMKIOG puts the address of MCEL
in control register 15. DMKIOG obtains the storage for the I/O
extended logout area and initializes the logout area and the ECSW to
ones. The I/O extended logout pointer is saved at location 172 and
control register 15 is initialized with the address of the extended
logout area. The length of the CCH record and the online channel types
are saved in DMKCCH. It should be noted that the ability of a CPU to
produce an extended logout or I/O extended logout and the length of the
logouts are both model- and channel-dependent. If VM/370 is being
initialized on a Model 165 II or 168, the 2860, 2870, and 2880
standalone channel modules are loaded and locked by the paging
supervisor and the pointers are saved in DMKCCH. If VM/370 is being
initialized on any other model, the integrated channel support is
assumed; this support is part of the channel control subroutine of DMKCCH. Before returning to DMKIOE, the VM/370 error recording
cylinders are initialized. DMKIOE passes control back to DMKCPI and
control register 14 is initialized with the proper mask to record
machine checks.

OVERVIEW OF MACHINE CHECK HANDLER

A machine malfunction can originate from the processor, real storage or
control storage. When any of these fails to work properly, the processor
attempts to correct the malfunction. When the malfunction is corrected,
the machine check handler (MCH) is
notified by a machine check interruption and the processor logs out
fields of information in real storage, detailing the cause and nature of
the error. The model-independent data is stored in the fixed logout area
and the model-dependent data is stored in the extended logout area. The
machine check handler uses these fields to analyze the error, format an
error record, and write the record out on the error recording cylinder
of SYSRES.

If the machine fails to recover from the malfunction through its own
recovery facilities, the machine check handler is notified by a machine
check interruption. An interruption code, noting that the recovery
attempt was unsuccessful, is inserted in the fixed logout area. The
machine check handler then analyzes the data and attempts to keep the
system as fully operational as possible.
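The two cases described above can be sketched as a small dispatch routine. This is an illustrative Python sketch only: the `MachineCheck` type, the action strings, and the in-memory `error_log` (standing in for the error recording cylinder) are all hypothetical, and recording the error in both cases is an assumption rather than something the text states.

```python
from dataclasses import dataclass


@dataclass
class MachineCheck:
    """Hypothetical view of the logout data for one machine check."""
    recovered: bool         # False if the interruption code reports a failed recovery attempt
    fixed_logout: bytes     # model-independent data
    extended_logout: bytes  # model-dependent data


def handle_machine_check(mc: MachineCheck, error_log: list) -> str:
    """Format an error record, record it, and choose the next action."""
    record = {
        "recovered": mc.recovered,
        "fixed_logout": mc.fixed_logout,
        "extended_logout": mc.extended_logout,
    }
    # Stand-in for writing the record to the error recording cylinder.
    error_log.append(record)
    if mc.recovered:
        # The hardware corrected the malfunction; nothing further to repair.
        return "resume"
    # The hardware could not recover; software recovery is attempted next.
    return "attempt_software_recovery"
```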
Recovery from machine malfunctions can be divided into the following
categories: functional recovery, system recovery, operator-initiated restart, and system repair. These levels of error recovery are discussed
in their order of acceptability, functional recovery being most
acceptable and system repair being least acceptable:

FUNCTIONAL RECOVERY: Functional recovery is recovery from a machine
check without adverse effect on the system or the interrupted user.
This type of recovery can be made by processor retry, the ECC facility,
or the machine check handler. Processor retry and ECC error correcting
facilities are discussed separately in this section because they are
significant in the total error recovery scheme. Functional recovery by
MCH is made by correcting storage protect feature (SPF) keys and
intermittent errors in real storage.

SYSTEM RECOVERY: System recovery is attempted when functional recovery
is impossible. System recovery is the continuation of system operations
at the expense of the interrupted user, whose virtual machine operation
is terminated. System recovery can take place only if the user in
question is not critical to continued system operation. An error in a
system routine that is considered critical to system operation
precludes functional recovery and requires logout and a system dump,
followed by reloading the system.

OPERATOR-INITIATED RESTART: When the errors may have caused a loss of
supervisor or system integrity, the system is put into a disabled wait
state. The operator is instructed to run the standalone error recovery
(SEREP) program and then manually restart the system.

SYSTEM REPAIR: System repair is recovery that requires the services of
maintenance personnel and takes place at the discretion of the
operator. Usually, the operator has tried to recover by
system-supported restart one or more times with no success.

SYSTEM/370 RECOVERY FEATURES

The operation of the Machine Check Handler depends on certain automatic
recovery actions taken by the hardware and on logout information given
to it by the hardware.
Processor errors are automatically retried by microprogram routines.
These routines save source data before it is altered by the operation.
When the error is detected, a microprogram returns the processor to the
beginning of the operation, or to a point where the operation was
executing correctly, and the operation is repeated. After several
unsuccessful retries, the error is considered permanent.
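The retry scheme can be illustrated with a short software analogy. The real mechanism is implemented in microprogram routines, so this Python sketch only mirrors the logic: save the source data, repeat the operation from the saved state, and declare the error permanent after a threshold. The names `run_with_retry` and `HardwareError` are hypothetical, and the threshold of 3 is an assumption (the text says only "several").

```python
import copy
from typing import Callable, Tuple

MAX_RETRIES = 3  # assumed value; the source does not give a number


class HardwareError(Exception):
    """Hypothetical signal that the operation detected an error."""


def run_with_retry(operation: Callable[[dict], dict], state: dict,
                   max_retries: int = MAX_RETRIES) -> Tuple[dict, bool]:
    """Run operation on state; return (resulting_state, permanent_error)."""
    # Save the source data before the operation can alter it.
    saved = copy.deepcopy(state)
    for _ in range(1 + max_retries):  # first attempt plus the retries
        # Each attempt starts again from the saved, unaltered data.
        working = copy.deepcopy(saved)
        try:
            return operation(working), False
        except HardwareError:
            continue
    # Every attempt failed: the error is considered permanent.
    return saved, True
```

Because each attempt works on a fresh copy of the saved data, a failed attempt cannot leave partial results behind, which is the property the microprogram obtains by saving source data before it is altered.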