
troubleshooting

bergsma edited this page Jul 31, 2014 · 2 revisions

Categories of Automation Failures

  1. Program run-time Errors

  2. HS run-time program error

  3. AutoMan

  4. TP HS run-time errors

  5. Program logic issues

  6. Network LAN Problems

  7. AutoRouter related

  8. Global timeouts

  9. HSMS

  10. SECS Protocol

  11. Terminal Server

  12. Physical Connection

  13. SECS Device ID

  14. Infrastructure Problems

  15. AutoRouter behavior

  16. Soft links

  17. Privilege flags

  18. Disk I/O

  19. HS behavior

  20. AutoMan behavior

  21. ORACLE database

Scope of Automation Failure

  1. Single interface.
  2. Entire group of interfaces (e.g. all tegal's, all semy's).
  3. Entire node (e.g. all of fab3, autoprod-sc, all of site - autoprod-site).
  4. All interfaces, all nodes.

Single Interface Failure

  1. When the CONNECT message was sent, was the target executed by AutoRouter?

The <target> is the first field in the CONNECT message. The target is often a UNIX HS, but can also be a VMS HS, an executable image, perl script, DCL command procedure, etc.

Look in the UNIX AutoRouter log to find the CONNECT message. The incoming CONNECT message from the PROMIS client will trigger the AutoRouter to execute a HS process based on the target of the message.

For example, if the target of the message is "|semyk1@dusc06|event|CONNECT|...." then you should see the line:

...executing image file /fs/local/user/autoprod/run/semyk1

just above where the CONNECT message is logged.

If that line does not exist along with the CONNECT message, then you may have a stuck FIFO. The procedure for stuck FIFOs is explained below.
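The log check described above can be sketched as a small shell function. This is only an illustration: the function name `was_target_executed` is hypothetical, and the location of the AutoRouter log is not given on this page, so the log path is passed in as an argument. The only thing taken from the text is the "executing image file" line format.

```shell
# Hypothetical helper: report whether the AutoRouter log shows that the
# target was executed.
#   $1 = path to the UNIX AutoRouter log (assumed; not specified on this page)
#   $2 = target name, e.g. semyk1
was_target_executed() {
    log_file=$1
    target=$2
    # Look for the line quoted above:
    #   ...executing image file <some path>/<target>
    if grep -Eq "executing image file .*/${target}([[:space:]]|\$)" "$log_file"; then
        echo "executed"
    else
        echo "not executed"
    fi
}
```

For example, `was_target_executed /path/to/autorouter.log semyk1` prints "executed" only if the log contains an "executing image file" line ending in /semyk1.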

If the HS file was not found in $GAEQ/hyp, then the AutoRouter will send back the message

%TARGET: failed to router message to ....

Most targets have log files. If a HS is the target, then check the tail end of the HS log to see if it was executed.

UNIX: tail -f $HSLOG/<target>.log

VMS: type/tail AUTOLOG:<target>.log

Was the target executed?

* YES: Did a Run-time error occur?  

Unexpected HS errors can usually be corrected by editing the source file. The next CONNECT message will run the changed HS. If the HS is not normally started by a CONNECT message from AutoMan or a PROMIS HS script file, such as an equipment server HS, then an open <target> command will start a new HS. Check the log to be sure.

You can also turn on debug tracing by placing the statement
debug(1) ;
in the section of the HS code where you think there is a problem. Look at the log to see a trace of execution.

* YES: Did a logic error occur?  

Logic errors can be difficult to fix, especially if the interface must be made operational as soon as possible. If the logic error was caused by a recent change in the HS, then consider restoring the previous version until the new version can be properly corrected.

* YES: Did a core dump occur?  

UNIX: A core file is produced when a HS program has a segmentation fault, access violation, or other fatal error. Usually, this is inferred from the sudden end of the log file messages, plus the presence of the core file in the top-level directory. You can confirm that a particular core file belonged to a suspect HS by executing the following command: "gdb -core core hs", which shows the trace back information.

VMS: There are no core files, but the log file should contain the text of the "trace back".
* NO: Is the HS file found in the $AUTORUN (UNIX) or AUTORUN: (VMS) directory?

UNIX: A HS in $AUTORUN is soft-linked to the real file. To create a soft-link:
Example: The tegal1 HS is linked to tegal_promis_group, "ln -s ../hyp/tegal_promis_group tegal1"
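Before recreating a link, it can help to check whether the existing entry is a soft link at all and whether it still resolves. The following is a minimal sketch using only the standard `test` operators; the function name `check_hs_link` is hypothetical, and the run directory is passed in explicitly (for example "$AUTORUN").

```shell
# Hypothetical helper: check that the entry for a target in the run
# directory is a soft link that resolves to an existing file.
#   $1 = run directory, e.g. "$AUTORUN"
#   $2 = target name, e.g. tegal1
check_hs_link() {
    link_path="$1/$2"
    if [ ! -L "$link_path" ]; then
        # Missing entirely, or a regular file rather than a link.
        echo "not a soft link"
    elif [ ! -e "$link_path" ]; then
        # The link exists but the file it points to does not.
        echo "dangling soft link"
    else
        echo "ok"
    fi
}
```

A result of "dangling soft link" suggests the real HS file was moved or renamed after the link was made.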

* NO: Is the HS file executable?  

UNIX: All HS files are made executable by the following three steps:

  1. The first line of the HS file must be:

#!/fs/local/area/abinitio/bin/hs_xxx -f

where xxx is the version number. The "-f" on the end is required; make sure it is there.

  2. The execute bit must be enabled, i.e. "chmod +x target".

  3. It should compile correctly, i.e. "hs -cf target".
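The first two of the three conditions above can be checked with standard shell tools. The sketch below is illustrative only: the function name `check_hs_executable` is hypothetical, and the compile check ("hs -cf target") is left out because it needs the hs binary itself.

```shell
# Hypothetical helper: check the first line and the execute bit of a HS file.
#   $1 = path to the HS file
check_hs_executable() {
    hs_file=$1
    first_line=$(head -n 1 "$hs_file")
    # The first line must look like "#!<path>/hs_<version> -f".
    case $first_line in
        '#!'*/hs_*' -f') ;;
        *) echo "bad first line"; return 1 ;;
    esac
    # The execute bit must be enabled (chmod +x).
    if [ ! -x "$hs_file" ]; then
        echo "execute bit not set"
        return 1
    fi
    echo "ok"
}
```

The function stops at the first failed condition, so fix that one and run it again.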

* NO: Was the FIFO stuck?  

A stuck FIFO results when a HS program is terminated abnormally and the AutoRouter is not notified to close down and delete the FIFO. Thus, AutoRouter will continue to route messages, including the CONNECT message, to the FIFO.
Look to see if you can find the FIFO for the HS:

> ls -l $AUTOFIFO

If you see a FIFO special file with the same name as the HS, and the file size is non-zero, then it is probably a stuck FIFO. The most likely cause is that the HS that last accessed the FIFO terminated abnormally.

Try the 'close' command
> close <target>

where <target> is the name of the FIFO.

The AutoRouter may also clear up the stuck FIFO after 10 minutes of inactivity.
As a last resort, stop and start the AutoRouter.
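The first part of the stuck-FIFO check, whether a FIFO special file exists for the target, can be sketched as follows. The function name `fifo_status` is hypothetical, and the FIFO directory is passed in explicitly (for example "$AUTOFIFO"). The sketch tests only for the existence of the FIFO; the file size should still be read from the "ls -l" listing, since how a FIFO's size is reported depends on the system.

```shell
# Hypothetical helper: report whether a FIFO special file exists for a target.
#   $1 = FIFO directory, e.g. "$AUTOFIFO"
#   $2 = target name (same as the HS name)
fifo_status() {
    fifo_path="$1/$2"
    # -p is true only for a FIFO (named pipe), not for a regular file.
    if [ -p "$fifo_path" ]; then
        echo "fifo present"
    else
        echo "no fifo"
    fi
}
```

If the result is "fifo present" while the HS itself is no longer running, the close command described above is the next step.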

* NO: Are the logical names for the target correct?  

AUTOMAN: The VMS group logical is AUTO_<type>_<target>, where type is one of IN, LOTDATA, GENDATA, ABORT, OUT, and target is the name of the $AUTO_RECIPE or $AUTO_DCOP parameters. Check the following:

  1. The value of the target has the correct node, e.g. target@autoprod-sc.

  2. The logical is not overridden at the process level. The AutoMan logicals are group logicals.

  3. The main automation switch OPTION_AUTO is set to YES.
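To see which logical names need checking for one target, the AUTO_<type>_<target> pattern can be expanded for each of the five types. This is a small sketch run from a UNIX shell. The function name `automan_logicals` is hypothetical, and it assumes the target name is substituted into the pattern literally.

```shell
# Hypothetical helper: list the AutoMan group logical names for a target,
# following the AUTO_<type>_<target> pattern described above.
#   $1 = target name as used in the logical (assumed to be substituted as-is)
automan_logicals() {
    target=$1
    for kind in IN LOTDATA GENDATA ABORT OUT; do
        echo "AUTO_${kind}_${target}"
    done
}
```

Each printed name can then be looked up on the VMS side to confirm its value and the table it is defined in.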

  2. Did the interface fail during the transaction process?
  • YES: If the interface was properly coded, all errors (except for core dumps) should be handled by the HS, with the associated error message returned to the calling program (AutoMan or another HS) and also printed in the log file. Often, the error message will provide enough information about how to fix the problem or where to look for further information.

  • NO: A successful automation interface requires every component to succeed. If the HS succeeded, as indicated by its log file, is the same true for all the other log files? Look at all the log files to find out where the error occurred.
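When several log files have to be examined, a quick first pass is to list the ones that contain an error line. The sketch below assumes that error messages start a line with "%", as in the "%TARGET: ..." message shown earlier. That convention is an assumption and may not hold for every component. The function name `logs_with_errors` is hypothetical, and the log directory (for example "$HSLOG") is passed in explicitly.

```shell
# Hypothetical helper: list the *.log files in a directory that contain at
# least one line starting with "%" (assumed here to mark an error message).
#   $1 = log directory, e.g. "$HSLOG"
logs_with_errors() {
    log_dir=$1
    # grep -l prints only the names of files that contain a match.
    grep -l '^%' "$log_dir"/*.log 2>/dev/null
}
```

The output is only a starting point; each listed log still needs to be read to find where the error first occurred.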

  3. Did the AutoMan session freeze?
    (FROZEN MUCHMAIN)

Symptom: Hotkeys disabled, target (HS) dead
AutoRouter will free automatically after 10 minutes of idle time.

On VMS, find the MBX_xxxxxxxx_PROMIS mailbox name.
$ show log mbx_* /sys

Execute the command procedure to kill it
$ @AEQ_SSP:AUTOSTOP MBX_xxxxxxxx_PROMIS

Example: Suppose PROMIS mailbox is MBX_06B81049_PROMIS
$ @AEQ_SSP:AUTOSTOP MBX_06B81049_PROMIS

Interface Group Failure

If an entire group of interfaces is non-operational, then the problem is going to be related to something that all the interfaces share in common. The possibilities are:

  1. The common HS fails for each interface in the same way. See diagnostic procedures for a single interface.

  2. The interfaces share the same SECS port, such as all semy tubes on a single furnace bank.

  3. All the interfaces share a common network node. (In this case, check that ALL interfaces on the node do not work.)

  4. The interfaces share a common resource directory, i.e.: auto/rec, auto/inbox.

  5. The interfaces share a common service: such as a PROMIS TP_script HS service.

Node Failure

An entire node failure can be caused by one of the following:

  1. Network Failure. Contact IT.

  2. Node (CPU) failure. Contact IT.

  3. Disk (I/O) failure. Contact IT.

  4. Infrastructure failure.

  5. VMS AutoRouter is down.
    See system startup procedures.

  6. UNIX AutoRouter is down.
    See system startup procedures.

  7. VMS down.
    Contact IT.

  8. UNIX down.
    Contact IT.
