Autosys Documentation

AUTOSYS FAQ 

Why is my job getting a CHASE error?

chase is an Autosys command that audits the status of some jobs and in some cases takes an action. 

First, chase looks at all the jobs that have been in STARTING state for more than 120 seconds. Since STARTING is a transient state (and should not last that long), chase issues an alert. If you receive this alert, you should verify the real state of the job (e.g. telnet into the client and run ps to see if the process is running) and either change the status accordingly or force-start the job to get it going again (both actions are done with sendevent).

Second, chase looks at jobs with a RUNNING status to see if the remote agent (auto_remote) and the command are running. If neither is running, then it sets the status of the job to FAILURE. The assumption is that either the processes have been killed outside of Autosys or that the machine has been rebooted. Autosys will continue the processing of that job as if it had failed. 

chase also reports when only one of the two, the remote agent or the command process, is missing. This is an abnormal situation and needs to be investigated by the user. A common cause is that the job has run for more than one week and the /tmp/auto_rem.* log of the remote agent has been removed. See the separate FAQ entry about this situation:

    What does this Chase error mean: AutoSys remote agent is NOT running, but the user Job is?

chase also reports an error if it cannot connect to the client machine where it needs to verify the auto_remote and command processes. This indicates a machine or network problem.

In general, a CHASE alarm means either that there is something wrong with your job or with the client machine. It is not a problem with the Autosys infrastructure, thus not an autosysops responsibility. 

For more details, please read the manual page (man chase). We run chase every half hour with the -A -E options. As a user, you should not run chase.

How do I get historical run data for my job? 

    autorep -r -N -J job_name [-d]

Where N is an integer representing the run number counted backwards. E.g. to get the 3rd run back use: 

    autorep -r -3 -J job_name [-d]

What is AutoSys? 

AutoSys is a cross-platform job scheduler.

I don't have time to read all this, where's the AutoSys crash course? 

Sorry, but Autosys is not a simple product like cron; it is not possible to explain everything in a few sentences. To use the AutoSys product effectively and properly, you need to read through a couple of documents, both vendor and internal. Expect to spend approximately one hour reading the documentation.

What are the advantages of using AutoSys? 

AutoSys allows you to set up job dependencies and use calendars. Maintaining your jobs is easier since you assign a name to each job and can add, modify, or delete them all from one machine. You can also load-balance your jobs across a group of machines.

What are the disadvantages of using AutoSys? 

Each region has a single Event Processor that does job scheduling for thousands of jobs. Because of this, you may experience occasional delays of up to a couple of minutes from when your job is scheduled to run. Do not use AutoSys for real time scheduling. Read the ``Autosys Policy Documentation'' for proper usage. AutoSys is not a complete replacement for cron. Make sure you understand both the benefits and risks of using AutoSys.

Do I need a license to use AutoSys? 

Yes

How do I get support for AutoSys? 

You are responsible for troubleshooting your own AutoSys jobs. Limited support will be provided for AutoSys jobs. 

Where can I find more information about AutoSys? 

 

How do I use AutoSys? 

For UNIX, you need to load the proper module: 

     module load 3rd/autosys/instance_name

What GUIs are available for AutoSys? 

AutoSys comes with its own GUI, called autosc, for UNIX and NT.

The recommended method to create/modify/delete AutoSys jobs is through the command line utils supplied by the vendor. You can use any text editor to create your JIL scripts and RCS for version control. It is also recommended that you create a project in AFS to keep all of your JIL scripts together.

Difference between ON_HOLD and ON_ICE 

What is right for you depends on whether the job is in a stream and when you expect the job to run after you take it off hold/ice.

When you put a job ON_HOLD

If the job is in a box, the box will not finish. If the job has dependent jobs, the dependent jobs will not start. When the job is taken off hold (with 'sendevent -E JOB_OFF_HOLD -J job_name'), the job will run immediately (once) if any runs were missed during the ON_HOLD period. Otherwise, it will start at its normal time.

When you put a job ON_ICE

If the job has dependent jobs, the dependent jobs will see its status as SUCCESS. Thus they will run anytime they are scheduled to run and also immediately after the job is put ON_ICE. Thus putting a job ON_ICE can trigger a stream. When the job is taken off ice (with 'sendevent -E JOB_OFF_ICE -J job_name'), it will start at the next scheduled run, not before.

How do I 

Submit a job 

     jil < foobar_jilscript

Delete/Kill/Terminate a job 

     sendevent -E DELETEJOB -J foobar       
     sendevent -E KILLJOB -J foobar
     sendevent -E CHANGE_STATUS -s TERMINATED -J foobar

You use KILLJOB, when the job is running, and you want to kill the process from the process table. If autosys thinks the job is running, but the process is not in the process table, then you want to change the state to TERMINATED. You can't kill a process that no longer exists. 
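The choice between the two can be sketched as a small shell helper, to be run on the client machine as the job owner. It is only an illustration: the function name suggest_action is hypothetical, it assumes you already know the PID of the job's process, and it only prints the matching sendevent command instead of running it. Note that kill -0 also fails for a live process owned by another user.

```shell
# suggest_action PID JOBNAME
# Prints the sendevent command that matches the state of the process.
# Hypothetical helper; it never calls sendevent itself.
suggest_action() {
    pid=$1
    job=$2
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "sendevent -E KILLJOB -J $job"
    else
        echo "sendevent -E CHANGE_STATUS -s TERMINATED -J $job"
    fi
}
```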

Put a job ON_HOLD 

To put a job ON_HOLD, use sendevent as shown below. Note that the command may succeed, but the job status may not be changed to ON_HOLD for a while depending on how long it takes the event processor to get to this event. Also, if the job is at that time in STARTING or RUNNING, the EP will not change its status and you will not get a notification, so you need to follow up with autorep commands: 

     sendevent -E JOB_ON_HOLD -J foobar_job
     autorep -J foobar_job -d

Note also that the sendevent argument is JOB_ON_HOLD (i.e. an event), not ON_HOLD (a status).

To take the job off hold use: 

     sendevent -E JOB_OFF_HOLD -J foobar_job

Note that if the job was scheduled to run during the time when it was ON_HOLD, Autosys will start it as soon as it is taken off hold (if all other conditions are met). 

Update a job using Jil 

     echo "update_job:  foobar_job start_times:  \"0:00\"" | jil

Restart a job 

     sendevent -E FORCE_STARTJOB -J foobar_box

View the event log for a job 

     autosyslog -J foobar

Check status of a job 

     autorep -J foobar
     autorep -J foobar -d

View job definition 

     autorep -J foobar -q

Get a list of all the jobs I own by prod id or machine name? 

There is no vendor provided tool to do this. You can try to use jilgrep 

Examples: 

     jilgrep NYA pleung
     jilgrep NYT -q -h hqfid1

Read/write to AFS using AutoSys? 

This is the same as with cron: you need to wrap your job with kcron and add a ticket for your id to /var/spool/tickets.

In general, you need to use kcron whenever you need afs tokens to access an afs dir. Most dirs in afs are world readable, so you may not need to use kcron if you just want to read an afs dir unless it is a protected dir. 

Redirect stderr and stdout? 

Unless you specify the redirection, stderr and stdout will go to /dev/null by default. It makes sense to define the job attributes for stderr, and stdout so you can have some logs to look at for troubleshooting a job failure. 

     std_out_file:  /var/tmp/${AUTO_JOB_NAME}.stdout
     std_err_file:  /var/tmp/${AUTO_JOB_NAME}.stderr

The default is to append to the file if you do not specify either > or >>. ${AUTO_JOB_NAME} is an autosys environment variable for the name of your autosys job. 

Find all dependencies on a job? 

Use job_depends 

     job_depends -d -J job_name

Convert an existing cron job to jil automatically? 

There is a command cron2jil that may be able to do this. However, it is not recommended. Use at your own risk. 

Add retry logic to my job? 

If a job starts and exits with FAILURE status, by default it will not attempt to restart again. Use the job attribute n_retrys to change this behavior. 

     n_retrys:  attempts

Control load balancing 

You can do this by defining virtual machines, and job loads for your machines. Please read Chapter 9 in the ``AutoSys User Guide for UNIX.'' 

Prevent Autosys from TERMINATE-ing my job 

When Autosys puts a job in TE status it is because the command had an abnormal end, e.g. dumped core. This is different from either the command returning an error exit (which results in FA) or from a user killing the job with the sendevent command. 

Naturally, the best way to ensure that Autosys does not TERMINATE your job is to make sure the process runs clean every time. However, if that is not practical (yes, we've seen worse shortcuts), the following approach can be used. 

When a process terminates abnormally, its parent, if it is waiting for it, can tell this from the return code. (This is in fact what the auto_remote agent does to determine whether the process ended abnormally.) So one solution is to have a perl wrapper that actually runs the command and interprets the return code. Then, the wrapper can either restart the process itself or exit with a value that is interpreted as failure (typically non-zero).

The following perl code (contributed by Noel Yap) is an illustration: 

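  # Illustration only: the command to run is taken from the arguments.
  my $cmd = join( ' ', @ARGV );
  my $result = system( "$cmd" );
  if( $result == -1 || $result >> 8 == 0xff )
  {
    # system() returns -1 if the command could not be started at all.
    print "Unable to run $cmd\n";
  }
  elsif( $result == 0 )
  {
    print "$cmd succeeded\n";
  }
  elsif( $result > 0xff )
  {
    # The high byte holds the exit status of the command.
    $result >>= 8;
    print "$cmd failed with status $result\n";
  }
  else
  {
    # The low byte holds the signal number; bit 0x80 flags a core dump.
    my $message;
    $message = "$cmd terminated with ";
    if( $result & 0x80 )
    {
      $result &= ~0x80;
      $message .= "coredump from ";
    }
    $message .= "signal $result\n";
    print $message;
  }
  # Any outcome other than success is reported as a plain failure.
  exit( $result != 0 ? 1 : 0 );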
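
The same idea can be sketched in plain sh, for cases where perl is not convenient. This is an illustration under assumptions, not part of the contributed code: run_wrapped is a hypothetical name, and it relies on the common sh/bash convention that a command killed by a signal yields an exit status of 128 plus the signal number, so a command that itself exits with a value above 128 would be misreported.

```shell
# run_wrapped CMD [ARGS...]
# Runs the command and maps every outcome to 0 (success) or 1 (failure),
# so an abnormal end is reported as an ordinary non-zero exit.
run_wrapped() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "$1 succeeded"
        return 0
    elif [ "$status" -gt 128 ]; then
        # Shell convention: 128 + signal number for a signalled child.
        echo "$1 terminated by signal $((status - 128))"
    else
        echo "$1 failed with status $status"
    fi
    return 1
}
```

The job's command attribute would then point at a script that calls run_wrapped with the real command, so Autosys sees FAILURE instead of TERMINATED.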

I accidentally deleted my job. How can I recover a deleted job? 

You can try looking for your job in the dump/archive directory. Cut and paste the jil for your job into a temporary file, and resubmit it to autosys.

What is a box? 

What is a file watcher? 

What are the job naming conventions for AutoSys? 

The purpose of job naming conventions is to make support easier for both yourself and autosysops. You want to have some way of grouping all of your jobs, and you need to know the name of a job in order to find it using autorep. Here's how to name your jobs:

     group-jobname-prod-city
     group-jobname-dev-city

city is the city that the instance belongs to, e.g. for NYA it is ny.

Example: 

     foo_group-foobar-prod-ny

The max length of a job name is 30 characters. A lot of users of AutoSys complained because this job naming convention does not leave many characters for a jobname. So, the alternative is to remove prod and dev; append an A to the city name for production, or T for test. 

Example: 

     foo_group-foobar-lna
     foo_group-foobar-lnt

Why can't I update my job anymore? 

There are a few possibilities: 

a) You forgot to set the job permissions with mx,me. This allows you to modify the job from any machine. If you don't specify mx,me, you will only be able to update the job from the same machine you submitted it from. If the machine you originally submitted the job from got renamed or decommissioned, you will need autosysops to update the job for you.

b) You are not the owner of the job. 

What is a profile and how do I use it? 

A profile for AutoSys is similar to the user profile for UNIX. It is a file that gets sourced before your job gets started. Setting up a common profile is useful if you want to set the same environment variables before startup for many jobs. The profile is a job attribute:

     profile:  /xyz/dist/foo/etc/autosys_profile

Why is my job stuck in starting state? 

After it connects to the remote agent and gets a positive acknowledgement from it that the command has been started, the event processor (EP) marks the status of the job as STARTING, and moves on to the next job. It is the remote agent that further updates directly in the database the job status to RUNNING and then SUCCESS or FAILURE. 

Thus, a job stuck in STARTING denotes lack of communication between the remote agent (or its machine) and the database. It also occurs if the machine was shut down while the job was being started. 

The STARTING state is transient. A command named chase is run automatically every 30 minutes. This command, among other things, raises an ALARM for jobs that have been in STARTING for more than 120 seconds. Moreover, while a job is in STARTING (as well as in RUNNING), Autosys will not start the job on its normally scheduled runs, so as not to have the same job running twice in parallel. So effectively, a job stuck in STARTING is not being run.

To get the jobs out of this status, manual intervention is required as Autosys cannot know if the job has run or not. 

To verify if a job is stuck in starting state, run 

    autorep -J jobname -d

You may see one or multiple lines stating that CHASE has seen the job ``stuck in starting state for over 120 seconds''. 

You should first log into the remote machine to verify that the job is indeed not running. If it is running, you need to investigate why the remote agent did not update the status in the database. But most likely, the job is not running and typically the machine has been rebooted. 

You then need to change the status of the job to TERMINATED (or FAILURE or SUCCESS) as shown below: 

    sendevent -E CHANGE_STATUS -s TERMINATED -J jobname

After that, Autosys will start your job for future scheduled runs, or you can force-start it immediately if you wish.

What does this Chase error mean: AutoSys remote agent is NOT running, but the user Job is?

The full message as seen in the autorep -d -J ... output is:

  [*** ALARM ***]
    CHASE         04/19/2015 08:24:07  0  PD  04/19/2015 08:24:08
    <*** ERROR: AutoSys remote agent is NOT running, but the user Job is, on machine = lncmd1>
  [*** ALARM ***]
    CHASE         04/19/2015 08:54:14  0  PD  04/19/2015 08:54:16
    <*** ERROR: AutoSys remote agent is NOT running, but the user Job is, on machine = xxcmd1>
  ...

Most likely the /tmp/auto_rem.* file for this job execution has been removed. Chase checks for the existence of the auto_remote agent for a running job by probing if this file is locked, so a missing log file is interpreted as the auto_remote process having died. 

This situation happens typically if the job has been running for more than one week. On Aurora machines, a janitor script removes all the /tmp/* files older than 1 week. Thus, we recommend that jobs don't run for more than 1 week. 

Why doesn't fsql work with my autosys job? 

fsql requires environment variables that are not available in autosys. Try adding the line below to your command:

     USER=`/usr/ucb/whoami`; export USER

It is written this way because autosys uses sh, not ksh.

What is the timezone for my job, and how do I change this? 

The timezone for your job will be the timezone of the Event Processor for the autosys instance. Suppose the machine you want to run your job on is in California, the instance you use is NYA, and the EP is in New York. Obviously, you don't want the job to run using the Eastern time zone. You can change the time zone by specifying the job attribute:

     timezone:  my_time_zone

You can get a list of all available time zones by running the command below: 

     autotimezone -l

What are calendars and how do I use them? 

Autosys lets you define a calendar to specify the list of dates to run your jobs. Specify the job attribute for calendar: 

     run_calendar:  calendar_name
     exclude_calendar:  calendar_name

You can get a list of all the calendars by running the command below: 

     showautocal

To learn more about calendars, consult the AutoSys documentation.

I need a new calendar defined, how can I do this? 

Only members of autosysops and autosys-admin can define new calendars. Before you request a new calendar, verify that none of the existing calendars satisfies your needs.

Non-standard, custom user defined calendars are NOT supported, because these must be updated manually. 

How do I use Netcool with AutoSys? 

This can be done, as long as you specify the proper attributes for Netcool. Please define your rule using the attributes as below: 

     Pattern on Manager      =  Autosys
     Pattern on Agent        =  your_jobname
     Pattern on Class        =  unix
     Pattern on Status       =  production
     Pattern on Machine Type =  server

Additionally, you can use Pattern on Group to filter on the alarm type. Alarms are listed in the Reference Guide, in Appendix A under Alarms.

Also, you can use the AlertKey to filter on the Autosys instance. This field contains something like: 

    xyz machine: <>

Thus the following pattern should work when INSTANCE is replaced with an actual three letter instance name: 

    Pattern on AlertKey = INSTANCE.*

Define a schedule to specify the time of the week to be notified by email/beep. Then define a policy to tie together the rule with the schedule. 

 

My job failed, but why didn't I receive a beep/email from Netcool? 

Autosys sends the alarm to separate servers: 

This reduces the chance of the alarm getting lost on the network, but there is still no guarantee of delivery; sometimes the SNMP traps get lost over the network. If you see the alarm when you run autorep -J jobname -d, Autosys did the right thing: it tried to send the SNMP trap to the Netcool servers. The notification layer is totally separate from autosys.

You can try to add an extra layer of notification to your job. This job can call ``msbeep'' and ``mailx'' directly to beep/email you. Of course, this assumes that the email and pager services are working properly.

How do I append the current date to a file watcher/stderr/stdout? 

Calling your filename something like filename.`date "+%m-%d-%y"` will not work. You need to define an environment variable in your profile that calls the date command, i.e. 

     DATE=`date "+%m-%d-%y"`

Then, your filewatch filename would be filename.${DATE} 
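
As a hedged sketch, the profile fragment could look like the following. The variable name DATE comes from the text above; the echo line is only there to show how the name expands and is not needed in a real profile. Keep in mind that the date is computed once, when the profile is sourced.

```shell
# Profile fragment: compute the date once, when the profile is sourced.
DATE=`date "+%m-%d-%y"`
export DATE

# Illustration only: what filename.${DATE} expands to.
echo "filename.${DATE}"
```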

How do I automatically rotate my log files for stderr/stdout? 

You can do this yourself by using an autosys job plus gzip or agelog. See ``man agelog''. 

How can I use "module load foo/bar" for my command attribute in jil? 

Autosys uses sh (Bourne shell) by default, and this cannot be changed. You can put your commands into a shell script that first calls module load, and use that script as the job's command. Here's an example:

     #!/bin/ksh
     module load foo/bar/1.0
     command1
     command2

If you're doing this with perl, you can load the module from within your perl script. You may want to consider using the perl module Env::Modulecmd 

My job does not work, so there has to be a problem with the AutoSys infrastructure. How can I know for sure? 

Don't be so quick to blame the AutoSys infrastructure; make sure you didn't make any obvious mistake in your job definition. You can easily check the functionality of the AutoSys infrastructure by running chk_auto_up. This checks to make sure that the instance is up. Anyone can run this command. 

Example: 

     $ module load 3rd/autosys/
     $ chk_auto_up
     ______________________________________________________________________________
     Attempting (1) to Connect with Database: ABC:autosys
     *** Have Connected successfully with Database: ABC:autosys. ***
     ______________________________________________________________________________
     Connected with Event Server: ABC:autosys
     ______________________________________________________________________________
     ______________________________________________________________________________
     Checking Machine: <>
     Primary Event Processor is RUNNING on machine: <>
     Checking Machine: <>
     No Event Processor is RUNNING on machine: <>
     Checking Machine: <>
     Primary Event Processor is RUNNING on machine: <>
     ______________________________________________________________________________

This means that the Event Processor for Autosys, which is on hqsas130, is able to connect to Sybase without any issues. You should also log onto the Event Processor to verify that jobs are being processed: 

     telnet <>
     $ cd /var/tmp
     /var/tmp 208$ ls -l *

You can also try to autoping the machine you want to run your job on. 

     $ autoping -D -m <>
     AutoPinging Machine []  AND checking the Remote Agent's DB Access.
     AutoPing WAS SUCCESSFUL!

Changes were made to my job some time ago. Is it possible to view the logs for modifications to a job? 

Yes. Use the command autotrack:

    $ autotrack -h
     

How do I use AutoSys with a HA/VCS cluster pair? 

Define the job to use the cluster service name as the machine attribute. Thus, the job will be started on the active side. 

Note: Autosys will NOT restart your job automatically if one machine crashes and the service is failed over. A poor approach is to set n_retrys > 0 and wait for chase to run (it does so every 1/2 hour) to find out that the job is not present and thus change its status to FAILURE. (Autosys does restart a failed job up to n_retrys times.) It is preferable that you use VCS to start and failover processes (daemons) that need to run continuously on a cluster. 

What is the maximum delay I can expect with AutoSys?

The total amount of delay you will experience will heavily depend on: 

     the load on the sybase server
     the load on the event server
     the load on the machine you want to run your job
     the number of jobs getting kicked off at the same time as your job

For any delay over 5 minutes, an alert is generated and email is sent to autosys-users and autosys-admin.

It works from command line, but not in AutoSys. Why? 

When you run your command from the command line, you have kerberos tickets, but not when you use AutoSys or cron. The people who ask this question are generally either using rsh or writing to AFS. Another possibility is a difference in environment variables.

The default rsh is /bin/rsh, which is kerberized rsh. Try using /bin/rsh or /usr/ucb/rsh from command line, which does not rely on kerberos. If you can get that to work, you should have no problem with using rsh command in an AutoSys job. Otherwise, please work with your local UNIX support group. 

If you're using rsh, you may also be interested in using ersh (/bin/ersh). rsh returns the exit code of rsh, not the remote command supplied to rsh. ersh solves this problem, see ``man ersh'' for more details. 

For reading/writing to AFS, you need to wrap your command with kcron, and have tickets installed in /var/spool/tickets. If your command forks off child processes, then make sure you use kcron with the -e option; see ``man kcron'' for more info. Also, please make sure you have the proper acls set for the AFS directory you want to read from or write to. 

Lastly, when you run a job in autosys or cron, you only get a small subset of environment variables. Unlike a normal login terminal, you don't have a profile sourced automatically when you're doing things from autosys/cron. What you should do is source the profile manually prior to executing your job in autosys/cron.
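
Sourcing a profile before the real command can be sketched as a small wrapper. This is an assumption-based illustration: run_with_profile is a hypothetical name, and the profile path you pass should contain a slash so that the "." command does not search PATH for it.

```shell
# run_with_profile PROFILE CMD [ARGS...]
# Sources PROFILE in a subshell, then runs the command with that environment.
run_with_profile() {
    (
        profile=$1
        shift
        # Fail early if the profile is missing, rather than run with a bare environment.
        [ -r "$profile" ] || { echo "cannot read $profile" >&2; exit 1; }
        . "$profile"
        exec "$@"
    )
}
```

A job's command script could then end with a call such as run_with_profile "$HOME/.profile" my_command, where my_command stands for your own program.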

What is jobscape, hostscape, timescape? 

These are vendor provided GUI based applications to help you manage your jobs. 

Why can't I start the autosys GUI? 

If you're using autosys 3.5.0, and you get the error below when launching autosc 

    ld.so.1: autosc: fatal: relocation error: file autosc: symbol
    __snprintf: reference symbol not found

it's because you don't have enough free colors available in your colormap; there's a good chance you're using 8 bit graphics on old unix hardware. You should either quit out of applications that consume a lot of colors, i.e. netscape, or log out of your workstation and log back in. 

If you're using autosys 4.0, the GUI usage is permission based. We want to move away from using the old GUI tools in favor of the new Java GUI web based tools that CA provides with 4.0. 

When is AutoSys going to get BCP? 

The sybase servers and event processors for autosys are BCP. It is the user's responsibility to put autosys jobs on servers that are BCP. 

Is there a PERL API for AutoSys? 

No, there is no PERL API available for AutoSys. 

 

How do I use Autosys with Linux? 

It's basically no different from Solaris. Note however that the Linux environment can be different. Environment variables on which you may have grown dependent on Solaris may not be present on Linux; commands may be found at different locations (or not found as the PATH may be different), etc. 

How do I see what real machines are contained in a virtual machine? 

     autorep -M virtual_machine -q

To see a list of all real and virtual machines defined:

     autorep -M ALL 

How do I see a list of all autosys variables? 

     autorep -G ALL

What does Read stream socket FAILED in autorep -J jobname -d mean? 

When a job is started, the event processor connects to the remote agent, auto_remote, on the client machine. After the TCP connection is established (via inetd), auto_remote is started and the event processor expects some input from it (the EP reads on the opened socket). If there is no input for a certain period, the read system call in the EP is interrupted, and the message that you see with autorep -J jobname -d is: Read stream socket FAILED.

This type of problem can be due to many different reasons: high cpu load on client machine, latency on the network, backups kicking in, AFS time outs, infrastructure outages. 

If this type of error happens, Autosys will put the job in a state called RESTART and will try to restart it up to 4 more times, at increasing intervals of between 5 and 20 minutes.

How often can I run a job? 

The autosys policy forbids running jobs more often than every 10 minutes. That is to protect autosys from overload. But there are other reasons why users should not use Autosys for jobs that need to run more often than that: 

  • Autosys can have significant job start delays (5-10 mins) and your job will not be started on time. 
  • If your job's end is delayed past the time interval (either because of autosys delays or because your job occasionally takes longer to execute than the interval), autosys will not make up for that job start. It will start the job on the next start time. So your job may miss a run. 

If you need to run a program often and you would like to use the monitoring capabilities of Autosys, you can use the script pointed to below. It will start your job with any frequency, and if the job fails, it will return a failure to autosys.

 (The script is as is, unsupported, but others have used it without problems.) 
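
The script referred to above is not reproduced here. Purely as a hypothetical sketch of the same idea (not that script), a wrapper could run a command repeatedly inside one Autosys job and stop with a failure as soon as one run fails. The name run_every and its arguments are assumptions made for this illustration.

```shell
# run_every INTERVAL_SECONDS COUNT CMD [ARGS...]
# Runs the command COUNT times, sleeping INTERVAL_SECONDS between runs.
# Returns the command's exit status as soon as one run fails.
run_every() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" || return $?
        i=$((i + 1))
        # No need to sleep after the final run.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, run_every 60 10 my_command would run my_command (a placeholder for your own program) ten times, one minute apart, and the Autosys job would go to FAILURE on the first failed run.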

What happens to my jobs if sybase is unavailable because of maintenance? 

Suppose the sybase server is unavailable for a failover between 11:30pm and 12:00am. Then the following will happen: 

In brief, the side effect will be a 30 minute delay in job scheduling. 

Any jobs that get kicked off before 11:30pm will run; however, the job on your client machine will not be able to update sybase with the status (success or failure) until sybase is back at 12:00am. So, any downstream jobs that depend on the success or failure of those jobs won't be able to start until 12:00am at the earliest. Any jobs that are scheduled to run during the window between 11:30pm and 12:00am won't start until sybase is back online and chase runs to kick-start the jobs. chase runs every 30 minutes.

What are the Autosys environment variables? 

The following environment variables are defined in the environment of a job when the command is executed.

AUTOPID 

The PID of the auto_remote process spawned for that job run. 

AUTORUN 

The value of the run_num/ntry. 

AUTOSERV 

The three letter Autosys instance identifier. 

AUTOSYS 

The Autosys root directory. Useful in referring to the Autosys executables path as $AUTOSYS/bin.

AUTO_JOB_NAME 

The name of the autosys job being executed. This is useful in creating job-specific output file by using the variable in the file name. 

AUTO_JOB_PID 

The PID of the command process, i.e. the direct child of the auto_remote. 

For other environment variables, see the next Q and A. 

What exactly is a job's environment under Autosys? 

A job's environment depends on many variables (pun not intended), including the OS and the user. Therefore the best way to find out what is actually defined is by setting up a job like the one below: 

    /* ----------------- gvp-printenv ----------------- */ 
    insert_job: gvp-printenv   job_type: c 
    command: /usr/ucb/printenv
    machine: macallan
    permission: mx,me
    std_out_file: /tmp/gvp-printenv.out
    std_err_file: /tmp/gvp-printenv.err
    alarm_if_fail: 0

After running it (using sendevent -J gvp-printenv -E STARTJOB), the file /tmp/gvp-printenv.out will contain the output of printenv.

The above works on Solaris. On Linux the command should be

    command: /usr/bin/printenv

This method is also useful when you define the job profile to make sure it works as expected. 

Do not use undocumented environment variables, as they may not be supported in future releases. 

What happens if Autosys cannot start a job?

Autosys starts jobs on a client by connecting to the client's inetd daemon's port for the auto_remote agent. Thus, inetd starts an auto_remote process, which in turn executes the command specified in the job definition. If any of these steps fails, it is considered a system failure. System failures include: 

  • not being able to connect to the client (network problems, client down, inetd not running) 
  • inetd not able to fork an auto_remote process 
  • auto_remote process not being able to open a file, including the command, profile, stdout or stderr files 

In case of system errors, Autosys reports a job failure and schedules the job to be restarted after an interval governed by the following configuration parameters (in $AUTOSYS/autouser/config.$AUTOSERV): 

    # Max number of times to RESTART a job due to system errors
    MaxRestartTrys=5
    # Formula for computing the Wait time between restart attempts:
    #   WaitTime = RestartConstant + ( Num_of_Trys * RestartFactor )
    #   if ( WaitTime > MaxRestartWait )  then  WaitTime = MaxRestartWait
    RestartConstant=330
    RestartFactor=300
    MaxRestartWait=1230

(Note that through experience we have learned that MaxRestartTrys=5 means in fact that the job start will be attempted a total of 5 times. This includes the initial start and 4 retries.) 
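
To make the formula concrete, here is a small worked calculation using the values from the excerpt above. It is a sketch only: restart_wait is a hypothetical helper name, and the source does not say whether Num_of_Trys starts at 0 or 1 for the first retry, so the loop simply prints a range of values.

```shell
# Values copied from the configuration excerpt above.
RestartConstant=330
RestartFactor=300
MaxRestartWait=1230

# restart_wait NUM_OF_TRYS
# Prints the wait time in seconds before the next restart attempt.
restart_wait() {
    wait_time=$((RestartConstant + $1 * RestartFactor))
    if [ "$wait_time" -gt "$MaxRestartWait" ]; then
        wait_time=$MaxRestartWait
    fi
    echo "$wait_time"
}

for n in 0 1 2 3 4; do
    echo "Num_of_Trys=$n -> $(restart_wait "$n") seconds"
done
```

With these values the wait grows by 300 seconds per attempt, starting from 330 seconds, and is capped at 1230 seconds (about 20 minutes).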

System failures are different from job command failures. Autosys considers that a command has failed if it exits with a value greater than zero (or greater than max_exit_success, if specified). In this case, the number of retries is governed by the n_retrys job attribute. The timing of the retries is the same as above.

With this knowledge, running the command below will help troubleshoot the type of error: 

    autorep -d -J job_name

Why did my job run twice (for jobs with multiple conditions)?

Consider this typical scenario, a job depending on the success of two other jobs, and no time conditions: 

    insert_job: job_A
    ...
    condition: success(job_B) && success(job_C)

Question: when will job_A run? Well, it depends. 

Assume that all three jobs were inserted at the same time. This means that they are all in INACTIVE (IN) state. Let's say job_B completes successfully. At that time, Autosys (knowing that job_A depends on job_B) evaluates the starting conditions for job_A. They are obviously not met since job_C is not in SU. 

Later, job_C ends with SU. Autosys evaluates the starting conditions for job_A, finds both job_B and job_C in SU and thus starts job_A. 

Now, let's say job_B runs again and finishes with SU. Autosys evaluates the starting conditions for job_A and finds that both job_B and job_C are SU, thus it starts job_A again, even though job_C has not run since.

This is surprising for some. What needs to be understood is that: 

  • Autosys evaluates whether to start job_A only when an event on the predecessor jobs (job_B, job_C) occurs. 
  • The time stamp associated with the status of a job (e.g. SU in this case) is irrelevant. In our example, job_C's SU can be from 3 seconds or 3 months ago. 

There are several ways to avoid this trap. One is to place all jobs in a box. When the box starts, all jobs' statuses are set to ACTIVATED (AC), thus erasing any previous SU. Another is to have job_A's command issue sendevent calls that set the status of job_B and job_C to INACTIVE. Both these methods have side effects, so handle with care.

The situation can be more complex when job_B and job_C have a different owner, because you cannot use sendevent to change their status. In this case, you can use dummy intermediary jobs, e.g. job_B1 and job_C1, which you would own. These jobs depend on job_B and job_C respectively. Your job_A should now depend on job_B1 and job_C1, the status of which you can reset. The dummy jobs can execute a command that is known to always return success, such as /bin/true.
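
The dummy intermediary jobs might be generated as sketched below. This is an illustration under assumptions: dummy_job_jil is a hypothetical helper, some_machine is a placeholder for a machine you are allowed to run on, and the function only prints JIL text so that you can review it before piping it to jil.

```shell
# dummy_job_jil UPSTREAM_JOB DUMMY_JOB MACHINE
# Prints a JIL definition for a dummy job that succeeds whenever the
# upstream job (owned by someone else) succeeds.
dummy_job_jil() {
    cat <<EOF
insert_job: $2   job_type: c
command: /bin/true
machine: $3
condition: success($1)
EOF
}

# "some_machine" is a placeholder; pipe the output to jil to submit it.
dummy_job_jil job_B job_B1 some_machine
dummy_job_jil job_C job_C1 some_machine
```

The condition of job_A would then read success(job_B1) && success(job_C1).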

How does Autosys start a job with date/time conditions AND dependencies on other jobs? 

For a job like the one in the example below, Autosys acts as follows:

    insert_job: gvp-doubledepend-B
    machine: macallan
    command: date
    condition: S(gvp-doubledepend-A)
    date_conditions: 1
    days_of_week: all
    start_times: "15:25"

Assume that initially gvp-doubledepend-A's status is not SUCCESS. At 15:25, Autosys attempts to start gvp-doubledepend-B, but after verifying the status of its predecessor, it does not. The Event Processor log reads:

    [1] [15:27:03.214982] EVENT: STARTJOB         JOB: gvp-doubledepend-B
    [1]   
    [1]   This job's starting conditions have not been met - cannot be started.
    [1]   

When, at a later time, gvp-doubledepend-A's status becomes SUCCESS, gvp-doubledepend-B is started immediately. 

If the status of gvp-doubledepend-A becomes SUCCESS again, it will NOT trigger another run of gvp-doubledepend-B before next day at 15:25. 

As long as the status of gvp-doubledepend-A remains SUCCESS, the next run of gvp-doubledepend-B will occur at 15:25 every day without waiting for a new change in the status of gvp-doubledepend-A. (In other words, you should make sure that gvp-doubledepend-A's status is reset -- to, for example, INACTIVE -- or gvp-doubledepend-B will run based on the old status.) 
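
Resetting the predecessor's status, as recommended above, might be scripted as follows. This is a sketch under assumptions: reset_status is a hypothetical helper name, it presumes you own the jobs (sendevent cannot change the status of jobs you do not own), and it uses the CHANGE_STATUS event shown elsewhere in this FAQ with INACTIVE as the target status.

```shell
# reset_status JOB [JOB...]
# Sets each named job back to INACTIVE so that an old SUCCESS
# no longer satisfies the conditions of dependent jobs.
reset_status() {
    for job in "$@"; do
        sendevent -E CHANGE_STATUS -s INACTIVE -J "$job" || return $?
    done
}
```

For the example above, the call would be reset_status gvp-doubledepend-A, issued once gvp-doubledepend-B has completed.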

What does EVENT_HDLR_ERROR mean? 

It means that Autosys ran into a problem with your job that it could not solve by itself. This requires that you take a look at the status of the job and possibly make some manual adjustments. 

Two situations when this may happen come to mind: 

  • Sybase server goes down and EP cannot update event status 
  • User changes status in the middle of job run to a value that is inconsistent 

These situations are typical, but by no means exhaustive. 

Bottom line: check your job, at least with autorep -d.