Note: on the SLC6 cluster the files have moved to:
less /var/lib/torque/spool/[PBS_ID].OU for output, or [PBS_ID].ER for error

Webpage Monitoring (inc Hadoop)

  For the SLC5 Faraday cluster (accessible using qsub, qstat on linappserv0)
 For the SLC6 Faraday cluster (accessible using qsub, qstat on linappserv5)


For the Hadoop file system, information can be accessed from the following URL when connected to the pp private network. Stella does not appear to be sufficient, so connect via an ssh tunnel or a Firefox window from linappserv1.

Help! My jobs queued and died and I did not get a log file

This normally happens if one of the nodes has a problem. If you were fortunate enough to run a group of jobs, look at the nodes they were sent to (qstat -tn1). If you see a sequence of node numbers that then stops (e.g. node28, node29, node30) with no further jobs running, the likelihood is that the next node (node31) has a problem. The system sees that the node is idle and passes jobs to it. The jobs then fail to run without any error being reported back to the system, so the next queued job is also sent to that node. In this way, a whole range of jobs can disappear down a black hole.

If you suspect this, check the webpage monitoring system to see whether there are any other nodes without any jobs running on them, and then send an email to sysadmin. You can sometimes check by attempting to ssh onto the node. If you cannot, there is a problem. If you can, see whether you can access cvmfs. In some cases automount will have failed, which causes new jobs to fail when they are sent to the node.
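The ssh and cvmfs checks described above can be sketched as a small shell function. This is only an illustration: the function name check_node and its return codes are hypothetical, and the cvmfs repository path must be supplied by you (use the repository your jobs actually need), since listing a specific repository is what triggers automount.

```shell
# Hypothetical helper: check whether a worker node is reachable and can see cvmfs.
# usage: check_node <node> <cvmfs_repository_path>
check_node() {
    node="$1"
    cvmfs_path="$2"   # assumption: a repository path under /cvmfs that your jobs use

    # ConnectTimeout stops ssh hanging for a long time on a dead node.
    if ! ssh -o ConnectTimeout=10 "$node" true; then
        echo "$node: cannot ssh, the node probably has a problem"
        return 1
    fi

    # Listing the repository makes automount try to mount it.
    if ! ssh -o ConnectTimeout=10 "$node" "ls '$cvmfs_path' >/dev/null 2>&1"; then
        echo "$node: ssh works but $cvmfs_path is not accessible (automount may have failed)"
        return 2
    fi

    echo "$node: ssh and cvmfs look fine"
}
```

For example, check_node node31 followed by your repository path would report which of the two checks fails. Either failure is worth including in the email to sysadmin.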


Deleting all your jobs quickly

You can do this with qselect and xargs:
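A minimal sketch of such a pipeline, assuming the standard Torque commands qselect (with -u to select by owner) and qdel. The function name delete_my_jobs is hypothetical and only wraps the one-liner for convenience.

```shell
# Hypothetical helper: delete every job owned by a user (default: yourself).
# usage: delete_my_jobs [username]
delete_my_jobs() {
    user="${1:-$(id -un)}"
    # qselect -u prints one job ID per line; xargs passes them all to qdel.
    # -r (GNU xargs) skips running qdel when qselect prints nothing.
    qselect -u "$user" | xargs -r qdel
}
```

Called with no argument, this targets your own jobs. Typed directly at the prompt, the equivalent is: qselect -u $USER | xargs qdel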

