Tuesday, October 8, 2013

Unexpected AIX reboots on random days, always at 6:00 am

I'm puzzled by one of the LPARs on an AIX box, a 9133-MMA (570). 
It runs AIX 5.3 TL7 and, since January 2nd, it has rebooted three times without being scheduled to do so. 
There is only one crontab entry at 6:00 am, and it starts a "dsmc incremental" (a backup). 
This LPAR runs only Oracle RAC 10g, and another LPAR on another node (same model, same configuration) is the other side of the cluster. 
No script contains "shutdown" or "reboot". 
Where else should I look? 
We had a problem with a lost disk connection. 
Oracle RAC reboots a node when something goes wrong (such as losing a disk), in an attempt to recover from the abnormal condition, so we are taking some steps to avoid this: 
1.- SAN microcode update 
2.- Oracle RAC upgrade to 10.2.0.4 
3.- Upgrade AIX to TL 10 SP1 (we don't want to be at the latest level)

10g prior to 10.2.0.4 had a bug in the CSS timeout computation. Check with 
your DBA and make sure the settings were adjusted accordingly. 
Subject: CSS Timeout Computation in Oracle Clusterware 
Doc ID: 294430.1 
Type: BULLETIN 
Modified Date: 19-APR-2009 
Status: PUBLISHED


I hope the DBA is looking at CSSD.log and CRSD.log (though I have no idea about these); these are the Oracle RAC logs. I recall that Oracle's recommendation was to have a dedicated switch between the two cluster nodes, instead of a simple crossover cable, for the heartbeat. The other potential cause could be the version of Oracle RAC itself; it may require an upgrade. To confirm, we did both things: a switch between the RAC nodes and the RAC upgrade.
AIX 5.3 TL 7 runs Oracle 10g Release 2, and yes, we had the same reboot issue. The cause of the AIX reboot was Oracle RAC. I would strongly suggest asking your DBA to check the RAC logs; there must be some evidence of this crash. RAC stops the services and bounces the OS.


You may find that Oracle is hitting a heartbeat timeout, deciding one node is 
dead, and kicking it out of the cluster (RAC). As part of that process, it will 
crash / reboot the evicted node in an attempt to bring it back into 
operation. Have the DBA team open an Oracle TAR ticket, upload the requested 
logs, etc., to get to the root cause / suggested fix. 
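
If it helps while waiting on the TAR, here is a rough Python sketch of how one might pull out the log lines written shortly before a known reboot time, to see whether heartbeat or eviction messages show up just before 6:00 am. This is only an illustration under assumptions: it assumes each log line carries a "YYYY-MM-DD HH:MM:SS" timestamp somewhere in it, the keyword list is a guess rather than a list of real Oracle messages, and the function name and sample lines are made up. Check the actual format of your CSSD/CRSD logs with the DBA before relying on it.

```python
import re
from datetime import datetime, timedelta

# Assumption: each relevant log line contains a "YYYY-MM-DD HH:MM:SS" timestamp.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

# Assumption: illustrative keywords only, not a list of actual Oracle messages.
KEYWORDS = ("evict", "heartbeat", "timeout")


def lines_before_reboot(log_lines, reboot_time, window_minutes=15,
                        keywords=KEYWORDS):
    """Return (timestamp, line) pairs logged in the window before reboot_time.

    Only lines that contain a timestamp and at least one keyword
    (case-insensitive) are returned.
    """
    start = reboot_time - timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # lines without a timestamp are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        in_window = start <= stamp <= reboot_time
        if in_window and any(k in line.lower() for k in keywords):
            hits.append((stamp, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines and a made-up reboot time, for demonstration only.
    sample = [
        "2013-01-02 05:40:00 node2 heartbeat ok",
        "2013-01-02 05:58:30 WARNING: heartbeat missed",
        "2013-01-02 05:59:10 backup started",
    ]
    for stamp, text in lines_before_reboot(sample, datetime(2013, 1, 2, 6, 0, 0)):
        print(stamp, text)
```

The reboot times themselves would come from the OS side (for example the output of "last reboot" or the error report), and the lines would be read from whichever log files the DBA points you to.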
