Wednesday, July 31, 2013

IBM P7 Strange Behaviour

We have a P7 frame with 4 LPARs that are used as TSM storage agents, where snapshots of our SAP DBs are mounted for backup. They always had great performance until one LPAR had a bad HBA that phoned home and was replaced. After the replacement, backup throughput dropped dramatically, from 800MB/s to 150MB/s, and overall performance of the server would drop drastically. When the DB requiring backup is over 25TB that is a huge hit, and we could not find the root cause.

At first IBM said our Hitachi disk was the problem. We eliminated that right away. We then replaced the new HBA, checked our fiber, and checked the GBIC, but nothing fixed the situation. During the first week I asked the IBM service technician if we could possibly have a bad drawer or slot and he emphatically said "No! If you did you would have errors all over the place." So we checked firmware, moved cards within the frame (again), and double-checked the fiber. By now we were going into the third week. I kept asking if something could be wrong with the drawer/slots and I kept getting the same answer. I suggested it because of previous experience: I have seen hardware go bad without totally going "out".

After exhausting everything other than replacing the slots, IBM finally replaced the slots. Voilà! Backup speeds went back to normal and the system degradation during the backup disappeared. So the slots/drawer were the issue. No errors relating to a slot/drawer hardware issue ever occurred, but something caused the slots to degrade performance.

It took almost a month to resolve the issue. I wouldn't say that IBM support was very thorough, and at times they tried to push the problem off onto other vendors (i.e. Hitachi). I can only suggest that in the future you trust your instincts and push the CEs to follow every avenue. My headache is over, but now the RCA begins.
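
To put the throughput drop in perspective, here is a rough back-of-the-envelope sketch of what the two rates mean for the backup window. It assumes decimal units (1 TB = 1,000,000 MB), a sustained rate for the whole run, and exactly 25TB (our DB is larger, so the real numbers are worse). The `backup_hours` helper is just an illustrative name, not part of any TSM tooling.

```python
def backup_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to stream size_tb terabytes at rate_mb_s MB/s.

    Assumes decimal units (1 TB = 1,000,000 MB) and a constant rate.
    """
    size_mb = size_tb * 1_000_000
    seconds = size_mb / rate_mb_s
    return seconds / 3600


# Compare the healthy rate with the degraded rate for a 25TB database.
for rate in (800, 150):
    print(f"{rate} MB/s: {backup_hours(25, rate):.1f} hours")
```

Under those assumptions the backup goes from under 9 hours to roughly 46 hours, which is why a month of this was so painful.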