Tuesday, February 16, 2010

Scheduler problems and server map

I don't know if you remember my previous post about the scheduler server not starting, but I am happy to report that I found the solution to that problem. For those of you too lazy to click the above link, here is the short version: After moving the scheduler to the production server, it would not start.

I gathered logs and dozens of screenshots to share with my coworkers and Oracle. Nobody could figure out what the problem was. Everything looked as if it were set up correctly, so everyone was scratching their heads.

The call with Oracle was a travesty... I opened it almost two weeks ago and the tech in charge of the case would not respond. I would go days without any contact. Finally he started asking me for more screenshots, which I provided last Thursday. This morning, after hearing nothing, I called Oracle and escalated the issue. I then talked to a senior engineer who did a web conference with me to check out the system.

During this web conference, I saw an SQL statement in the scheduler kernel debug log that was looking for an entry for the enterprise server in SVM900.F98611. Whoa! I never thought to look in there! Why? I don't know. At any rate, I ran that SQL directly on the database and it returned zero records. The server map data source for the production enterprise server was missing.
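For anyone who wants to run the same kind of check, here is a rough sketch of the idea in Python. It is not the statement from my log: it uses an in-memory SQLite database as a stand-in for the real one, the column names (OMDATP, OMSRVR) and the server names are assumptions for illustration, and the helper function is hypothetical. Copy the real WHERE clause from your own scheduler kernel debug log.

```python
import sqlite3


def count_server_map_entries(conn, server_name):
    """Count rows in SVM900.F98611 that reference the given server.

    The OMSRVR column name is an assumption for this sketch; use the
    column(s) that appear in the SQL from your own debug log.
    """
    cur = conn.execute(
        "SELECT COUNT(*) FROM SVM900.F98611 WHERE OMSRVR = ?",
        (server_name,),
    )
    return cur.fetchone()[0]


def demo_connection():
    """Build a throwaway stand-in database with one development entry."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the SVM900 qualifier resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS SVM900")
    conn.execute("CREATE TABLE SVM900.F98611 (OMDATP TEXT, OMSRVR TEXT)")
    # Hypothetical row: only the development server has an entry.
    conn.execute("INSERT INTO SVM900.F98611 VALUES ('DEVSRV', 'DEVSRV')")
    return conn


conn = demo_connection()
for server in ("DEVSRV", "PRODSRV"):
    found = count_server_map_entries(conn, server)
    status = "ok" if found else "MISSING from server map"
    print(f"{server}: {found} row(s) - {status}")
```

A count of zero for the production server is exactly the symptom I hit: the scheduler kernel looks for that row at startup and finds nothing.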

I recreated the entry in the table manually and restarted the scheduler - it worked! I tested a few jobs and it looked good. I thanked the Oracle guys for their help (not that there was much of it) and let the client know.

My question is: Why was that entry missing? How did we get all the way to a month after go-live before noticing we were missing that data source? When did it go missing? Who knows. I care, but only a little bit now that it's working.

This issue once again shows me that I need to look more closely at the logs, and especially at the debug log. Those logs are packed with information and it's sometimes hard to find what you need, but they almost always have the answers.

Bottom line is: I didn't troubleshoot this well enough. Once I saw the debug log for the scheduler kernel I figured it out right away. Until then I hadn't run the debug logs, and nobody else who looked at the issue had asked for them.

I can take some solace in the fact that I was not the only one who didn't think about the debug log, but I was the person who was supposed to figure it out. I'm also the one who installed their system, so it's doubly on me. I just wish I knew how it happened.

I think that on these difficult issues I should go through and verify the basics every time - path codes, environments, data sources, OCM. It seems like a lot of problems come down to a mistake in one of those, especially with new installations and upgrades.
