Wednesday, July 29, 2015

I was just doing an environment refresh and the database crashed!


Business as usual
At the end of the day, you get asked to refresh PY business data from production. No problem, you think. You'll kick it off, let it run, then the next morning it'll be good to go. After all, nobody uses the non-prod environments after hours so it's not going to matter if they get a bit wonky while the refresh is in progress.

When things hit fans
Around 8pm you get a call: JDE is down! You get off the couch and log in to the system. Sure enough, you try to run something inside E1 and it fails with a database connection error.

If you were to look in the server's jde.log files, you'd see this:

3956/6844 MAIN_THREAD                            Fri Jun 05 12:17:20.351000            Jdb_ctl.c4091
            Starting OneWorld

3956/6844 MAIN_THREAD                            Fri Jun 05 12:18:04.439000            dbinitcn.c899
            OCI0000065 - Unable to create user session to database server

3956/6844 MAIN_THREAD                            Fri Jun 05 12:18:04.439001            dbinitcn.c904
            OCI0000141 - Error - ORA-00257: archiver error. Connect internal only, until freed.
3956/6844 MAIN_THREAD                            Fri Jun 05 12:18:04.439002            dbinitcn.c520
            OCI0000367 - Unable to connect to Oracle ORA-00257: archiver error. Connect internal only, until freed.
3956/6844 MAIN_THREAD                            Fri Jun 05 12:18:04.439003            Jdb_drvm.c798
            JDB9900164 - Failed to connect to ovsorcl


Maybe you're an experienced or brave CNC who knows how to log in to the database via SQL*Plus. You'd be greeted by the same message there: ORA-00257: archiver error. Connect internal only, until freed.

The message E1 gives you is the most descriptive one you'll get, but it's not very meaningful unless you're an experienced DBA. What can add to the confusion is when the server appears to have plenty of free disk space available.

Everyone is yelling at me to fix it
First, just stop everything you and everyone else are doing. You're probably hearing people throw out ideas that range from good to crazy - but they're all wrong. Most people's first reaction is to restart E1, restart the database, or reboot all the servers. When that doesn't work, you'll start deleting files on the database server, and you might even figure out where the archive logs live and delete those. Unfortunately, none of those actions are going to fix the problem.

I'm so confused
To understand what's going on and how to fix it, you have to know a little bit about how the Oracle database works. Whenever you do something inside E1 via an application or a UBE, it sends SQL to the database - requesting data, updating rows, or inserting new records. Every change E1 makes to the database is recorded in the redo log files.

Those redo logs capture every change made to the database and are used to recover it after a failure. There are only a handful of redo logs and they're reused in a circle, so before one can be overwritten its contents are copied out to a separate file called an archive log. Archive logs are created continuously and can be backed up, then later used with your backups to recover the database to a point in time.
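
If you're curious, you can see this machinery from SQL*Plus. This is just a quick sketch using the standard V$DATABASE and V$LOG views (run as SYSDBA); when the archiver is stuck you'll typically find redo logs sitting with ARCHIVED = NO:

SQL> select log_mode from v$database;
SQL> select group#, status, archived from v$log;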

So what's the problem?
Obviously if you run out of disk space you're going to have issues with any piece of software, and a database is no different. If Oracle can't write a new archive log, it refuses to overwrite the redo log that's still waiting to be archived - otherwise it would lose the ability to recover from a failure. Once every redo log is stuck waiting to be archived, no more changes can be recorded. When archiving stops, the database stops.

But I have free disk space!
This is entirely possible. The space Oracle will use for archive logs is capped from within Oracle by a parameter, not by the size of the disk. You can have 200 GB of disk space available, but if you've only set aside 5 GB for archive logging inside Oracle, you'll hit that cap without ever getting close to filling the disk.
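
You can check both numbers in one query. This is a sketch against the standard V$RECOVERY_FILE_DEST view - it shows the cap Oracle is enforcing, how much of it is used, and how much of that is reclaimable:

SQL> select name, space_limit/1024/1024/1024 as limit_gb,
            space_used/1024/1024/1024 as used_gb,
            space_reclaimable/1024/1024/1024 as reclaimable_gb
       from v$recovery_file_dest;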

Okay, enough talk! How do I fix this?
The easiest, fastest way to get your database going again is to expand the space Oracle is allowed to use for archive logs - IF you have extra disk space. First, you need to find out where your archive logs are going. A typical default destination is /u01/app/oracle/fast_recovery_area/ORCL/archivelog. Look in there (or somewhere similar) and if you see a bunch of directories named in the style of YYYY_MM_DD, you've struck gold. If you can't find them, the commands below will tell you where Oracle is writing them.

If SQL*Plus isn't letting you connect due to the archiver error, you can connect like so:

sqlplus /nolog 
SQL> conn sys/ as sysdba 

Once connected you can run the following SQL statement to find out where they're going:

SQL> show parameter db_recovery_file_dest
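
The output will look something like this (the value, of course, will be whatever your own destination is):

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest                string      /u01/app/oracle/fast_recovery_area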

Wherever the log files are going, find out how much room is left on the disk. Let's say that on your /u01 mount you have 50GB free and you want to expand the archive logs into that.
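
On Linux, checking the free space is just an ordinary df against the mount point - the path here is the example destination from above, so use your own:

df -h /u01

The next thing you need to do is see how much space Oracle thinks it has for the logs.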

SQL> show parameter db_recovery_file_dest_size

We'll pretend this returns a value of 20GB. That means we can increase the allocation to 70GB, which would use all of the remaining free space on the disk. Now all you have to do is tell Oracle to use it:

SQL> alter system set db_recovery_file_dest_size = 70G scope=both;

As soon as you execute this statement your database will start working again. No need for restarts or reboots - the DB just goes about its business.
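
If you want to confirm the change stuck and watch the numbers move, re-check the parameter and the space usage (same standard view as before):

SQL> show parameter db_recovery_file_dest_size
SQL> select space_limit/1024/1024/1024 as limit_gb,
            space_used/1024/1024/1024 as used_gb
       from v$recovery_file_dest;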

My archive log destination is as big as the disk
Well, now you're in a real pickle. Still, there is a simple solution: back up the archive logs and have RMAN delete them once they're safely inside the backup. I'm not going to go into every possible way to back up a database, but here is a simple command that does exactly that:

RMAN> BACKUP ARCHIVELOG ALL DELETE ALL INPUT;

When it's finished you'll have a backup file and all the archive logs will be gone. You'll have some space freed up, but it's a good idea to move that backup file off the server to free up even more.
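
If you have another mount with free space on it, you can point the backup there directly instead of landing it next to the archive logs. The /backup path below is just a made-up example - substitute whatever location you actually have room on:

RMAN> BACKUP ARCHIVELOG ALL FORMAT '/backup/arch_%U.bkp' DELETE ALL INPUT;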

Backup takes too long/goes to same disk as archive logs/I have nowhere to put it.
Back in the pickle again I see. Your next best option is pretty simple but also the most dangerous: Delete the archive logs. This frees up space but you negatively affect (or completely eliminate) your ability to restore your database to any point after the last backup. However, on a non-production system or even during a production emergency this might be the best option.

The first thing to do is find your archive log destination. Use the same instructions from above (show parameter db_recovery_file_dest), then delete the directories and files inside it. To finish up, you have to tell RMAN the files are gone:

rman target /
RMAN> crosscheck archivelog all;
RMAN> delete noprompt expired archivelog all;
RMAN> exit

That connects you to RMAN, compares the list of archive logs Oracle knows about against what actually exists on disk, marks the missing ones as expired, and then deletes those expired entries from Oracle's records. As soon as you finish the RMAN commands the database will resume on its own - there is nothing more you should have to do for operations to continue. Now you have free space on your disk and within Oracle.

The horrible, horrible catch when deleting archive logs
You've broken your backups. If the database were to fail due to data corruption, a power failure, a tornado, or a giant monster attack, you'd only be able to restore to your last backup - and you're going to have some very unhappy executives. You need to immediately take a full backup of your database to un-break it.
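
Every site backs up differently, so check your own backup tooling first, but as a bare-bones sketch a full RMAN backup that also sweeps up the archive logs looks something like this:

rman target /
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;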

How do we stop this from happening again?
  1. Make sure your database has enough space. Not just for archive logs, but also for your data files and everything else. Figure out how much everything grows on a day-to-day basis, then bust out your calculator and find out where it's going over the next 6-12 months.
  2. Be prepared for large data load operations. If you're space constrained you probably need to take precautions before doing a refresh. Your users need to let you know before they do any large data inserts.
  3. Set up database and server monitors to warn you well in advance. If you're alerted to space issues early enough, you can fix everything before anybody even knows there's a problem (see the query sketched below).
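
For that last item, even a simple scheduled query against the standard V$RECOVERY_AREA_USAGE view, with an alert when the percentage climbs too high, would have flagged this long before the database halted. A sketch, not a full monitoring solution:

SQL> select file_type, percent_space_used, percent_space_reclaimable
       from v$recovery_area_usage
      order by percent_space_used desc;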

Reality check
I kind of glossed over some details and skipped some advanced techniques so I could keep things relatively simple while still providing solutions almost anyone can use. Having a database halt is a very stressful situation, and it can be difficult to troubleshoot, especially if you're not a DBA. The actions I talked about in this post should only be followed by a non-DBA in an extreme emergency. There is too much that could go wrong, and you don't want to make a bad situation worse.

Proceed with caution.
