Tuesday, 31 December 2013

The dreaded Production Outage


Outage. The word that sends shivers down our spines whenever my teammates or I hear it. It is one of the most feared words in the software industry, and it can wipe out all of your joy and confidence in a heartbeat. Just when you think you are doing well and gaining the confidence and trust of your client, suddenly, out of nowhere, on an otherwise pleasant morning, you will see an email from your client (with a little red exclamation mark) saying that they are facing an outage in the production environment. Generally, it will be followed by a couple of escalation emails from your manager as well as your lead. Bang! All your good deeds are as good as nothing now. The misery won’t stop there; you will immediately be pulled into an emergency meeting. In it, you will get your ass whipped and will be asked to find the root cause of the outage and send a lengthy, Shakespeare-style email (with all the decorations and formatting) to the client. And all you will have for the analysis (i.e. data and information) will be, well, nothing! That is the subject of this blog. Let’s start with it.

There are multiple definitions of an outage. Assume yours is a web application hosted on some server. One case may be a server crash (due to a hardware or OS failure), in which requests are rejected or time out before they ever reach the application. Another may be a crash of the application hosted on the server, resulting in HTTP 5XX error responses. Another may be a database crash, which makes all operations fail with a ‘technical error’ (or similar) message. And one case (the one we will emphasize in this blog) is the application simply becoming unresponsive. This is a very mysterious case: the server seems to be working fine, the application seems to be working fine, the database seems to be working fine, but still, when someone sends a request to the server, the browser shows its ‘processing’ spinner and the request eventually times out. See? I told you it is much deeper than it seems.

We faced similar situations multiple times last month. The client kept saying that the services became unresponsive and that they had to restart the server to make them available again. For our analysis, we asked for server logs, error logs, database logs, application logs, call logs, message logs, WhatsApp logs and any other type of log that may exist on this planet. We got all the logs after some time and found nothing. Absolutely nothing! We were certain there was no hardware, server or database fault. After some googling and analysis, we concluded that there must be something wrong with the JVM. Hence, we asked them to provide a heap dump of the JVM from the production environment.

Let me explain the heap dump first. Well, not exactly the heap dump; let me start with the heap itself. The heap (or rather, heap memory) is the area the JVM uses to allocate memory for newly created objects. Say I write a line in a Java program like Date d = new Date();. This creates a new Date object, and that object is stored in the heap. It stays alive as long as we have at least one reference to it (in our case, d) in our program. When all the references are gone, the object is garbage collected by the garbage collector (more on that later). The heap mainly has the following parts (there is a small code sketch a little further down that ties these areas together):

  • Eden Space: This is the part of the heap where newly created objects are stored. The memory for every new object is initially allocated from this area. In our life, Eden space refers to school.
  • Survivor Space: When the JVM ‘thinks’ that the Eden space is almost full, it invokes the garbage collector on it. The GC cleans up all the objects which no longer have any valid reference in the executing program(s). The objects which survive this collection are moved to the Survivor space (a reward for their so-called achievement!). In our life, Survivor space refers to college. Ironically, our days of survival start from college too!
              Both Eden and Survivor space are parts of the Young Generation (young gen) space.
  • Tenured Generation: This area contains objects which have been alive for quite a long time. Examples of such objects are static objects, instance variables of a servlet class, etc. It is also known as the Old Gen space. In our life, this refers to the place where you are probably sitting and reading this blog, i.e. the office.
That was all about the heap. Before we get to the heap dump, let me explain one more term:
  • Permanent Generation: When we create an instance of a class, or call any static method/member of it, the class itself gets loaded into memory. In our previous example, the Date class will be loaded into memory before the Date object is created. The Permanent Generation is the space where all the loaded classes are stored. It is not a part of the heap memory. I couldn't find any mapping for this space in our life. Suggestions are welcome.
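To tie these areas together, here is a tiny (and admittedly simplified) Java sketch. The class and field names are made up for illustration; the comments describe where each piece would typically end up in a long-running application.

    import java.util.Date;

    public class MemoryAreasDemo {                       // the class metadata itself lives in the Permanent Generation
        private static final Date STARTED = new Date();  // referenced for the life of the class; in a long-running
                                                         // application such objects typically end up in the Tenured (Old Gen) space

        public static void main(String[] args) {
            Date now = new Date();                       // a brand new object, allocated in the Eden space
            System.out.println("Started at " + STARTED + ", now it is " + now);
            // Once main() returns, 'now' has no live reference and becomes eligible for garbage collection.
            // Objects that survive a few young gen collections are promoted via the Survivor space.
        }
    }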
Now, let’s get to the heap dump. A heap dump is a binary or textual representation of the Java heap, written to a file. It is like the black box of an aircraft. A heap dump contains all the information about the objects present in the heap, their reference trees and the memory occupied by them. There are many tools which can read such heap dumps and produce object charts with all the statistics. The MAT plugin for Eclipse is a popular, free tool; YourKit is a licensed alternative. JVisualVM (bundled with Java 6) is also a very useful tool for heap dump analysis.
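For reference, this is how one would typically capture a heap dump from a running HotSpot JVM of that era; the process id 12345, the file paths and the jar name are placeholders, and the client may well have used a different mechanism.

    # take a heap dump of a running JVM on demand
    jmap -dump:live,format=b,file=/tmp/heapdump.hprof 12345

    # or let the JVM write one automatically the moment it runs out of memory
    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps -jar myapp.jar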

Back to our story: when we got the heap dumps, we started analyzing them. During our analysis, we saw a couple of large StringBuffer objects in them (around 3 GB each), occupying 6 GB between them. There were also many other heavy objects which ate up another 2 GB. In the production environment, the JVM heap size was configured to 8 GB. All these large objects had eaten up the entire heap, leaving no room for the creation of any new object. Hence, the whole system went into a coma and stopped responding to requests. The mystery was finally solved!

This was not the end though; identifying the problem was just a small step. We had to find the root cause of the creation of those objects and, eventually, provide a patch which would prevent the situation from recurring. We analyzed it, and it turned out to be one specific case of looping code that kept appending characters to a StringBuffer object (more on this later). But the presence of such large objects in the heap dump raised a very interesting question: Java has a much-applauded garbage collection mechanism. What the heck is it, and why was it not able to clear these large objects?

The main task of the garbage collector is to clean up objects which are no longer needed by the program. The criteria for deciding which objects are ‘not needed’ are quite involved, and there are plenty of garbage collection algorithms; a detailed explanation of them is provided here. Instead of going deeper into those algorithms, we will focus on how the garbage collector works with the JVM. Broadly, there are two types of garbage collection:

  1. Concurrent GC: The concurrent garbage collector uses background threads which run concurrently with the Java application. Its main motive is to prevent the tenured generation space from filling up. When the concurrent GC is unable to reclaim memory, a full GC is triggered.
  2. Parallel GC or Full GC: The parallel garbage collector belongs to the family of stop-the-world collectors. It doesn’t start until the old generation space runs out of memory, but as soon as that happens, a full GC starts, pausing all application threads and leaving the application unresponsive.
In the production environment, GC is configured to kick in when the old gen space is 70% used. When the old generation keeps filling up regardless, the full GC kicks in, the CPU spikes to the 100% mark, and all requests grind to a halt.
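We do not know every flag on that box, but assuming the CMS (concurrent) collector of that era was in use, a 70% old gen threshold is typically expressed along these lines; the heap size matches the 8 GB mentioned earlier and the jar name is a placeholder.

    # illustrative HotSpot options (Java 6/7 era), assuming the CMS collector
    java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar myapp.jar

CMSInitiatingOccupancyFraction=70 starts the concurrent cycle at roughly 70% old gen occupancy, and the GC logging flags give you something concrete to look at the next time the CPU pegs at 100%.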

Back to our StringBuffer problem: we were storing a StringBuffer object in the user’s session. We performed some operations on it which would increase or decrease its size. However, in one specific scenario, these objects (one per user session) grew indefinitely. Because they were in the session, they were always reachable, so the GC could not collect them as garbage (and obviously it shouldn’t!). This growth eventually triggered full GCs and brought the whole application down. What a misery!
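To make the pattern concrete, here is a simplified reconstruction of the kind of code that can cause this. The actual production code is different and not shown here; the class, attribute and method names below are made up for illustration.

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class AuditTrailHelper {

        // Called on every request. The buffer lives in the session, so it is always
        // reachable and the GC can never reclaim it.
        public void appendAuditEntry(HttpServletRequest request, String entry) {
            HttpSession session = request.getSession();
            StringBuffer trail = (StringBuffer) session.getAttribute("auditTrail");
            if (trail == null) {
                trail = new StringBuffer();
                session.setAttribute("auditTrail", trail);
            }
            // In the buggy scenario, nothing ever trims or removes this buffer,
            // so over thousands of requests it keeps growing until the heap is exhausted.
            trail.append(entry).append('\n');
        }
    }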

  Since it is very difficult to trace and fix such problems after the fact, here are some guidelines one should keep in mind while coding to avoid such disasters.
  • Never store heavy objects in the session; store them in the request scope instead.
  • While writing any recursive or looping method/logic, be extra careful and test all the scenarios. In most cases, the method will work in 99 out of 100 scenarios, but the remaining one will come back to bite you once your application is deployed to production (a small defensive sketch follows this list).
  • Be very careful while using third-party jars. Use object caching/pooling judiciously. Gather enough information about the class loading and memory consumption of a third-party library before introducing it into the application.
  • Performance-test each and every feature of the application, down to the smallest one. There are plenty of tools available for performance testing (BadBoy and JMeter are free; LoadRunner is a commercial option).
  • It is always better to have five NullPointerExceptions than one OutOfMemoryError in your code.
  • There are plenty more such points; you can find them on Google. It seems tedious to list them all here.
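As a small illustration of the “be extra careful in loops” point, one defensive habit is to put a hard cap on anything that grows inside a loop. The class below is only a sketch; the limit and the exception type are arbitrary choices for illustration.

    public final class BoundedBuffer {

        private static final int MAX_CHARS = 1000000; // arbitrary cap; tune it to what the feature really needs

        private final StringBuilder buffer = new StringBuilder();

        public void append(String chunk) {
            if (buffer.length() + chunk.length() > MAX_CHARS) {
                // Failing fast costs one request an exception,
                // instead of costing the whole JVM an OutOfMemoryError.
                throw new IllegalStateException("Buffer exceeded " + MAX_CHARS + " characters");
            }
            buffer.append(chunk);
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

Failing one request is ugly, but it is far cheaper than bringing the whole JVM down, which is really what the “five NullPointerExceptions versus one OutOfMemoryError” point above is about.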
Well, back to the original story (for the 135762432nd time): we managed to trace and eventually fix that problem. The system then started working like a charm! The joy was back. The confidence was back. The trust was back. Things started to run very smoothly. However, the god of outages, sitting in some corner of the world, didn’t like that. So he sent another one of his troops to say hi to us a month later.

And here we are, battling another outage. This time, even the heap dump has refused to provide enough information. Preliminary analysis says that the permanent generation space (does it ring any bells?) is getting filled up with classes that get loaded multiple times. This seems to be a whole new world. Still, we are ready for the battle, with our weapons, ready to fire in the dark :)
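We have not finalised the approach yet, but for a PermGen problem of this kind, these are the sort of standard HotSpot knobs and commands one would normally reach for first. The pid, size and jar name are placeholders; treat this as a sketch of the plan rather than what we have actually run.

    # log every class load/unload so we can see what is being loaded again and again
    java -verbose:class -XX:MaxPermSize=256m -jar myapp.jar

    # per-classloader statistics of the permanent generation (JDK 6/7 HotSpot)
    jmap -permstat 12345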

~ Au revoir!
