Outage.
The word sends shivers down our spines whenever my teammates or I hear it. It is one of the most feared words in the software industry. It can wipe out all of your joy and confidence in a heartbeat. Just when you think you are doing well and gaining the confidence and trust of your client, suddenly, out of nowhere, on a pleasant morning, you will see an email from your client (with a little red exclamation mark) saying that they are facing an outage in the production environment. Generally, it will be followed by a couple of escalation emails from your manager as well as your leader. Bang! All your good deeds count for nothing now. The misery won't stop there; you will immediately be called into an emergency meeting. There, you will get your ass whipped and be asked to find the root cause of the outage and provide a lengthy, Shakespeare-style email (with all the decorations and formatting) to the client. And all you will have for analysis (i.e. data and information) will be, well, nothing! This is the subject of this blog. Let's start.
There are multiple definitions of an outage. Assume yours is a web application hosted on some server. One instance may refer to the server crashing (due to a hardware or software failure), which would make all requests fail with connection errors (or similar). Another may refer to the application hosted on the server crashing, resulting in HTTP 5XX error messages. Another may refer to a database crash, which makes all operations fail with a 'technical error' (or similar) message. And one instance (the one we will emphasize in this blog) refers to the application becoming unresponsive. This is the most mysterious case: the server seems to be working fine, the application seems to be working fine, the database seems to be working fine, but still, when someone sends a request to the server, the browser displays a 'processing' symbol and eventually times out. See? I told you it is much deeper than it seems.
We faced similar situations multiple times last month. The client kept saying that the services became unresponsive and they had to restart the server to make them available again. For our analysis, we asked for server logs, error logs, database logs, application logs, call logs, message logs, WhatsApp logs, and any other type of logs that may exist on this planet. We got all the logs after some time and found nothing. Absolutely nothing! We were certain that there was no hardware, server, or database fault. After some googling and analysis, we concluded that something must be wrong with the JVM. Hence, we asked them to provide a heap dump of the JVM from the production environment.
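In case you ever have to do the same: on a HotSpot JVM, a heap dump can be captured with the standard jmap tool, or generated automatically on an OutOfMemoryError. A quick sketch (the pid, file names, and app.jar below are placeholders, not our actual setup):

    # Find the JVM's process id, then dump the live objects in the heap to a file
    jps -l
    jmap -dump:live,format=b,file=heapdump.hprof <pid>

    # Or let the JVM write a dump on its own when it runs out of memory
    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -jar app.jar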
Let me explain the heap dump first. Well, not exactly the heap dump; let me start with the heap itself. The heap (or, I should say, heap memory) is the area used by the JVM to allocate memory for newly created objects. Say I have written a line in a Java program like Date d = new Date();. This creates a new Date object, and that object is stored in heap memory. It stays alive as long as we have at least one reference to it (in our case, d) in our program. When all the references are destroyed, the object is garbage collected by the garbage collector (more on that later). The heap is mainly divided into the following areas (a small code sketch follows this list):
- Eden Space: This is the part of the heap where newly created objects are stored. The memory for all new objects is initially allocated from this part. In our life, Eden space refers to school.
- Survivor Space: When the JVM 'thinks' that the young generation space is almost full, it invokes the garbage collector on it. GC cleans up all the objects which no longer have any valid reference in the running program. The objects which survive this garbage collection are moved to the Survivor space (a reward for their so-called achievement!). In our life, Survivor space refers to college. Ironically, our days of survival start in college too!
- Tenured Generation: This area contains objects which have been alive for quite a long time. Examples of such objects are static objects, instance variables of a servlet class, etc. It is also known as the Old Gen space. In our life, this refers to the place where you are probably sitting and reading this blog, i.e. the office.
- Permanent Generation: When we create an instance of a class, or, let's say, when we call any static method/member of a class, the respective class gets loaded into memory. In our previous example, the Date class is loaded into memory before the Date object is created. Permanent Generation is the space where all the loaded classes are stored. It is not a part of heap memory. I couldn't find any mapping for this space in our life. Suggestions are welcome.
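To make the Date example concrete, here is a minimal, self-contained sketch (the class and variable names are mine, purely for illustration) of an object living and dying on the heap:

    import java.util.Date;

    public class HeapDemo {
        public static void main(String[] args) {
            Date d = new Date();   // the new object is allocated in Eden space
            System.out.println(d);

            d = null;              // last reference gone: the object is now
                                   // eligible for garbage collection; survivors
                                   // of minor GCs move to Survivor space, and
                                   // long-lived ones to the Tenured generation

            Runtime rt = Runtime.getRuntime();
            System.out.println("Heap in use (bytes): "
                    + (rt.totalMemory() - rt.freeMemory()));
        }
    }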
Back to our story: when we got the heap dumps, we started analyzing them. During our analysis, we saw a couple of large StringBuffer objects in them (around 3 GB each), together occupying 6 GB. There were also many other heavy objects which ate up another 2 GB. In the production environment, the JVM heap size was configured to 8 GB. These large objects had eaten up all the memory, leaving no room for the creation of any new object. Hence, the whole system went into a coma and stopped responding to requests. The mystery was finally solved!
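For what it's worth, even the humble jhat tool that shipped with the JDK back then can surface such monsters, though Eclipse MAT copes better with multi-gigabyte dumps:

    # Parse the dump, then browse the object histogram at http://localhost:7000
    jhat heapdump.hprof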
This was not the end, though; identifying the problem was just a small step. We had to find the root cause of the creation of that object and, eventually, provide a patch which would prevent such a situation from occurring. We analyzed it, and it turned out to be one specific case of a loop which kept appending characters to a StringBuffer object (more on this later). But the presence of large objects in the heap dump raised a very interesting question: Java has a much-applauded garbage collection algorithm. What the heck is it, and why was it not able to clear these large objects?
The main task of the garbage collector is to clean up the objects which are not needed by the program. The criteria for deciding which objects are 'not needed' are very complex, and there are plenty of algorithms for garbage collection. A detailed explanation of them is provided here. Instead of going into more detail on those algorithms, we will focus on how the garbage collector works with the JVM. There are two types of garbage collection (the JVM flags for picking a collector are sketched after this list):
- Concurrent GC: The concurrent garbage collector uses a thread which runs concurrently with the Java application. Its main motive is to prevent the tenured generation space from filling up. When the concurrent GC is unable to reclaim enough memory, a full GC is triggered.
- Parallel GC or Full GC: The parallel garbage collector belongs to the family of stop-the-world collectors. It won't start until the old generation space runs out of memory, but as soon as that happens, the full GC pauses all application threads, making the application unresponsive.
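For reference, on HotSpot JVMs of that era (the ones which still had a Permanent Generation), the collector and basic GC logging were typically chosen with flags like these; the heap sizes here are illustrative, not our production values:

    # Concurrent (CMS) collector, with GC details logged to stdout
    java -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -jar app.jar

    # Throughput (parallel) collector
    java -Xms8g -Xmx8g -XX:+UseParallelGC -jar app.jar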
Back to our StringBuffer problem: we were storing a StringBuffer object in the user's session and performing operations on it which would increase or decrease its size. However, in one specific scenario, such objects (in users' sessions) would grow indefinitely. As they were in the session, they were live, and the GC could not reclaim them as garbage (obviously it shouldn't!). This growth led to the triggering of full GCs and, eventually, brought down the whole application. What a misery! A minimal sketch of this anti-pattern follows.
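Here is roughly what such code looks like (the servlet, attribute name, and appended value are my illustrations, not our actual code):

    import java.io.IOException;
    import javax.servlet.http.*;

    public class ReportServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            HttpSession session = req.getSession();
            StringBuffer buf = (StringBuffer) session.getAttribute("report");
            if (buf == null) {
                buf = new StringBuffer();
                session.setAttribute("report", buf); // lives as long as the session
            }
            // If this append runs on every request (or inside a loop whose exit
            // condition is never met), the buffer grows without bound. It stays
            // reachable from the session, so the GC can never reclaim it.
            buf.append(req.getRequestURI());
            resp.getWriter().println("buffer size: " + buf.length());
        }
    }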
As it is very difficult to trace and fix such problems, here are some guidelines which one should keep in mind while coding to avoid such disasters:
- Never store heavy objects in the session; store them in the request instead (see the sketch after this list).
- While writing any recursive or looping method/logic, be extra careful and test all the scenarios. In most cases, the method will work in 99 out of 100 scenarios, but the remaining one will come back to bite you in the backside once your application is deployed to production.
- Be very careful while using third-party jars. Use object caching/object pooling sparingly, and gather enough information about the class loading and memory consumption of a third-party library before using it in the application.
- Execute performance tests on each and every smallest feature of the application. There are plenty of tools (free and commercial) available for this (like Badboy, JMeter, LoadRunner, etc.).
- It is always better to have 5 null pointer exceptions than 1 out-of-memory error in your code.
- There are plenty more such points; a quick Google search will turn them up, and it seems tedious to list them all here.
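To illustrate the first guideline, here is the same sketch with the buffer scoped to the request instead of the session (again, illustrative, not our actual patch):

    import java.io.IOException;
    import javax.servlet.http.*;

    public class ReportServletFixed extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Request-scoped buffer: it becomes unreachable (and hence
            // collectable) as soon as this request completes, so it can
            // never accumulate across requests.
            StringBuffer buf = new StringBuffer();
            buf.append(req.getRequestURI());
            resp.getWriter().println("buffer size: " + buf.length());
        }
    }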
Well, back to the original story (for the 135762432nd time): we managed to trace and eventually fix that problem. The system then started working like a charm! The joy was back. The confidence was back. The trust was back. Things started to work very smoothly. However, the god of outages, sitting in a certain corner of the world, didn't like that. Hence, he sent another one from his troops to say hi to us a month later.
And here we are, battling another outage. This time, even the heap dump has refused to provide enough information. Preliminary analysis says that the permanent generation space (does it ring any bells?) gets filled up with classes which get loaded multiple times. This seems to be a whole new world. However, we are ready for the battle, with our weapons, to fire in the dark :)
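If you are fighting a similar PermGen battle on an older HotSpot JVM, these flags (sizes illustrative) are a common starting point for watching class loading and buying some headroom:

    # Log every class load/unload, and raise the PermGen ceiling
    java -verbose:class -XX:MaxPermSize=256m -jar app.jar

    # With CMS, also allow classes to be unloaded during concurrent cycles
    java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -jar app.jar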
~ Au
revoir!