Inside the JVM - Profiling & Diagnosing Memory Leaks in Java

August 2025 | Tags: JVM, Java, OOM, Memory Leak

A few months back, while working on a backend service that used Drools for rule evaluation, I noticed something odd - JVM heap usage kept climbing, GC metrics were off, and eventually the service crashed with an OutOfMemoryError. Profiling the JVM with VisualVM and Java Flight Recorder, then analyzing a heap dump, revealed a memory leak tied to rule engine objects.

The root cause? A Drools KieSession that wasn’t closed properly. That one oversight led to retained references, a bloated heap, and a production outage.

In this post, I’ll walk through how I diagnosed the issue, the tools I used, and the fix that restored heap sanity. If you’re looking for a JVM memory leak debugging walkthrough, this one’s for you.

When the Heap Started Groaning

Before diving into heap dumps and leak analysis, it’s worth looking at how the problem surfaced - through JVM metrics and container behavior in production.

Monitoring Setup: JVM in Fargate

This Java service was deployed on AWS Fargate, running containerized workloads with no direct shell access. To monitor JVM health:

  1. Datadog was used to track JVM metrics like heap usage, GC activity, thread counts, and latency.

    Enable JMXFetch and use the Datadog Java Agent to collect runtime metrics.

  2. Heap dumps were configured and written to Amazon S3 for offline analysis.

    Enable the -XX:+HeapDumpOnOutOfMemoryError flag and modify the Dockerfile to use the AWS CLI to write the .hprof file to an S3 bucket (a small programmatic sketch follows this list).
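
As a side note, a heap dump can also be triggered on demand from inside the application via the HotSpotDiagnosticMXBean, which is handy when you don’t want to wait for an OOM. This wasn’t part of the original setup; it’s a minimal sketch, and the output path is just a placeholder:

	import java.io.IOException;
	import java.lang.management.ManagementFactory;
	import com.sun.management.HotSpotDiagnosticMXBean;

	public class HeapDumper {
		// Writes a heap dump to the given file (conventionally ending in .hprof).
		public static void dump(String path) throws IOException {
			HotSpotDiagnosticMXBean bean =
				ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
			bean.dumpHeap(path, true); // true = include only live (reachable) objects
		}

		public static void main(String[] args) throws IOException {
			dump("/tmp/manual-dump.hprof"); // placeholder path
		}
	}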


The first sign of trouble appeared in the dashboards:

  1. Heap Usage - Gradually increased over time, even during low traffic.
  2. GC Frequency - Minor GCs became frequent, followed by full GCs that reclaimed very little memory and a spike in GC pause time.
  3. Full GCs - Became increasingly frequent, a strong indicator that the JVM was struggling to reclaim space. In healthy applications, full GCs should be rare and efficient (a small sketch of reading these metrics from inside the JVM follows this list).
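
Datadog surfaced these numbers for me, but the same heap and GC stats can be read from inside the JVM with the standard management beans. A minimal sketch, useful for quick local checks rather than anything I ran in production:

	import java.lang.management.GarbageCollectorMXBean;
	import java.lang.management.ManagementFactory;
	import java.lang.management.MemoryUsage;

	public class GcStatsLogger {
		public static void logOnce() {
			// Current heap occupancy vs. the configured maximum.
			MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
			System.out.printf("heap used=%d MB, max=%d MB%n",
					heap.getUsed() >> 20, heap.getMax() >> 20);

			// Cumulative collection count and pause time per collector (minor and full).
			for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
				System.out.printf("%s: count=%d, time=%d ms%n",
						gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
			}
		}
	}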

Debugging: Tracing the Leak

Once the Fargate container crashed with an OOM, the JVM wrote the heap dump directly to Amazon S3, thanks to the -XX:+HeapDumpOnOutOfMemoryError flag and custom container logic. This gave me a snapshot of the JVM’s memory state at the moment of failure, perfect for post-mortem analysis.
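
The upload itself happened at the container level via the AWS CLI, so treat the following only as a rough Java-side sketch of the same idea using the AWS SDK v2; the bucket and key names are made up:

	import java.nio.file.Paths;
	import software.amazon.awssdk.core.sync.RequestBody;
	import software.amazon.awssdk.services.s3.S3Client;
	import software.amazon.awssdk.services.s3.model.PutObjectRequest;

	public class HeapDumpUploader {
		public static void upload(String localHprofPath) {
			// In Fargate, region and credentials come from the task role / environment.
			try (S3Client s3 = S3Client.create()) {
				s3.putObject(
						PutObjectRequest.builder()
								.bucket("my-heap-dump-bucket") // hypothetical bucket
								.key("dumps/" + System.currentTimeMillis() + ".hprof")
								.build(),
						RequestBody.fromFile(Paths.get(localHprofPath)));
			}
		}
	}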

Tools Used

  1. IntelliJ IDEA Profiler Tool Window: Loaded the .hprof file to inspect retained objects and reference chains.
  2. VisualVM and JFR: Used for live profiling on my local machine (see the JFR sketch after this list).
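
VisualVM and the IDE can start recordings for you, but JFR recordings can also be captured from code (JDK 11+), which is useful for scripted reproductions. A minimal sketch, with a hypothetical output path:

	import java.nio.file.Path;
	import java.time.Duration;
	import jdk.jfr.Configuration;
	import jdk.jfr.Recording;

	public class FlightRecorderSample {
		public static void record() throws Exception {
			// "profile" is one of the event configurations shipped with the JDK.
			Configuration config = Configuration.getConfiguration("profile");
			try (Recording recording = new Recording(config)) {
				recording.setMaxAge(Duration.ofMinutes(10)); // keep only recent events
				recording.start();
				// ... exercise the suspected code path here ...
				recording.stop();
				recording.dump(Path.of("/tmp/leak-hunt.jfr")); // hypothetical path
			}
		}
	}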


What the Heap Dump Revealed

  1. A large number of retained objects were linked to Drools internals - specifically the working memory. These were large objects too, since a lot of payment records were fetched to run the rules.
  2. The profiler flagged a KnowledgeBaseImpl and a StatefulKnowledgeSessionImpl instance with a high retained size (KieSession internals); the leaky pattern behind this is sketched below.
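
To make that concrete, this is roughly the leaky pattern the service had fallen into; the class and method names here are hypothetical, simplified from the real code:

	import org.kie.api.runtime.KieContainer;
	import org.kie.api.runtime.KieSession;

	public class PaymentRuleEvaluator {
		private final KieContainer kieContainer;

		public PaymentRuleEvaluator(KieContainer kieContainer) {
			this.kieContainer = kieContainer;
		}

		public void evaluate(Object paymentFact) {
			KieSession session = kieContainer.newKieSession();
			session.insert(paymentFact);
			session.fireAllRules();
			// Missing session.dispose(): the session's working memory keeps the inserted
			// facts and engine internals reachable, so every call leaks heap.
		}
	}

Every request through a path like this pins another session’s working memory, which is exactly the kind of growth the heap dump showed.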

The Fix

The above analysis confirmed that a KieSession was anchoring working memory and preventing garbage collection. The fix was straightforward:

	KieSession session = kieContainer.newKieSession();
	try {
		session.insert(fact);
		session.fireAllRules();
		// ...
	} finally {
		session.dispose(); // THE FIX!!!
	}

A cleaner alternative, if the KieSession in your Drools version implements AutoCloseable, is try-with-resources:

	// Using try-with-resources
	try (KieSession session = kieContainer.newKieSession()) {
		session.insert(fact);
		session.fireAllRules();
		// ...
	}

Lessons Learned

This wasn’t just a memory leak — it was a reminder that good coding practices are your first line of defence. The root cause wasn’t exotic or obscure. It was a missing close-the-resource call. One line. One habit. One oversight.

Even with solid coding habits, bugs slip through. But when they do, we’re equipped — with the right tools, clear steps, and a proven path to trace, diagnose, and fix the issue before it spirals.

Let's Connect

for a cup of coffee, challenges, or conversations that spark something new

dakshin.g [at] outlook [dot] com
www.dakshin.cc