This is a transparent write-up of the issues we ran into over the last few weeks while migrating our architecture.
The goal here is not to justify decisions or hide mistakes. It is simply to explain what changed, what went wrong, how we debugged it, and what we are doing next.
Executive Summary
We migrated from a monolithic architecture to a service-based setup using RabbitMQ. Shortly after the migration, we started seeing instability and significant slowdowns in code reviews.
The root cause was not a single bug. It was a combination of factors that interacted badly: inefficient queries, high memory pressure, and a RabbitMQ deployment that was never actually running as a real cluster. Because of that, work was distributed unevenly across workers, which made the whole system look slow and unreliable under load.
1. Previous Context
Old Architecture
We were running everything on a single EC2 instance in AWS.
That machine handled webhook ingestion, code review processing, and the API used by the web app. There was no queue and no explicit concurrency control. Once a webhook arrived, processing started immediately and ran in the background.
There were no explicit memory or CPU limits; we relied on whatever Node.js and V8 did by default.
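For illustration, here is a minimal sketch of that fire-and-forget flow, assuming an Express-style handler; runCodeReview is a hypothetical stand-in for the review pipeline.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical stand-in for the review pipeline.
async function runCodeReview(payload: unknown): Promise<void> {
  // fetch changed files, build context, call the LLM, post comments...
}

// Old model: acknowledge the webhook immediately and let the review run in
// the background, with no queue, no retries, and no concurrency cap.
app.post("/webhooks/github", (req, res) => {
  res.status(202).send("accepted");

  // Fire-and-forget: nothing limits how many reviews run at once, so several
  // heavy reviews end up competing for memory and CPU on the same machine.
  runCodeReview(req.body).catch((err) => {
    console.error("review failed", err);
  });
});

app.listen(3000);
```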
Code Review Characteristics
Code review is a heavy process for us.
To build proper context, we fetch all files touched in a pull request. For large pull requests, this can easily allocate between 1 and 2 GB of RAM during processing. This was already true before the migration.
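As a rough illustration of what that fetch involves, here is a sketch using Octokit against the GitHub API; the function name and the decision to buffer everything in memory are assumptions for the example, not our exact code.

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Illustrative: collect every changed file for a pull request, patches
// included. Buffering all of this at once is what drives the 1-2 GB peaks
// on large pull requests.
async function fetchChangedFiles(owner: string, repo: string, pullNumber: number) {
  return octokit.paginate(octokit.rest.pulls.listFiles, {
    owner,
    repo,
    pull_number: pullNumber,
    per_page: 100,
  });
}
```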
Limitations of the Old Setup
This approach worked up to a point, but it had clear limits.
Some reviews were slow, but usually tolerable. The bigger problems showed up under load:
- Heavy processing caused webhook timeouts and temporary webhook deactivation by Git providers such as GitHub and GitLab.
- When multiple reviews ran at the same time, the web application became unstable due to resource contention.
- We had very little observability. Without a queue, it was hard to reason about failures, measure performance, or retry failed work safely.
2. Why We Changed the Architecture
As usage increased, it became clear this setup would not scale.
We wanted better isolation between responsibilities, more control over concurrency, and proper observability around background processing. The migration was meant to give us better stability, predictable performance, and the ability to scale horizontally as demand grew.
3. New Architecture
Service Split
We split the monolith into three services:
- Webhooks, responsible only for receiving events from GitHub, GitLab, and others.
- API, used by the frontend and integrations.
- Workers, responsible for running code reviews.
Infrastructure Choices
We moved to AWS ECS to allow horizontal scaling and introduced RabbitMQ as a message broker to control concurrency and distribute work across workers.
The idea was that workers could scale independently based on queue depth, instead of everything competing for resources on a single machine.
Queue Implementation Reality
The queue implementation was partial.
We were not able to fully break the code review pipeline into small, independent jobs. Instead, the queue acted more like a concurrency gate than a true multi-stage pipeline. That was a known limitation, and the plan was to evolve it incrementally after the migration.
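To make the "concurrency gate" point concrete, here is a simplified sketch of how a whole review ends up as a single coarse-grained message, using amqplib; the queue name and payload shape are illustrative, not our actual schema.

```typescript
import amqp from "amqplib";

// Illustrative queue name and payload; the real setup differs in detail.
const QUEUE = "code-reviews";

interface ReviewJob {
  provider: "github" | "gitlab";
  repo: string;
  prNumber: number;
}

// One message per pull request: the entire review (fetch files, build
// context, call the LLM, post comments) runs inside a single job, so the
// queue only gates how many reviews run at once. A true multi-stage
// pipeline would publish separate, smaller messages per stage instead.
async function enqueueReview(job: ReviewJob): Promise<void> {
  const connection = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify(job)), { persistent: true });
  await channel.close();
  await connection.close();
}
```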
4. Incident Timeline
Initial Deploy – January 4
The migration was completed and the system appeared to be working under the new architecture.
Incident 1 – January 13
We ran into an issue with GitHub licenses affecting paying customers.
This happened because a late-night deploy intended to shut down the old infrastructure shipped an incorrect version. The issue was fixed around mid-afternoon, and service returned to normal shortly after.
Ongoing Issue: Slow Code Reviews
After the migration, we started getting reports that code reviews were taking much longer than usual. At that point, we began a deeper investigation into performance.
Finding 1: Technical Debt and Circular Dependencies
Because of existing technical debt and the way the services were split, we ended up with circular dependencies between modules. This forced us to duplicate parts of the codebase between the API and the workers.
The result was increased memory usage and slower execution.
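As a minimal two-module sketch of the kind of cycle involved (module and function names are hypothetical):

```typescript
// api/pull-requests.ts (hypothetical module)
import { scheduleReview } from "../workers/review"; // API depends on worker code

export function formatPullRequestTitle(title: string): string {
  return title.trim();
}

export async function retryReview(prId: string) {
  await scheduleReview(prId); // e.g. a "re-run review" action in the app
}
```

```typescript
// workers/review.ts (hypothetical module)
import { formatPullRequestTitle } from "../api/pull-requests"; // worker depends on API code

export async function scheduleReview(prId: string) {
  const title = formatPullRequestTitle(prId);
  // ...enqueue and run the review...
}
```

Once the API and the workers ship as separate services, a cycle like this can no longer be resolved through shared imports, so the overlapping code ends up duplicated in both deployables.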
Our first reaction was to scale workers, since that was easy to try and aligned with our initial assumptions.
That did not fix the problem.
Finding 2: Performance Issues Identified With Profiling
Using Pyroscope, we found several concrete issues:
- High memory usage and slow API queries, especially on the pull request list page.
- Machines crashing under load.
- N+1 queries spread across many parts of the system, including inside the code review process itself.
We fixed the most obvious N+1 queries and optimized several data access paths.
This had a real impact. Memory usage dropped significantly, the API stopped crashing, the pull request list became much faster, and worker CPU usage went down.
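For context, the N+1 shape we kept finding looks roughly like this; the data-access layer below is hypothetical, just enough to show the two query patterns.

```typescript
// Hypothetical data-access layer, only to illustrate the query shapes.
interface PullRequest { id: string; repoId: string; authorId: string; }
interface User { id: string; name: string; }
interface Db {
  pullRequests: { findByRepo(repoId: string): Promise<PullRequest[]> };
  users: {
    findById(id: string): Promise<User | null>;
    findByIds(ids: string[]): Promise<User[]>;
  };
}

// N+1 shape: one query for the pull requests, then one more query per row.
async function listWithAuthorsSlow(db: Db, repoId: string) {
  const pulls = await db.pullRequests.findByRepo(repoId);
  return Promise.all(
    pulls.map(async (pr) => ({
      ...pr,
      author: await db.users.findById(pr.authorId), // executed once per PR
    })),
  );
}

// Batched shape: two queries total, regardless of how many rows come back.
async function listWithAuthorsFast(db: Db, repoId: string) {
  const pulls = await db.pullRequests.findByRepo(repoId);
  const authorIds = [...new Set(pulls.map((pr) => pr.authorId))];
  const authors = await db.users.findByIds(authorIds);
  const byId = new Map(authors.map((u) => [u.id, u] as const));
  return pulls.map((pr) => ({ ...pr, author: byId.get(pr.authorId) ?? null }));
}
```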
Even with these improvements, the system was still unstable at peak load.
Scaling Workers Again
We increased the number of worker machines from three to five.
This improved throughput slightly, but reviews were still slow and the backlog continued to grow.
Finding 3: Prefetch Configuration
At this point, we looked more closely at how jobs were being consumed.
Roughly 95 percent of review time is spent waiting for LLM responses. Based on that, we had configured a high RabbitMQ prefetch value of 60 jobs per worker, assuming this would help hide LLM latency.
That assumption turned out to be flawed.
In the old fire-and-forget model, HTTP handlers returned immediately and work continued in the background. In the new setup, RabbitMQ consumers await the full execution of a job before acknowledging it.
With a high prefetch value, this meant dozens of heavy jobs were sitting in memory at the same time.
We reduced prefetch from 60 to 20. This helped, but the system was still not behaving as expected.
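A simplified amqplib sketch of the consumer behavior described above; the queue name and handler are illustrative. The prefetch value caps how many unacknowledged (in-flight) reviews a worker holds in memory at once.

```typescript
import amqp from "amqplib";

const QUEUE = "code-reviews"; // illustrative name

// Hypothetical stand-in for the actual review pipeline.
async function runCodeReview(job: unknown): Promise<void> {
  // fetch files, build context, call the LLM, post comments...
}

async function startWorker(): Promise<void> {
  const connection = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });

  // Prefetch bounds the number of unacknowledged messages this consumer may
  // hold. At 60, dozens of memory-heavy reviews could be in flight at once;
  // lowering it to 20 reduced that pressure.
  await channel.prefetch(20);

  await channel.consume(QUEUE, async (msg) => {
    if (!msg) return;
    try {
      // Unlike the old fire-and-forget handlers, the consumer awaits the
      // whole review before acknowledging, so the job occupies memory for
      // its entire duration.
      await runCodeReview(JSON.parse(msg.content.toString()));
      channel.ack(msg);
    } catch (err) {
      console.error("review failed", err);
      channel.nack(msg, false, false); // illustrative choice: drop or dead-letter
    }
  });
}
```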
Root Cause Discovery – January 23
After spending almost an entire day reviewing recent changes, deployment configs, metrics, and RabbitMQ behavior, we realized the issue was not in the workers themselves.
RabbitMQ was never actually running as a real cluster.
Each RabbitMQ node was operating independently, and workers connected to nodes at random. Depending on which node a worker landed on, it could be overloaded or completely idle. From the outside, this looked like slow processing. Internally, it was a badly unbalanced system.
The reason was a bug in our Terraform deployment. Private IPs were not configured for node auto-discovery, so the nodes never joined into a single cluster.
By the time we found this, there were roughly 900 pull requests waiting in the queue.
5. Resolution
Once the issue was clear, the fix was straightforward.
We corrected the Terraform configuration to include private IPs for auto-discovery. All three RabbitMQ nodes joined into a single cluster, queues were unified, and all five workers were able to consume work evenly.
The backlog of around 900 pull requests was processed in a matter of minutes.
6. Additional Issues Identified
While debugging this incident, we also identified other problems that need attention:
- Some BYOK models, including synthetic GLM variants and Gemini preview models, have extremely slow response times. We observed individual calls taking up to 30 minutes, while the p75 for other models is around three minutes (see the sketch after this list).
- Git providers are becoming increasingly restrictive with rate limits for bot-driven API usage. Our review process is intensive and not well optimized in this area, which leads to reviews being skipped when limits are hit.
- Error handling and transparency around skipped reviews are not good enough. This is a UX problem that leads to confusion and unnecessary support requests.
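As referenced in the first bullet, here is one common way to bound a single slow model call so it cannot hold a worker slot indefinitely; the function name and the five-minute cap are assumptions for illustration, not our production configuration.

```typescript
// Illustrative only: bound a single model call so one extremely slow BYOK
// model cannot block a worker for 30 minutes. The endpoint, payload, and
// timeout are hypothetical.
async function callModelWithTimeout(
  url: string,
  body: unknown,
  timeoutMs = 5 * 60 * 1000,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal, // aborts the request when the timer fires
    });
    if (!res.ok) throw new Error(`model call failed: ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}
```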
7. Next Steps
Priority 1
- Implement a public status page with automatic alerts for incidents and degraded services.
- Automate incident notifications to our Discord community.
- Reduce memory usage in the code review process, especially in the FetchChangedFilesStage.
- Improve error handling and graceful failure behavior in the web application.
- Expose clearer review statuses and detailed reasons when a review is skipped or fails.
- Add automated rules to our deployment setup to prevent Terraform misconfigurations.
Priority 2
- Split the code review process into smaller, independent jobs.
- Remove circular dependencies between modules.
- Address blocking behavior in slow BYOK models.