This is a transparent write-up of the issues we ran into over the last few weeks while migrating our architecture.
The goal here is not to justify decisions or hide mistakes. It is simply to explain what changed, what went wrong, how we debugged it, and what we are doing next.
Executive Summary
We migrated from a monolithic architecture to a service-based setup using RabbitMQ. Shortly after the migration, we started seeing instability and significant slowdowns in code reviews.
The root cause was not a single bug. It was a combination of factors that interacted badly: inefficient queries, high memory pressure, and a RabbitMQ deployment that was never actually running as a real cluster. Because of that, work was distributed unevenly across workers, which made the whole system look slow and unreliable under load.
1. Previous Context
Old Architecture
We were running everything on a single EC2 instance in AWS.
That machine handled webhook ingestion, code review processing, and the API used by the web app. There was no queue and no explicit concurrency control. Once a webhook arrived, processing started immediately and ran in the background.
There were no explicit memory or CPU limits; we relied on whatever Node.js and V8 did by default.
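For illustration, here is a minimal sketch of that fire-and-forget flow, assuming an Express-style handler; runCodeReview is a hypothetical stand-in for the review pipeline.

```typescript
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical stand-in for the review pipeline.
async function runCodeReview(payload: unknown): Promise<void> {
  // fetch changed files, build context, call the LLM, post comments...
}

// Old model: acknowledge the webhook immediately and let the review run in
// the background, with no queue, no retries, and no concurrency cap.
app.post("/webhooks/github", (req, res) => {
  res.status(202).send("accepted");

  // Fire-and-forget: nothing limits how many reviews run at once, so several
  // heavy reviews end up competing for memory and CPU on the same machine.
  runCodeReview(req.body).catch((err) => {
    console.error("review failed", err);
  });
});

app.listen(3000);
```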
Code Review Characteristics
Code review is a heavy process for us.
To build proper context, we fetch all files touched in a pull request. For large pull requests, this can easily allocate between 1 and 2 GB of RAM during processing. This was already true before the migration.
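As a rough illustration of what that fetch involves, here is a sketch using Octokit against the GitHub API; the function name and the decision to buffer everything in memory are assumptions for the example, not our exact code.

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Illustrative: collect every changed file for a pull request, patches
// included. Buffering all of this at once is what drives the 1-2 GB peaks
// on large pull requests.
async function fetchChangedFiles(owner: string, repo: string, pullNumber: number) {
  return octokit.paginate(octokit.rest.pulls.listFiles, {
    owner,
    repo,
    pull_number: pullNumber,
    per_page: 100,
  });
}
```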
Limitations of the Old Setup
This approach worked up to a point, but it had clear limits.
Some reviews were slow, but usually tolerable. The bigger problems showed up under load:
- Heavy processing caused webhook timeouts and temporary webhook deactivation by Git providers such as GitHub and GitLab.
- When multiple reviews ran at the same time, the web application became unstable due to resource contention.
- We had very little observability. Without a queue, it was hard to reason about failures, measure performance, or retry failed work safely.
2. Why We Changed the Architecture
As usage increased, it became clear this setup would not scale.
We wanted better isolation between responsibilities, more control over concurrency, and proper observability around background processing. The migration was meant to give us better stability, predictable performance, and the ability to scale horizontally as demand grew.
3. New Architecture
Service Split
We split the monolith into three services:
- Webhooks, responsible only for receiving events from GitHub, GitLab, and others.
- API, used by the frontend and integrations.
- Workers, responsible for running code reviews.
Infrastructure Choices
We moved to AWS ECS to allow horizontal scaling and introduced RabbitMQ as a message broker to control concurrency and distribute work across workers.
The idea was that workers could scale independently based on queue depth, instead of everything competing for resources on a single machine.
Queue Implementation Reality
The queue implementation was partial.
We were not able to fully break the code review pipeline into small, independent jobs. Instead, the queue acted more like a concurrency gate than a true multi-stage pipeline. That was a known limitation, and the plan was to evolve it incrementally after the migration.
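To make the "concurrency gate" point concrete, here is a simplified sketch of how a whole review ends up as a single coarse-grained message, using amqplib; the queue name and payload shape are illustrative, not our actual schema.

```typescript
import amqp from "amqplib";

// Illustrative queue name and payload; the real setup differs in detail.
const QUEUE = "code-reviews";

interface ReviewJob {
  provider: "github" | "gitlab";
  repo: string;
  prNumber: number;
}

// One message per pull request: the entire review (fetch files, build
// context, call the LLM, post comments) runs inside a single job, so the
// queue only gates how many reviews run at once. A true multi-stage
// pipeline would publish separate, smaller messages per stage instead.
async function enqueueReview(job: ReviewJob): Promise<void> {
  const connection = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify(job)), { persistent: true });
  await channel.close();
  await connection.close();
}
```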
4. Incident Timeline
Initial Deploy – January 4
The migration was completed and the system appeared to be working under the new architecture.
Incident 1 – January 13
We ran into an issue with GitHub licenses affecting paying customers.
This happened because a late-night deploy intended to shut down the old infrastructure shipped an incorrect version. The issue was fixed around mid-afternoon, and service returned to normal shortly after.
Ongoing Issue: Slow Code Reviews
After the migration, we started getting reports that code reviews were taking much longer than usual. At that point, we began a deeper investigation into performance.
Finding 1: Technical Debt and Circular Dependencies
Because of existing technical debt and the way the services were split, we ended up with circular dependencies between modules. This forced us to duplicate parts of the codebase between the API and the workers.
The result was increased memory usage and slower execution.
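As a minimal two-module sketch of the kind of cycle involved (module and function names are hypothetical):

```typescript
// api/pull-requests.ts (hypothetical module)
import { scheduleReview } from "../workers/review"; // API depends on worker code

export function formatPullRequestTitle(title: string): string {
  return title.trim();
}

export async function retryReview(prId: string) {
  await scheduleReview(prId); // e.g. a "re-run review" action in the app
}
```

```typescript
// workers/review.ts (hypothetical module)
import { formatPullRequestTitle } from "../api/pull-requests"; // worker depends on API code

export async function scheduleReview(prId: string) {
  const title = formatPullRequestTitle(prId);
  // ...enqueue and run the review...
}
```

Once the API and the workers ship as separate services, a cycle like this can no longer be resolved through shared imports, so the overlapping code ends up duplicated in both deployables.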
Our first reaction was to scale workers, since that was easy to try and aligned with our initial assumptions.
That did not fix the problem.
Finding 2: Performance Issues Identified With Profiling
Using Pyroscope, we found several concrete issues:
- High memory usage and slow API queries, especially on the pull request list page.
- Machines crashing under load.
- N+1 queries spread across many parts of the system, including inside the code review process itself.
We fixed the most obvious N+1 queries and optimized several data access paths.
This had a real impact. Memory usage dropped significantly, the API stopped crashing, the pull request list became much faster, and worker CPU usage went down.
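For context, the N+1 shape we kept finding looks roughly like this; the data-access layer below is hypothetical, just enough to show the two query patterns.

```typescript
// Hypothetical data-access layer, only to illustrate the query shapes.
interface PullRequest { id: string; repoId: string; authorId: string; }
interface User { id: string; name: string; }
interface Db {
  pullRequests: { findByRepo(repoId: string): Promise<PullRequest[]> };
  users: {
    findById(id: string): Promise<User | null>;
    findByIds(ids: string[]): Promise<User[]>;
  };
}

// N+1 shape: one query for the pull requests, then one more query per row.
async function listWithAuthorsSlow(db: Db, repoId: string) {
  const pulls = await db.pullRequests.findByRepo(repoId);
  return Promise.all(
    pulls.map(async (pr) => ({
      ...pr,
      author: await db.users.findById(pr.authorId), // executed once per PR
    })),
  );
}

// Batched shape: two queries total, regardless of how many rows come back.
async function listWithAuthorsFast(db: Db, repoId: string) {
  const pulls = await db.pullRequests.findByRepo(repoId);
  const authorIds = [...new Set(pulls.map((pr) => pr.authorId))];
  const authors = await db.users.findByIds(authorIds);
  const byId = new Map(authors.map((u) => [u.id, u] as const));
  return pulls.map((pr) => ({ ...pr, author: byId.get(pr.authorId) ?? null }));
}
```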
Even with these improvements, the system was still unstable at peak load.
Scaling Workers Again
We increased the number of worker machines from three to five.
This improved throughput slightly, but reviews were still slow and the backlog continued to grow.
Finding 3: Prefetch Configuration
At this point, we looked more closely at how jobs were being consumed.
Roughly 95 percent of review time is spent waiting for LLM responses. Based on that, we had configured a high RabbitMQ prefetch value of 60 jobs per worker, assuming this would help hide LLM latency.
That assumption turned out to be flawed.
In the old fire-and-forget model, HTTP handlers returned immediately and work continued in the background. In the new setup, RabbitMQ consumers await the full execution of a job before acknowledging it.
With a high prefetch value, this meant dozens of heavy jobs were sitting in memory at the same time.
We reduced prefetch from 60 to 20. This helped, but the system was still not behaving as expected.
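A simplified amqplib sketch of the consumer behavior described above; the queue name and handler are illustrative. The prefetch value caps how many unacknowledged (in-flight) reviews a worker holds in memory at once.

```typescript
import amqp from "amqplib";

const QUEUE = "code-reviews"; // illustrative name

// Hypothetical stand-in for the actual review pipeline.
async function runCodeReview(job: unknown): Promise<void> {
  // fetch files, build context, call the LLM, post comments...
}

async function startWorker(): Promise<void> {
  const connection = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });

  // Prefetch bounds the number of unacknowledged messages this consumer may
  // hold. At 60, dozens of memory-heavy reviews could be in flight at once;
  // lowering it to 20 reduced that pressure.
  await channel.prefetch(20);

  await channel.consume(QUEUE, async (msg) => {
    if (!msg) return;
    try {
      // Unlike the old fire-and-forget handlers, the consumer awaits the
      // whole review before acknowledging, so the job occupies memory for
      // its entire duration.
      await runCodeReview(JSON.parse(msg.content.toString()));
      channel.ack(msg);
    } catch (err) {
      console.error("review failed", err);
      channel.nack(msg, false, false); // illustrative choice: drop or dead-letter
    }
  });
}
```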
Root Cause Discovery – January 23
After spending almost an entire day reviewing recent changes, deployment configs, metrics, and RabbitMQ behavior, we realized the issue was not in the workers themselves.
RabbitMQ was never actually running as a real cluster.
Each RabbitMQ node was operating independently, and workers connected to nodes at random. Depending on which node a worker landed on, it could be overloaded or completely idle. From the outside, this looked like slow processing. Internally, it was a badly unbalanced system.
The reason was a bug in our Terraform deployment. Private IPs were not configured for node auto-discovery, so the nodes never joined into a single cluster.
By the time we found this, there were roughly 900 pull requests waiting in the queue.
5. Resolution
Once the issue was clear, the fix was straightforward.
We corrected the Terraform configuration to include private IPs for auto-discovery. All three RabbitMQ nodes joined into a single cluster, queues were unified, and all five workers were able to consume work evenly.
The backlog of around 900 pull requests was processed in a matter of minutes.
6. Additional Issues Identified
While debugging this incident, we also identified other problems that need attention:
- Some BYOK models, including synthetic GLM variants and Gemini preview models, have extremely slow response times. We observed individual calls taking up to 30 minutes, while the p75 for other models is around three minutes (see the sketch after this list).
- Git providers are becoming increasingly restrictive with rate limits for bot-driven API usage. Our review process is intensive and not well optimized in this area, which leads to reviews being skipped when limits are hit.
- Error handling and transparency around skipped reviews are not good enough. This is a UX problem that leads to confusion and unnecessary support requests.
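As referenced in the first bullet, here is one common way to bound a single slow model call so it cannot hold a worker slot indefinitely; the function name and the five-minute cap are assumptions for illustration, not our production configuration.

```typescript
// Illustrative only: bound a single model call so one extremely slow BYOK
// model cannot block a worker for 30 minutes. The endpoint, payload, and
// timeout are hypothetical.
async function callModelWithTimeout(
  url: string,
  body: unknown,
  timeoutMs = 5 * 60 * 1000,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
      signal: controller.signal, // aborts the request when the timer fires
    });
    if (!res.ok) throw new Error(`model call failed: ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}
```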
7. Next Steps
Priority 1
- Implement a public status page with automatic alerts for incidents and degraded services.
- Automate incident notifications to our Discord community.
- Reduce memory usage in the code review process, especially in the FetchChangedFilesStage.
- Improve error handling and graceful failure behavior in the web application.
- Expose clearer review statuses and detailed reasons when a review is skipped or fails.
- Add automated rules to our deployment setup to prevent Terraform misconfigurations.
Priority 2
- Split the code review process into smaller, independent jobs.
- Remove circular dependencies between modules.
- Address blocking behavior in slow BYOK models.