Engineering ResumeRoast: Building a Resilient Async Pipeline

Software Engineer focused on building scalable web applications using Python, React and AWS.

If you haven't read the previous posts, here's the short version: ResumeRoast is a full-stack app that takes your resume, runs it through an LLM, and gives you structured, brutal feedback. You can check it out here.

Before I wrote a single line of the processing logic, I already knew one thing for certain: this pipeline had to be async from day one.

When your core feature involves parsing a document, making an LLM call, and formatting a response, you're looking at anywhere from 3 to 10 seconds of work per request. Making a user stare at a spinner while an HTTP request hangs for 10 seconds isn't just a UX problem; it's an architecture problem. And blocking a server thread for that long while other users are waiting isn't just slow; it's a system waiting to fall over.

So async wasn't a refactor I did later. It was the starting point. This post is about how I designed that pipeline, the tradeoffs I consciously made, and the non-obvious things that will bite you if you're not looking for them.

Why Async, and Why Not Just asyncio?

The case for going async was obvious. The core work (parsing, LLM call, formatting) takes several seconds and hits an external API. No reason that should block a web server thread.

But the more interesting question was how. Python has asyncio built in. FastAPI is async-native. I could have just await-ed the LLM call and called it done.

I didn't, and here's why: asyncio gives you concurrency, not isolation. If the LLM call hangs for 30 seconds, that coroutine is still tied to your web server's event loop. You're also not getting retries, worker management, or persistence for free. The moment you need any of that, you're building a task queue anyway, just a worse one.
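A minimal sketch of what sharing an event loop means in practice, using only the standard library. `blocking_handler` is a hypothetical stand-in for any synchronous call that sneaks into a coroutine (a sync SDK client, a CPU-heavy parse); it is not code from ResumeRoast.

```python
import asyncio
import time


async def heartbeat(ticks: list[float], stop: asyncio.Event) -> None:
    """Record a timestamp every 10 ms, like a server handling other requests."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_handler() -> None:
    """Hypothetical handler that makes a synchronous call inside a coroutine."""
    time.sleep(0.2)  # never yields to the loop, so nothing else can run


async def main() -> float:
    ticks: list[float] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0.05)   # heartbeat ticks normally
    await blocking_handler()    # heartbeat is frozen for the whole call
    await asyncio.sleep(0.05)   # heartbeat resumes
    stop.set()
    await hb
    # Largest pause between two consecutive heartbeat ticks, in seconds.
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


largest_gap = asyncio.run(main())
```

The heartbeat stops ticking for the full duration of the blocking call. A separate worker process avoids that coupling, because a stuck task there cannot stall request handling.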

Going with a proper task queue from the start meant the processing pipeline was completely decoupled from the API layer. The web server's only job is to accept requests and drop tasks onto a queue. Everything else happens independently.

User → POST /resume → job_id (instant)
                           ↓
                        [Queue]
                           ↓
                    [Celery Worker]
                           ↓
                         parse → LLM → format → save

The API stays fast. Workers process at their own pace. And if something downstream breaks, it breaks in isolation, not in the middle of serving a user request.

The Design: Queue - Broker - Worker

Once I committed to a task queue architecture, the pieces fell into place quickly. A queue sits between the API and the workers. The API drops a message in. A worker picks it up. Neither side needs to know about the other.

This decoupling is what makes the system resilient. If a worker crashes, the message can be redelivered instead of lost (given the acknowledgement settings covered later in this post). If the LLM API goes down for a few minutes, tasks accumulate and drain when it comes back up. The queue absorbs pressure that would otherwise crash the system.
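The buffering behaviour can be illustrated with a toy in-memory queue. This is only a sketch of the idea, not how Celery or Redis work internally; `enqueue`, `drain`, and the `llm_available` flag are hypothetical names made up for the illustration.

```python
from collections import deque

queue: deque[str] = deque()


def enqueue(resume_id: str) -> None:
    """What the API does: record the job and return immediately."""
    queue.append(resume_id)


def drain(processed: list[str], llm_available: bool) -> None:
    """What a worker does: pull jobs only while the downstream service is up."""
    while queue and llm_available:
        processed.append(queue.popleft())


processed: list[str] = []

for rid in ("r1", "r2", "r3"):
    enqueue(rid)                         # uploads keep succeeding during the outage

drain(processed, llm_available=False)    # service is down: nothing is consumed
backlog_during_outage = len(queue)

drain(processed, llm_available=True)     # service is back: backlog drains in order
```

During the simulated outage the backlog grows to three jobs, and the API side never notices. Once the service recovers, the jobs are processed in arrival order.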

For ResumeRoast the choice was Celery as the task queue layer, with Redis as the broker. Here's what the actual setup looked like:

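# celery_app.py
from celery import Celery

# db and llm_client below are the app's own data-access and LLM client
# modules; their imports are omitted for brevity.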

app = Celery(
    'resumeroast',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1'
)

@app.task
def analyze_resume(resume_id: str):
    resume = db.get_resume(resume_id)
    feedback = llm_client.analyze(resume.text)
    db.save_feedback(resume_id, feedback)
    db.update_status(resume_id, 'complete')

Calling it from the FastAPI route:

@router.post("/resume")
async def upload_resume(file: UploadFile):
    resume_id = db.save_resume(file)
    analyze_resume.delay(resume_id)  # enqueues the task; doesn't wait for the result
    return {"job_id": resume_id}

That .delay() is the handoff. The API is done. The worker takes it from there. Starting workers is just:

celery -A celery_app worker --loglevel=info

Why Redis and Not AWS SQS?

I'm AWS SAA certified, so the natural instinct was to reach for SQS. It's managed, durable, and built for exactly this kind of workload. I know the service well. And I still chose Redis, deliberately.

The honest tradeoff: Redis is in-memory. If the instance goes down without persistence configured, you can lose your queue. I knew this going in. For ResumeRoast at this stage, I made the call that it was an acceptable risk. I enabled AOF persistence to mitigate it, but I wasn't going to lose sleep over a few dropped jobs on a personal project.

What Redis gave me in return was zero friction. One line in docker-compose.yml and I had a broker running locally that behaved identically to production. No IAM policies to configure, no AWS credentials in my dev environment, no per-message costs to track while iterating. Celery's Redis integration is rock solid and the local development loop was just faster.

SQS is the right answer at scale or in a team environment where you need strong durability guarantees and you're already in the AWS ecosystem. Its visibility timeout semantics and native DLQ support are genuinely well designed. But for an indie project where I wanted to move fast, the operational overhead wasn't worth it at this stage.

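The best part: if I outgrow Redis, migrating to SQS is mostly a configuration change in Celery: point the broker URL at SQS, install the SQS transport dependencies, and keep a separate result backend, since the SQS transport doesn't store results. The task code doesn't change at all.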
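For reference, a sketch of what that broker switch could look like in a Celery config module, based on Celery's documented SQS transport settings (installed via the `celery[sqs]` extra). The region and timeout values are placeholders, and keeping Redis as the result backend is an assumption here, since the SQS transport doesn't store task results.

```python
# celeryconfig.py (hypothetical SQS variant; all values are placeholders)

# With no credentials in the URL, the transport falls back to the usual AWS
# sources (environment variables, instance or task role).
broker_url = "sqs://"

broker_transport_options = {
    "region": "us-east-1",        # assumed region, adjust to your own
    "visibility_timeout": 3600,   # seconds; should exceed the longest task
}

# SQS is a broker only, so results still need somewhere to live.
result_backend = "redis://localhost:6379/1"
```

The task definitions and the `.delay()` call sites stay as they are; only this module changes.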

I chose Redis knowing its limits, not because I didn't know SQS existed.

Where It Gets Real: Designing for Failure

This is the part that actually took thought. Getting a task queue working is easy. Getting it to behave correctly when things go wrong is the real engineering.

Retries, But Do Them Right

The LLM API is the most likely failure point: rate limits, timeouts, occasional 500s. A naive retry immediately on failure just hammers an already struggling service. Exponential backoff is the standard answer and it's the right one.

@app.task(bind=True, max_retries=3)
def analyze_resume(self, resume_id: str):
    try:
        resume = db.get_resume(resume_id)
        feedback = llm_client.analyze(resume.text)
        db.save_feedback(resume_id, feedback)
    except Exception as exc:
        raise self.retry(
            exc=exc,
            countdown=2 ** self.request.retries  # 1s → 2s → 4s
        )

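bind=True gives the task access to self, which is what enables self.retry(). The countdown doubles on each attempt. With max_retries=3 the task runs at most four times (one initial attempt plus three retries); once the retries are exhausted, self.retry() re-raises the original exception and the task is marked as failed.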
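One common refinement is to cap the delay and add random jitter, so that many tasks failing at the same moment don't all retry in lockstep. The helper below is a hypothetical sketch, not code from ResumeRoast; its result could be passed as the `countdown` argument.

```python
import random


def backoff_delay(retries: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = True) -> float:
    """Seconds to wait before the next attempt; `retries` starts at 0."""
    delay = min(cap, base * 2 ** retries)
    if jitter:
        # "Full jitter": pick uniformly in [0, delay] to spread retries out.
        return random.uniform(0, delay)
    return delay


# Without jitter this reproduces the 1s, 2s, 4s schedule from the task above.
schedule = [backoff_delay(n, jitter=False) for n in range(3)]
```

Celery also has built-in task options for this (`autoretry_for`, `retry_backoff`, `retry_backoff_max`, `retry_jitter`), which may be preferable to a hand-rolled helper if automatic retries fit your task.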

Retries Are Only Safe If Your Task Is Idempotent

A lot of people add retries without thinking about this. If a task partially completes and then fails, you need the retry to be safe to run again. Otherwise you get duplicate side effects: two emails sent, two rows inserted, two LLM calls billed.

For ResumeRoast the fix was simple:

@app.task(bind=True, max_retries=3)
def analyze_resume(self, resume_id: str):
    if db.feedback_exists(resume_id):
        return  # already done, skip silently

    resume = db.get_resume(resume_id)  # load the resume before calling the LLM
    feedback = llm_client.analyze(resume.text)
    db.save_feedback(resume_id, feedback)
Check before acting. Now a retry is just a rerun, not a duplicate operation.
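To see the guard at work without a database or an LLM, here is a small self-contained simulation. The dictionary and `fake_llm_analyze` are stand-ins invented for this sketch; only the check-before-acting shape mirrors the task above.

```python
feedback_store: dict[str, str] = {}   # stands in for the feedback table
llm_calls: list[str] = []             # one entry per (billable) LLM request


def fake_llm_analyze(text: str) -> str:
    """Hypothetical replacement for llm_client.analyze()."""
    llm_calls.append(text)
    return f"feedback for: {text}"


def analyze_resume(resume_id: str, text: str) -> None:
    if resume_id in feedback_store:   # same role as db.feedback_exists()
        return                        # already done, skip silently
    feedback_store[resume_id] = fake_llm_analyze(text)


analyze_resume("r1", "ten years of Excel")
analyze_resume("r1", "ten years of Excel")   # simulated retry or redelivery
```

The second call returns early, so the LLM is called once and one row is stored. Check-then-act is not atomic, though: if two workers can run the same job at the same time, a unique constraint on the resume ID is a common backstop.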

Dead Letter Queues: Because Silent Failures Are the Worst Kind

After 3 retries, what happens to a task that still fails? In Celery's default config, very little. The failure is logged and its state is written to the result backend, but the message is gone from the queue and nothing alerts you. The user's resume never gets feedback and you have no idea unless you go looking.

This is where a Dead Letter Queue (DLQ) should live. The idea is straightforward: instead of silently dropping a failed task, you route it to a separate queue where it waits. You can inspect it, alert on it, and replay it when you're ready. Nothing disappears.

I haven't wired one up in ResumeRoast yet. Right now, when a task exhausts all its retries, it updates the resume status to 'failed' in the database and I can query those manually. It looks like this:

@app.task(bind=True, max_retries=3)
def analyze_resume(self, resume_id: str):
    try:
        ...
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            db.update_status(resume_id, 'failed')  # at least it's visible
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

It works at this scale. But it's a known gap: I can see that something failed, but I can't easily replay it without manual intervention. The proper fix is a dedicated DLQ, either as a Redis list for simplicity or a proper SQS Dead Letter Queue if I migrate the broker. That's the next thing I'd add before taking this anywhere near production traffic.
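A rough sketch of the simple variant, with an in-memory deque standing in for the Redis list so that it runs anywhere. The function names and the payload shape are assumptions made for illustration; with Redis, the two deque operations would become list push and pop commands on a dedicated key.

```python
import json
from collections import deque

dead_letters: deque[str] = deque()   # stand-in for a Redis list


def send_to_dlq(task_name: str, args: list, error: str) -> None:
    """Store enough information to inspect and replay the task later."""
    dead_letters.appendleft(json.dumps(
        {"task": task_name, "args": args, "error": error}
    ))


def replay_dlq(handlers: dict) -> int:
    """Re-run every dead-lettered task, oldest first; return the count."""
    replayed = 0
    while dead_letters:
        entry = json.loads(dead_letters.pop())
        handlers[entry["task"]](*entry["args"])
        replayed += 1
    return replayed


# A task that exhausted its retries gets parked instead of vanishing.
send_to_dlq("analyze_resume", ["r42"], "LLM timeout")

# Later, once the underlying problem is fixed, replay everything.
replayed_ids: list[str] = []
count = replay_dlq({"analyze_resume": replayed_ids.append})
```

In the real pipeline the handler would presumably be `analyze_resume.delay`, so that a replay goes back through the normal queue with its retries intact.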

The Two Celery Settings Nobody Tells You About

These two caught me off guard and they'll catch you too if you don't know about them.

By default, Celery acknowledges a task the moment a worker picks it up, not when it finishes. So if your worker process dies mid-execution, the task is marked as done and gone. This is almost certainly not what you want.

# celeryconfig.py
task_acks_late = True           # acknowledge only after task completes
worker_prefetch_multiplier = 1  # one task at a time per worker

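task_acks_late moves the acknowledgement to after completion. If the worker dies mid-job, the message is never acknowledged and the broker redelivers it; with Redis, that happens once the visibility timeout expires (one hour by default). Note that if only the child process running the task is killed, Celery still acknowledges the message unless task_reject_on_worker_lost is also enabled. worker_prefetch_multiplier = 1 stops workers from reserving more tasks than they can execute right now, which matters when your tasks are long-running and you don't want a single worker hoarding the queue.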

These aren't optional polish. Without them, your pipeline can silently lose work under failure conditions that are completely normal in production.
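The difference between the two acknowledgement modes is easy to see in a toy model. This is a deliberately simplified sketch: `consume_once` is a made-up function, and real brokers track unacknowledged messages separately rather than leaving them at the head of the queue.

```python
from collections import deque


def consume_once(queue: deque, acks_late: bool, worker_dies: bool) -> None:
    """Toy model of one delivery: when does the message leave the queue?"""
    if not queue:
        return
    if not acks_late:
        queue.popleft()      # early ack: the broker forgets the message on pickup
    if worker_dies:
        return               # process killed before the task finishes
    # ... the task body would run here ...
    if acks_late:
        queue.popleft()      # late ack: removed only after the task succeeds


early = deque(["r1"])
consume_once(early, acks_late=False, worker_dies=True)   # job is lost

late = deque(["r1"])
consume_once(late, acks_late=True, worker_dies=True)     # job is still queued
```

With early acknowledgement the crashed job is simply gone. With late acknowledgement it is still there to be picked up again, which is also why the task has to be idempotent.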

The Full Picture

With all of this in place, here's the complete ResumeRoast flow:

User uploads resume
  → FastAPI saves it, returns job_id instantly
  → analyze_resume.delay(resume_id) fires
  → Task lands in Redis queue
  → Celery worker picks it up (acks only after completion)
  → LLM call fails? Retry with backoff (up to 3 attempts)
  → Still failing? Update status to 'failed' (a DLQ is the planned next step)
  → Success? Save feedback, update status to 'complete'
  → User polls GET /resume/{job_id}/status → gets their feedback

Every failure is retried, and every job that exhausts its retries lands somewhere visible. The API stays fast regardless of what workers are doing.

The system isn't over-engineered; it's appropriately engineered for where the project is right now, with clear upgrade paths if the scale demands it.

Thank you for reading the article. If you found it informative or interesting, please give it a thumbs up. I would highly appreciate it if you could share it with your friends as well.