Assessment of Async Tools in Building AI Tools: My Experience with HTMX, FastAPI, LLM, SSE, and Jinja



The author, contemplating async tools for his 2001 Honda Insight hybrid. Just kidding.

Building internal AI tools is both exciting and daunting. Our user base includes both staff and partners who expect a relatively seamless experience. At Oppkey, I’ve been exploring how asynchronous (async) tools can improve our AI-powered applications—particularly those that rely on large language models (LLMs) handling multiple users waiting for responses at the same time.

Why Async Matters for LLM-Powered Tools

One of the main challenges I face is the inherent wait time for LLM responses. Even internally, no one likes waiting in uncertainty for a response that might take 30 seconds, a minute, or sometimes even longer. This is especially true when the output may need further correction or refinement.

Working with AI requires human guidance and lots of iterations. The traditional approach—where a user submits a prompt and then stares at a spinning icon—feels outdated and frustrating, especially when compared to the fluid, streaming outputs seen in tools like ChatGPT.

That’s why I believe streaming data to the user as it becomes available is crucial. Not only does it make the wait feel shorter, but it also reassures users that the system is working and gives them a sense of progress. In demos, I often talk over the wait and highlight other features, but the delay is noticeable. If partners or staff see the tool in action and experience these waits, it can negatively affect their perception and willingness to use the tool.

The Tech Stack: HTMX, FastAPI, LLMs, SSE, and Jinja

We are testing multiple technology stacks. Here’s how we are approaching this challenge currently:

  • FastAPI: This modern Python web framework is built for async operations. It allows me to handle multiple requests concurrently, but its basic structure is similar to Django, which we have used extensively.
  • HTMX: HTMX enables dynamic web interfaces without heavy JavaScript. It lets me update parts of the page in real-time as new data streams in, making the UI feel more responsive and interactive. This is simple but currently very exciting (to me, anyway).
  • Server-Sent Events (SSE): SSE is a simple way to push updates from the server to the client. By combining SSE with FastAPI, LLM output can be streamed to the browser as it’s generated, rather than waiting for the entire response (a minimal sketch follows this list).
  • Jinja: As a templating engine, Jinja works well for rendering HTML on the server side. It keeps the codebase simple and maintainable.
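
To make this concrete, here is a minimal sketch of how these pieces can fit together. It is illustrative only, not our production code: the `fake_llm_stream()` generator is a hypothetical stand-in for a real LLM client, and the endpoint paths and template names are made up.

```python
# minimal_sse_demo.py -- illustrative sketch; the LLM call is faked
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, StreamingResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")  # assumes a templates/ folder


async def fake_llm_stream():
    """Stand-in for a streaming LLM client; yields text chunks with a delay."""
    for word in "streamed output arrives one chunk at a time".split():
        await asyncio.sleep(0.3)
        yield word + " "


@app.get("/", response_class=HTMLResponse)
async def index(request: Request):
    # templates/index.html would load htmx plus its SSE extension and contain
    # something like: <div hx-ext="sse" sse-connect="/stream" sse-swap="message"></div>
    return templates.TemplateResponse("index.html", {"request": request})


@app.get("/stream")
async def stream():
    async def event_source():
        async for chunk in fake_llm_stream():
            # SSE frames are "data: ...\n\n"; htmx swaps each one into the page
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```

On the browser side, the HTMX SSE extension opens the connection declared in sse-connect and swaps each incoming message into the target element, so no custom JavaScript is needed.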

User Experience: Why Streaming Matters

Streaming output isn’t just a technical improvement—it’s a key part of the user experience. Here’s why:

  • Immediate Feedback: Users see results as they’re generated, reducing anxiety and uncertainty.
  • Perceived Performance: Even if the total wait time is unchanged, streaming makes the process feel faster and more engaging.
  • Encourages Use: When staff and partners see that the tool is responsive, they’re more likely to use it. This is important even with internal tools, even when there’s an explicit understanding that the UI does not need to be “polished.”

Lessons Learned and Next Steps

We’re not planning to sell our AI tools externally at this point, but internal adoption is likely a key part of our strategy moving forward. Async streaming helps close the gap between building these tools and getting people to actually use them.

Looking ahead, I plan to continue refining the streaming experience and explore further ways to optimize performance. Async tools will make our internal AI tool more usable, more enjoyable, and ultimately more valuable to our team.


Comment on Reddit About the Possible Need for Async

There is an interesting discussion on Reddit about why Oppkey is looking at async tools.

We think we need async because multiple people are making multiple LLM calls. We do not have thousands of requests a second. However, we do have streams of data coming in from multiple sources, including our Django database: PostgreSQL with pgvector, which serves as our vector database.

We have things working in Django, but it is confusing for us because the application was originally written as synchronous code.

These are my initial notes:
htmx-tutorials/docs/djangjo-vs-fastapi.md at main · codetricity/htmx-tutorials · GitHub

Our previous experience was with Django pulling data from PostgreSQL, which is generally fast. In the past, we had some problems waiting for a complex sort to finish processing, but we generally solved that by optimizing the SQL calls or splitting the work up.

Now, we have a call to an LLM which might take several minutes.

I want several people to be able to start several LLM calls at the same time and do other things while they wait. So there may be hundreds, but not thousands, of requests running at the same time.
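
As a rough illustration of why async fits this pattern, here is a toy example of my own (not production code) that simulates a few hundred slow "LLM calls" in a single process. Because each call spends its time awaiting I/O, the event loop interleaves them instead of blocking.

```python
# toy demonstration: hundreds of slow, I/O-bound "LLM calls" in one process
import asyncio
import time


async def fake_llm_call(i: int) -> str:
    # stands in for awaiting OpenAI/Anthropic over the network
    await asyncio.sleep(5)  # pretend the model takes 5 seconds
    return f"response {i}"


async def main() -> None:
    start = time.perf_counter()
    # 300 concurrent "requests": roughly the 'hundreds, not thousands' case
    results = await asyncio.gather(*(fake_llm_call(i) for i in range(300)))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} calls finished in {elapsed:.1f}s")  # ~5s, not 1500s


if __name__ == "__main__":
    asyncio.run(main())
```

A sync worker model would need one worker tied up per in-flight call, which is why the long LLM waits push us toward async.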

The LLM will eventually be OpenAI or Anthropic. Our experience is that the responses are slow.

Is there an easier way I should look at?

Your assessment is likely influenced by this set of videos, which focus on streaming.

  1. FastAPI Introduction - Publish HTML Directly - https://youtu.be/fmrQVbrQ9kw
  2. FastAPI Streaming to HTMX with SSE - https://youtu.be/D5l_A_kqUhI
  3. FastAPI and Ollama - Getting Response with HTMX - https://youtu.be/El_-vCpxmTQ
  4. HTMX with Stream of Chunks from LLM - https://youtu.be/pL86FqeRX08

In addition, we can do the following:

  • send messages on status while the person is waiting
  • run a report in the background (sketched below)
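
Both items map onto standard FastAPI features. The sketch below is a guess at how we might wire them up: periodic status messages over the same kind of SSE channel, and FastAPI's BackgroundTasks for kicking off a report the user does not need to wait for. The `generate_report` function and endpoint paths are hypothetical.

```python
import asyncio

from fastapi import BackgroundTasks, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.get("/status-stream")
async def status_stream():
    """Push a status line every few seconds so the user knows work is happening."""
    async def events():
        for step in ("queued", "calling the LLM", "formatting the answer"):
            yield f"data: {step}\n\n"
            await asyncio.sleep(3)  # placeholder for real progress checks
        yield "data: done\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")


def generate_report(report_id: str) -> None:
    # hypothetical long-running job; runs after the HTTP response is returned
    ...


@app.post("/reports/{report_id}")
async def run_report(report_id: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(generate_report, report_id)
    return {"status": "report started", "report_id": report_id}
```

For heavier reports we would probably move to a proper task queue, but BackgroundTasks covers the simple internal case.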