The author, contemplating async tools for his 2001 Honda Insight hybrid. Just kidding.
Building internal AI tools is both exciting and daunting. Our user base includes both staff and partners who expect a relatively seamless experience. At Oppkey, I’ve been exploring how asynchronous (async) tools can improve our AI-powered applications, particularly those built on large language models (LLMs) where multiple users may be waiting for responses at the same time.
Why Async Matters for LLM-Powered Tools
One of the main challenges I face is the inherent wait time for LLM responses. Even internally, no one likes waiting in uncertainty for a response that might take 30 seconds, a minute, or sometimes even longer. This is especially true when the output may need further correction or refinement.
Working with AI requires human guidance and lots of iterations. The traditional approach—where a user submits a prompt and then stares at a spinning icon—feels outdated and frustrating, especially when compared to the fluid, streaming outputs seen in tools like ChatGPT.
That’s why I believe streaming data to the user as it becomes available is crucial. Not only does it make the wait feel shorter, but it also reassures users that the system is working and gives them a sense of progress. In demos, I often talk over the wait and highlight other features, but the delay is noticeable. If partners or staff see the tool in action and experience these waits, it can negatively affect their perception and willingness to use the tool.
The Tech Stack: HTMX, FastAPI, LLMs, SSE, and Jinja
We are testing multiple technology stacks. Here’s how we’re currently approaching this challenge (a minimal sketch of how the pieces fit together follows the list):
- FastAPI: This modern Python web framework is built for async operations. It allows me to handle multiple requests concurrently, and its basic structure is similar to Django’s, which we have used extensively.
- HTMX: HTMX enables dynamic web interfaces without heavy JavaScript. It lets me update parts of the page in real time as new data streams in, making the UI feel more responsive and interactive. This is simple but currently very exciting (to me, anyway).
- Server-Sent Events (SSE): SSE is a simple way to push updates from the server to the client. Combined with FastAPI, it lets me stream LLM output to the browser as it’s generated, rather than waiting for the entire response.
- Jinja: As a templating engine, Jinja works well for rendering HTML on the server side. It keeps the codebase simple and maintainable.
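To make this concrete, here’s a minimal sketch of how these pieces can fit together, assuming the htmx SSE extension is loaded on the page. The route paths, the fake `llm_stream()` generator, and the inline template are illustrative stand-ins for our actual LLM client and markup, not production code.

```python
# Minimal sketch: FastAPI serves a Jinja-rendered page that uses the htmx SSE
# extension to consume a streaming endpoint. llm_stream() is a stand-in for a
# real streaming LLM call; route names and markup are illustrative only.
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, StreamingResponse
from jinja2 import Template

app = FastAPI()

# Assumes htmx and its SSE extension are already loaded on the page.
# sse-connect opens the EventSource; each "message" event's data is appended
# (hx-swap="beforeend") into the inner div as chunks arrive.
PAGE = Template("""
<div hx-ext="sse" sse-connect="/stream?prompt={{ prompt | urlencode }}">
  <div sse-swap="message" hx-swap="beforeend"></div>
</div>
""")


async def llm_stream(prompt: str):
    """Stand-in for a streaming LLM client; yields text chunks as they arrive."""
    for chunk in ["Thinking ", "about ", prompt, "..."]:
        await asyncio.sleep(0.5)  # simulate model latency
        yield chunk


@app.get("/", response_class=HTMLResponse)
async def index(prompt: str = "async tools"):
    return PAGE.render(prompt=prompt)


@app.get("/stream")
async def stream(request: Request, prompt: str = "async tools"):
    async def event_source():
        async for chunk in llm_stream(prompt):
            if await request.is_disconnected():
                break  # stop generating if the browser navigates away
            # SSE frames are plain text: a "data:" line followed by a blank line.
            yield f"data: {chunk}\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```

Run it locally with something like `uvicorn main:app --reload` (assuming the file is main.py), and the page fills in chunk by chunk instead of sitting behind a spinner until the whole response is ready.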
User Experience: Why Streaming Matters
Streaming output isn’t just a technical improvement—it’s a key part of the user experience. Here’s why:
- Immediate Feedback: Users see results as they’re generated, reducing anxiety and uncertainty.
- Perceived Performance: Even if the total wait time is unchanged, streaming makes the process feel faster and more engaging.
- Encourages Use: When staff and partners see that the tool is responsive, they’re more likely to use it. This matters even for internal tools, where there’s an explicit understanding that the UI does not need to be “polished.”
Lessons Learned and Next Steps
We’re not planning to sell our AI tools externally at this point, but internal adoption is likely a key part of our strategy moving forward. Async streaming helps close the gap between a tool that technically works and one that people actually want to use.
Looking ahead, I plan to continue refining the streaming experience and explore further ways to optimize performance. Async tools will make our internal AI tools more usable, more enjoyable, and ultimately more valuable to our team.