In this post, we’ll explore a testing approach we used at Lincoln Loop for a Django project that interacts with large language models (LLMs). This approach helps us stay on top of a rapidly evolving API landscape and efficiently respond to breaking changes in third-party libraries.
This post focuses on LangChain, but a similar approach could be used with a different library. The question remains the same: how can we test our LLM application without relying on mocks?
Why LangChain?
We chose LangChain for a large LLM project in 2023 because it was one of the first full-featured libraries available. Similar to how Django’s ORM abstracts the database, LangChain provides an abstraction around different LLM providers (OpenAI, Anthropic, etc.).
LangChain allowed us to integrate and evaluate different LLM providers without having to rewrite application code.
The problem
Testing interactions with external services like LLMs can be challenging; tests shouldn’t connect to live services, but they should capture what an application sends (prompts, contextual information) and how it behaves in response to the messages it receives.
In other words, we need to be able to inspect the requests sent and programmatically define responses, all in isolation from other tests.
It may be tempting to use unittest.mock.patch to mock functions that make external API calls, but this approach has drawbacks. Mocks tend to couple tests to specific implementations, making refactoring harder, and they can reduce the surface area of code execution, hiding potential errors.
Why mocking falls short
Let’s consider an example using mocks. The code below uses a Chain object to interact with the LLM; how it works isn’t important. What is important to know is that in the first versions of LangChain, requests were sent to the LLM by using the __call__ method on the Chain instance. This implementation has since been deprecated and replaced by the invoke() method.
As of LangChain 1.0, the following code raises an exception:
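(A representative sketch of that legacy pattern; the prompt, model, and placeholder API key are illustrative rather than the project’s actual code.)

```python
# Legacy-style LangChain code (roughly the pre-0.1 API). These import paths
# and the LLMChain class have since been deprecated and removed.
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(openai_api_key="not-a-real-key")  # placeholder key
prompt = PromptTemplate.from_template("Answer this question: {question}")
chain = LLMChain(llm=llm, prompt=prompt)

# Early LangChain: send the request by calling the chain instance directly.
# On LangChain 1.0 this line (and LLMChain itself) raises instead.
result = chain({"question": "What is the best web framework?"})

# The replacement API:
# result = chain.invoke({"question": "What is the best web framework?"})
```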
However, if the __call__ method was mocked as shown below, tests would continue to pass and obfuscate the problem. This is a common pitfall with mocks.
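A sketch of such a test is shown below; ask() stands in for a hypothetical application helper that builds the chain from the previous snippet and calls it.

```python
# The mock hides the deprecated code path entirely.
from unittest import mock

from langchain.chains import LLMChain


def test_question_is_answered():
    # __call__ is replaced wholesale, so the deprecated call is never
    # exercised: the test stays green even on LangChain versions where
    # calling the chain directly raises an exception.
    with mock.patch.object(LLMChain, "__call__", return_value={"text": "Django!"}):
        answer = ask("What is the best web framework?")  # hypothetical helper
    assert answer == "Django!"
```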
A better way
What if instead of mocks, we used a test-specific LLM backend? This approach could help avoid the scenario shown above by invoking the client library API rather than mocking it, while also preventing requests from being made to a live service provider.
Pushing this further, this testing backend should allow us to:
- Set expected responses dynamically in a test.
- Inspect the messages being sent to the LLM, ensuring that the application behaves as intended.
By removing mocks, we eliminate brittle dependencies on specific implementations while gaining the ability to thoroughly validate our system’s interactions with the LLM.
Faking it
Let’s consider an example application using LangChain’s ChatModel. There are dozens of LLM integrations provided by the library, including fake ones. While these are not well documented, they are intended for testing purposes.
Let’s take a look at one of these: the FakeListChatModel class. It’s instantiated with a list of predefined responses, which are returned in order each time the model is invoked.
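For example, assuming a recent langchain-core release:

```python
from langchain_core.language_models import FakeListChatModel

fake_llm = FakeListChatModel(responses=["Django!", "Python!"])

fake_llm.invoke("What is the best web framework?")         # -> AIMessage(content="Django!")
fake_llm.invoke("What is the best programming language?")  # -> AIMessage(content="Python!")
```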
Let’s see how we can use this in a test suite.
Example project
To demonstrate, let’s create a small Django project. It will expose a single REST API endpoint that forwards a question to an LLM and returns the response.
We’ll use nanodjango to create a single-file Django project, and we’ll use pytest with the pytest-django plugin for our test suite.
The complete example can be found at: https://github.com/lincolnloop/fake-chat-example
Let’s begin with a test that defines a happy-path scenario:
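(A sketch: the /ask/ endpoint, JSON payload, and expected answer are assumptions made for this post; the linked repository contains the real project. The client fixture comes from pytest-django.)

```python
# tests.py -- happy path: POST a question, get back the LLM's answer.
def test_ask_returns_an_answer(client):
    response = client.post(
        "/ask/",
        {"question": "What is the best web framework?"},
        content_type="application/json",
    )

    assert response.status_code == 200
    assert response.json() == {"answer": "Django!"}
```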
Then, we’ll write some code to make it pass, using the FakeListChatModel:
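(Again a sketch, built on nanodjango’s route decorator; the file name, URL, and hardcoded answer are assumptions.)

```python
# app.py -- single-file nanodjango project with a hardcoded fake LLM.
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from langchain_core.language_models import FakeListChatModel
from nanodjango import Django

app = Django()


@app.route("/ask/")
@csrf_exempt
def ask(request):
    # Both the response and the question are hardcoded for now.
    chat_model = FakeListChatModel(responses=["Django!"])
    answer = chat_model.invoke("What is the best web framework?")
    return JsonResponse({"answer": answer.content})
```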
This test passes, but this isn’t exactly a useful implementation: we aren’t actually reading the POST data, and the messages are hardcoded. We also need to define our responses in our test setup, and assert that the message sent to the LLM is the value posted by the user.
We’ll begin by creating a helper function responsible for returning an instance of the chat model. In our example, it only returns a FakeListChatModel instance, but this helper can be extended to support the LLM provider of your choice in production.
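A minimal version of that helper might look like this (returning a fake with an empty response list is an assumption; tests will append to it later):

```python
# app.py -- central place to construct the chat model used by the view.
from langchain_core.language_models import FakeListChatModel


def get_chat_model() -> FakeListChatModel:
    # In production this could inspect settings and return ChatOpenAI,
    # ChatAnthropic, etc. Here we always return a fake with no canned
    # responses; tests supply them.
    return FakeListChatModel(responses=[])
```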
Now, in order to manipulate the responses from our tests, we need to have access to the chat model object used in the view. One possible approach is to memoize the get_chat_model function using the lru_cache decorator.
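For example:

```python
from functools import lru_cache

from langchain_core.language_models import FakeListChatModel


@lru_cache
def get_chat_model() -> FakeListChatModel:
    # Memoized: the view and the test suite now share the same instance.
    return FakeListChatModel(responses=[])
```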
Now, each time get_chat_model() gets called, the same object is returned. To isolate tests from each other and avoid responses leaking from one test to another, we can reset the function’s cache so a fresh object gets created when called again.
We can leverage a pytest fixture for this. It can provide our tests with the chat model object and clear the cache after each test function executes:
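(A sketch; the app module name matches the single-file layout assumed above.)

```python
# conftest.py -- share the chat model with tests and reset it between them.
import pytest

from app import get_chat_model


@pytest.fixture
def chat_model():
    yield get_chat_model()
    # Drop the memoized instance so responses can't leak into the next test.
    get_chat_model.cache_clear()
```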
Inspecting messages
At the time of writing, LangChain’s fake chat models don’t provide a way to inspect messages sent to the LLM. The FakeListChatModel source shows that messages are passed around internally, so by extending this class, we can capture the messages.

We’ll override the _call method and bind messages to the object so we can inspect them and write assertions in our tests.
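One possible sketch, using a hypothetical RecordingFakeChatModel subclass; messages is declared as a field so the pydantic-based model accepts the assignment:

```python
# A fake chat model that records the messages it receives.
from langchain_core.language_models import FakeListChatModel
from langchain_core.messages import BaseMessage


class RecordingFakeChatModel(FakeListChatModel):
    messages: list[BaseMessage] = []

    def _call(self, messages, stop=None, run_manager=None, **kwargs):
        # Keep a reference to the messages before delegating to the stock
        # implementation, which returns the next canned response.
        self.messages = messages
        return super()._call(messages, stop=stop, run_manager=run_manager, **kwargs)
```

With this in place, get_chat_model() would return a RecordingFakeChatModel instead of a plain FakeListChatModel.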
To get a sense of what this looks like in practice, let’s fiddle around in the Python REPL. First, get the chat model, define a response, and invoke it:
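(Output abbreviated; the exact AIMessage fields vary by langchain-core version.)

```pycon
>>> from app import get_chat_model
>>> chat_model = get_chat_model()
>>> chat_model.responses.append("Django!")
>>> chat_model.invoke("What is the best web framework?")
AIMessage(content='Django!', ...)
```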
Now, let’s take a look at the new messages property we defined:
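(Abbreviated as before.)

```pycon
>>> chat_model.messages
[HumanMessage(content='What is the best web framework?', ...)]
```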
As expected, there’s the message we sent!
Let’s write a new test to assert that the question posted to the endpoint is what’s sent to the LLM:
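(A sketch building on the client and chat_model fixtures; the endpoint and payload are the same assumptions as before.)

```python
def test_question_is_sent_to_llm(client, chat_model):
    chat_model.responses.append("Django!")

    client.post(
        "/ask/",
        {"question": "What is the best web framework?"},
        content_type="application/json",
    )

    # The message handed to the LLM is the question the user posted.
    assert chat_model.messages[-1].content == "What is the best web framework?"
```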
Now, this fails because we haven’t actually implemented that behaviour in the view. So, the updated implementation could look like:
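(Still a sketch; app and get_chat_model are as defined earlier in the same file.)

```python
# app.py -- the view now forwards the posted question to the chat model.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@app.route("/ask/")
@csrf_exempt
def ask(request):
    question = json.loads(request.body)["question"]
    chat_model = get_chat_model()
    answer = chat_model.invoke(question)
    return JsonResponse({"answer": answer.content})
```

Since the hardcoded response is gone, the earlier happy-path test would now also supply its expected answer through the chat_model fixture.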
See the complete example.
Our experience at Lincoln Loop
This testing approach proved its worth for us at Lincoln Loop. The Retrieval-Augmented Generation (RAG) pipeline we began developing for a client in 2023 relied on an API that has since been deprecated. Using a custom LLM backend for our tests, like the one shown here, produced a test suite that is loosely coupled to the implementation. We were able to rewrite the RAG pipeline, a core part of the application, without having to change any of the hundreds of related tests. Our test suite served us by supporting the rewrite rather than getting in its way.
Closing thoughts
Using a fake LLM backend which provides the ability to set responses and inspect messages was key to building a robust and reliable LLM application. As shown, this approach:
- Avoids the pitfalls of using mocks
- Makes refactoring possible
While this strategy is rooted in the idea of keeping tests loosely coupled to implementations, it is tied to LangChain. Replacing it with another library would require changes to the test suite. A library-agnostic solution could be achieved by stubbing the underlying network layer. But if the application uses multiple providers, this can quickly become impractical. Depending on LangChain has been an acceptable trade-off.
What do you think?
Happy testing!
Photo by Bartek Pawlik. Provided by DEFNA.