In this post, we’ll explore a testing approach we used at Lincoln Loop for a Django project that interacts with large language models (LLMs). This approach helps us stay on top of a rapidly evolving API landscape and efficiently respond to breaking changes in third-party libraries.
This post focuses on LangChain, but a similar approach could be used with a different library. The question remains the same: how can we test our LLM application without relying on mocks?
Why LangChain?
We chose LangChain for a large LLM project in 2023 because it was one of the first full-featured libraries available. Similar to how Django’s ORM abstracts the database, LangChain provides an abstraction around different LLM providers (OpenAI, Anthropic, etc.).
LangChain allowed us to integrate and evaluate different LLM providers without having to rewrite application code.
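As a rough sketch of what that looks like (this code is not from the project in this post; it assumes the langchain-openai and langchain-anthropic packages are installed and the helper name is hypothetical), swapping providers can be as small as changing which chat model class gets instantiated:
# sketch.py (illustrative only; requires provider API keys in the environment)
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI


def build_chat_model(provider: str):
    # Both classes implement the same chat model interface, so calling code
    # can use .invoke() without knowing which provider sits behind it.
    if provider == "openai":
        return ChatOpenAI(model="gpt-4o-mini")  # model names are illustrative
    return ChatAnthropic(model="claude-3-5-sonnet-latest")


ai_message = build_chat_model("openai").invoke("Hello!")
print(ai_message.content)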
The problem
Testing interactions with external services like LLMs can be challenging; tests shouldn’t connect to live services, but they should capture what an application sends (prompts, contextual information) and how it behaves in response to the messages it receives.
In other words, we need to be able to inspect the requests sent and programmatically define responses, all in isolation from other tests.
It may be tempting to use unittest.mock.patch to mock functions that make external API calls, but this approach has drawbacks. Mocks tend to couple tests to specific implementations, making refactoring harder, and they can reduce the surface area of code execution, hiding potential errors.
Why mocking falls short
Let’s consider an example using mocks. The code below uses a Chain object to interact with the LLM; how it works isn’t important. What is important to know is that in the first versions of LangChain, requests were sent to the LLM by using the __call__ method on the Chain instance. This implementation has since been deprecated and replaced by the invoke() method.
As of LangChain 1.0, the following code raises an exception:
# views.py
from langchain.chains import LLMChain


def chat_view(request):
    chain = LLMChain(...)
    result = chain(request.POST)  # TypeError: object is not callable
    return {"response": result}
However, if the __call__ method were mocked as shown below, the tests would continue to pass and mask the problem. This is a common pitfall with mocks.
from unittest.mock import patch

from langchain.chains import Chain


def test_mocked_llm(client):
    with patch.object(Chain, "__call__", return_value={"output": "Hi there!"}):
        response = client.post(
            "/api/chat/",
            {"input": "Hello, world!"},
            content_type="application/json",
        )

    assert response.json()["response"] == {"output": "Hi there!"}
A better way
What if, instead of mocks, we used a test-specific LLM backend? This approach would avoid the scenario shown above by invoking the client library’s API rather than mocking it, while still preventing requests from reaching a live service provider.
Pushing this further, this testing backend should allow us to:
- Set expected responses dynamically in a test.
- Inspect the messages being sent to the LLM, ensuring that the application behaves as intended.
By removing mocks, we eliminate brittle dependencies on specific implementations while gaining the ability to thoroughly validate our system’s interactions with the LLM.
Faking it
Let’s consider an example application using LangChain’s ChatModel. There are dozens of LLM integrations provided by the library, including fake ones. While these are not well documented, they are intended for testing purposes.
Let’s take a look at one of these: the FakeListChatModel class. It’s instantiated with a list of predefined responses which are returned when the model is invoked.
>>> from langchain_community.chat_models.fake import FakeListChatModel
>>> chat_model = FakeListChatModel(responses=["What do you think?"])
>>> chat_model.invoke("What's the meaning of life?")
AIMessage(
    content='What do you think?',
    additional_kwargs={},
    response_metadata={},
    id='run-58c73e87-098b-48e1-9f38-a11001fb7139-0',
)
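Based on the class’s source at the time of writing, successive invocations return the responses in order, which is what makes it possible to queue up answers from a test. A quick REPL check:
>>> chat_model = FakeListChatModel(responses=["first answer", "second answer"])
>>> chat_model.invoke("one question").content
'first answer'
>>> chat_model.invoke("another question").content
'second answer'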
Let’s see how we can use this in a test suite.
Example project
To demonstrate, let’s create a small Django project. It will expose a single REST API endpoint that forwards a question to an LLM and returns the response.
We’ll use nanodjango to create a single-file Django project, and we’ll use pytest with the pytest-django plugin for our test suite.
The complete example can be found at: https://github.com/lincolnloop/fake-chat-example
Let’s begin with a test that defines a happy-path scenario:
# tests/test_chat.py
def test_valid_post_returns_answer_from_llm(client):
    question = "What is the meaning of life?"
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )

    answer = "What gives your life meaning?"
    assert response.status_code == 200
    result = response.json()
    expected = {"answer": answer}
    assert result == expected
Then, we’ll write some code to make it pass, using the FakeListChatModel:
# chat.py
from langchain_community.chat_models.fake import FakeListChatModel
from nanodjango import Django

app = Django()


@app.api.post("/chat/")
def chat(request):
    chat_model = FakeListChatModel(responses=["What gives your life meaning?"])
    ai_message = chat_model.invoke("What is the meaning of life?")
    answer = ai_message.content
    return {"answer": answer}
The test passes, but this isn’t exactly a useful implementation: we aren’t actually reading the POST data, and the messages are hardcoded.
We also need to be able to define responses in the test setup and assert that the message sent to the LLM is the value posted by the user.
We’ll begin by creating a helper function responsible for returning an instance of the chat model. In our example, it only returns a FakeListChatModel instance, but this helper can be extended to support the LLM provider of your choice in production.
# chat.py
...


def get_chat_model():
    return FakeListChatModel(responses=["What gives your life meaning?"])


@app.api.post("/chat/")
def chat(request):
    chat_model = get_chat_model()
    ai_message = chat_model.invoke("What is the meaning of life?")
    answer = ai_message.content
    return {"answer": answer}
Now, in order to manipulate the responses from our tests, we need access to the chat model object used in the view. One possible approach is to memoize the get_chat_model function using the lru_cache decorator.
# chat.py
...
from functools import lru_cache
...


@lru_cache
def get_chat_model():
    return FakeListChatModel(responses=[])
Now, each time get_chat_model() gets called, the same object is returned. To isolate tests from each other and avoid responses leaking from one test to another, we can reset the function’s cache so a fresh object gets created on the next call.
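A quick REPL check illustrates the idea (assuming the chat.py shown above):
>>> from chat import get_chat_model
>>> get_chat_model() is get_chat_model()  # lru_cache returns the same instance
True
>>> first = get_chat_model()
>>> get_chat_model.cache_clear()  # reset the cache, as our fixture will do
>>> get_chat_model() is first  # a fresh instance is created on the next call
False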
We can leverage a pytest fixture for this: it provides our tests with the chat model object and clears the cache after each test function executes:
# tests/test_chat.py
import pytest

from chat import get_chat_model


@pytest.fixture
def fake_chat_model():
    chat_model = get_chat_model()
    yield chat_model  # The shared instance
    get_chat_model.cache_clear()  # Clear before next test


def test_valid_post_returns_answer_from_llm(client, fake_chat_model):
    answer = "What gives your life meaning?"
    fake_chat_model.responses.append(answer)
    question = "What is the meaning of life?"
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )

    assert response.status_code == 200
    result = response.json()
    expected = {"answer": answer}
    assert result == expected
Inspecting messages
At the time of writing, LangChain’s fake chat models don’t provide a way to inspect the messages sent to the LLM. The FakeListChatModel source shows that messages are passed around internally, so by extending this class, we can capture them.
We’ll override the _call method and bind the messages to the object so we can inspect them and write assertions in our tests.
# chat.py
from functools import lru_cache

from langchain_community.chat_models.fake import FakeListChatModel
from langchain_core.messages import BaseMessage
from pydantic.fields import Field


class FakeChatModel(FakeListChatModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    responses: list[str] = Field(default_factory=list)

    def _call(self, messages: list[BaseMessage], *args, **kwargs) -> str:
        self.messages.extend(messages)  # Intercept & save messages
        return super()._call(messages, *args, **kwargs)


@lru_cache
def get_chat_model():
    return FakeChatModel()
...
To get a sense of what this looks like in practice, let’s fiddle around in the Python REPL: get the chat model, define a response, and invoke it:
>>> from chat import get_chat_model
>>> chat_model = get_chat_model()
>>> chat_model.responses.append("Hi")
>>> chat_model.invoke("Hello")
AIMessage(
    content='Hi',
    additional_kwargs={},
    response_metadata={},
    id='run-51c43dbe-c72d-44ad-aa3b-9e63f2f4b210-0',
)
Now, let’s take a look at the new messages attribute we defined:
>>> chat_model.messages
[HumanMessage(content='Hello', additional_kwargs={}, response_metadata={})]
As expected, there’s the message we sent!
Let’s write a new test to assert that the question posted to the endpoint is what’s sent to the LLM:
# tests/test_chat.py
from langchain_core.messages import HumanMessage

...


def test_messages_sent_to_llm(client, fake_chat_model):
    question = "What is the answer to the ultimate question of life?"
    fake_chat_model.responses.append("42")
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )

    assert response.status_code == 200
    expected = [HumanMessage(content=question)]
    result = fake_chat_model.messages
    assert result == expected
Now, this fails because we haven’t actually implemented that behaviour in the view. So, the updated implementation could look like:
# chat.py
from functools import lru_cache

from langchain_community.chat_models.fake import FakeListChatModel
from langchain_core.messages import BaseMessage
from nanodjango import Django
from pydantic.fields import Field

app = Django()


class FakeChatModel(FakeListChatModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    responses: list[str] = Field(default_factory=list)

    def _call(self, messages: list[BaseMessage], *args, **kwargs) -> str:
        self.messages.extend(messages)
        return super()._call(messages, *args, **kwargs)


@lru_cache
def get_chat_model():
    return FakeChatModel()


class RequestData(app.ninja.Schema):
    question: str


@app.api.post("/chat/")
def chat(request, data: RequestData):
    chat_model = get_chat_model()
    ai_message = chat_model.invoke(data.question)
    answer = ai_message.content
    return {"answer": answer}
# tests/test_chat.py
import pytest
from langchain_core.messages import HumanMessage

from chat import get_chat_model


@pytest.fixture
def fake_chat_model():
    chat_model = get_chat_model()
    yield chat_model
    get_chat_model.cache_clear()


def test_valid_post_returns_answer_from_llm(client, fake_chat_model):
    answer = "What gives your life meaning?"
    fake_chat_model.responses.append(answer)
    question = "What is the meaning of life?"
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )

    assert response.status_code == 200
    result = response.json()
    expected = {"answer": answer}
    assert result == expected


def test_messages_sent_to_llm(client, fake_chat_model):
    question = "What is the answer to the ultimate question of life?"
    fake_chat_model.responses.append("42")
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )

    assert response.status_code == 200
    expected = [HumanMessage(content=question)]
    result = fake_chat_model.messages
    assert result == expected
See the complete example.
Our experience at Lincoln Loop
This testing approach proved its worth for us at Lincoln Loop. The Retrieval-Augmented Generation (RAG) pipeline we began developing for a client in 2023 relied on an API that has since been deprecated. Using a custom LLM backend for our tests, like the one shown here, produced a test suite that is loosely coupled to the implementation. We were able to rewrite the RAG pipeline, a core part of the application, without having to change any of the hundreds of related tests. Our test suite supported the rewrite rather than getting in its way.
Closing thoughts
Using a fake LLM backend that lets us set responses and inspect messages was key to building a robust and reliable LLM application. As shown, this approach:
- Avoids the pitfalls of using mocks
- Makes refactoring possible
While this strategy is rooted in the idea of keeping tests loosely coupled to implementations, it is still tied to LangChain: replacing the library would require changes to the test suite. A library-agnostic solution could be achieved by stubbing the underlying network layer, but if the application uses multiple providers, this can quickly become impractical. For us, depending on LangChain has been an acceptable trade-off.
What do you think?
Happy testing!
Photo by Bartek Pawlik. Provided by DEFNA.