
Avoiding Mocks: Testing LLM Applications with LangChain in Django

In this post, we’ll explore a testing approach we used at Lincoln Loop for a Django project that interacts with large language models (LLMs). This approach helps us stay on top of a rapidly evolving API landscape and efficiently respond to breaking changes in third-party libraries.

This post focuses on LangChain, but a similar approach could be used with a different library. The question remains the same: how can we test our LLM application without relying on mocks?

Why LangChain?

We chose LangChain for a large LLM project in 2023 because it was one of the first full-featured libraries available. Similar to how Django’s ORM abstracts the database, LangChain provides an abstraction around different LLM providers (OpenAI, Anthropic, etc.).

LangChain allowed us to integrate and evaluate different LLM providers without having to rewrite application code.
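
To make the abstraction concrete, here’s a minimal sketch. The package names are real LangChain integrations, but the model names are illustrative and an API key is assumed to be configured; application code that calls invoke() doesn’t care which provider sits behind it.

# sketch.py (illustrative only)
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic

# Assumes OPENAI_API_KEY is set in the environment.
chat_model = ChatOpenAI(model="gpt-4o-mini")
# Swapping providers is a one-line change; the calling code stays the same:
# chat_model = ChatAnthropic(model="claude-3-5-sonnet-latest")

answer = chat_model.invoke("Explain Django's ORM in one sentence.")
print(answer.content)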

The problem

Testing interactions with external services like LLMs can be challenging; tests shouldn’t connect to live services, but they should capture what an application sends (prompts, contextual information) and how it behaves in response to the messages it receives.

In other words, we need to be able to inspect the requests that are sent and programmatically define the responses, and we need to do this in isolation from other tests.

It may be tempting to use unittest.mock.patch to mock functions that make external API calls, but this approach has drawbacks. Mocks tend to couple tests to specific implementations, making refactoring harder, and they can reduce the surface area of code execution, hiding potential errors.

Why mocking falls short

Let’s consider an example using mocks. The code below uses a Chain object to interact with the LLM–how it works isn’t important. What is important to know is that in the first versions of LangChain, requests were sent to the LLM by using the __call__ method on the Chain instance. This implementation has since been deprecated and replaced by the invoke() method.

As of LangChain 1.0, the following code raises an exception:

# views.py
from langchain.chains import LLMChain


def chat_view(request):
    chain = LLMChain(...)
    result = chain(request.POST)  #  TypeError: object is not callable
    return {"response": result}

However, if the __call__ method were mocked as shown below, the tests would continue to pass and mask the problem. This is a common pitfall with mocks.

from unittest.mock import patch

from langchain.chains import LLMChain


def test_mocked_llm(client):
    # Patch the deprecated __call__ entry point; the real code path never runs.
    with patch.object(LLMChain, "__call__", return_value={"output": "Hi there!"}):
        response = client.post(
            "/api/chat/",
            {"input": "Hello, world!"},
            content_type="application/json",
        )

    assert response.json()["response"] == {"output": "Hi there!"}

A better way

What if, instead of mocks, we used a test-specific LLM backend? This approach helps avoid the scenario shown above by exercising the client library’s API rather than mocking it, while still preventing requests from being made to a live service provider.

Pushing this further, this testing backend should allow us to:

  • Set expected responses dynamically in a test.
  • Inspect the messages being sent to the LLM, ensuring that the application behaves as intended.

By removing mocks, we eliminate brittle dependencies on specific implementations while gaining the ability to thoroughly validate our system’s interactions with the LLM.

Faking it

Let’s consider an example application using LangChain’s ChatModel. There are dozens of LLM integrations provided by the library, including fake ones. While these are not well documented, they are intended for testing purposes.

Let’s take a look at one of these:  the FakeListChatModel class. It’s instantiated with a list of predefined responses which are returned when the model is invoked.

>>> from langchain_community.chat_models.fake import FakeListChatModel
>>> chat_model = FakeListChatModel(responses=["What do you think?"])
>>> chat_model.invoke("What's the meaning of life?")
AIMessage(
  content='What do you think?',
  additional_kwargs={},
  response_metadata={},
  id='run-58c73e87-098b-48e1-9f38-a11001fb7139-0',
)

Let’s see how we can use this in a test suite.

Example project

To demonstrate, let’s create a small Django project. It will expose a single REST API endpoint that forwards a question to an LLM and returns the response.

We’ll use nanodjango to create a single-file Django project, and we’ll use pytest with the pytest-django plugin for our test suite.

The complete example can be found at: https://github.com/lincolnloop/fake-chat-example
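
One piece of setup glue worth noting: pytest-django normally looks for a settings module, while with nanodjango the Django settings are configured when the app object is created. One plausible way to bridge the two is a small conftest.py; the linked repository may wire this up differently.

# tests/conftest.py (a sketch -- check the example repository for the actual setup)

def pytest_configure():
    # Importing the app module instantiates nanodjango's Django(), which
    # configures Django settings before pytest-django's fixtures need them.
    import chat  # noqa: F401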

Let’s begin with a test that defines a happy-path scenario:

# tests/test_chat.py

def test_valid_post_returns_answer_from_llm(client):
    question = "What is the meaning of life?"

    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )
    answer = "What gives your life meaning?"
    assert response.status_code == 200

    result = response.json()
    expected = {"answer": answer}

    assert result == expected

Then, we’ll write some code to make it pass, using the FakeListChatModel:

# chat.py

from langchain_community.chat_models.fake import FakeListChatModel
from nanodjango import Django

app = Django()

@app.api.post("/chat/")
def chat(request):
    chat_model = FakeListChatModel(responses=["What gives your life meaning?"])
    ai_message = chat_model.invoke("What is the meaning of life?")
    answer = ai_message.content
    return {"answer": answer}

This test passes, but it isn’t exactly a useful implementation: we aren’t actually reading the POST data, and both the question and the response are hardcoded.

We also need to define our responses in our test setup, and assert that the message sent to the LLM is the value posted by the user. 

We’ll begin by creating a helper function responsible for returning an instance of the chat model. In our example, it only returns a FakeListChatModel instance, but this helper can be extended to support the LLM provider of your choice in production.

# chat.py
...

def get_chat_model():
    return FakeListChatModel(responses=["What gives your life meaning?"])


@app.api.post("/chat/")
def chat(request):
    chat_model = get_chat_model()
    ai_message = chat_model.invoke("What is the meaning of life?")
    answer = ai_message.content
    return {"answer": answer}

Now, in order to manipulate the responses from our tests, we need to have access to the chat model object used in the view. One possible approach is to memoize the get_chat_model function using the lru_cache decorator.

# chat.py
...
from functools import lru_cache
...

@lru_cache
def get_chat_model():
    return FakeListChatModel(responses=[])

Now, each time get_chat_model() gets called, the same object is returned. To isolate tests from each other and avoid responses leaking from one test to another, we can reset the function’s cache so that a fresh object is created the next time the function is called.
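
In a REPL, the memoization and the cache reset look like this:

>>> from chat import get_chat_model
>>> first = get_chat_model()
>>> first is get_chat_model()  # memoized: the same instance is returned
True
>>> get_chat_model.cache_clear()
>>> first is get_chat_model()  # after clearing, a fresh instance is created
False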

We can leverage a pytest fixture for this–it provides our tests with the chat model object and clears the cache after each test finishes:

# tests/test_chat.py
import pytest

from chat import get_chat_model


@pytest.fixture
def fake_chat_model():
    chat_model = get_chat_model()
    yield chat_model  # The shared instance
    get_chat_model.cache_clear()  # Reset so the next test gets a fresh instance


def test_valid_post_returns_answer_from_llm(client, fake_chat_model):
    answer = "What gives your life meaning?"
    fake_chat_model.responses.append(answer)

    question = "What is the meaning of life?"
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )
    assert response.status_code == 200

    result = response.json()
    expected = {"answer": answer}

    assert result == expected

Inspecting messages

At the time of writing, LangChain’s fake chat models don’t provide a way to inspect the messages sent to the LLM. The FakeListChatModel source shows that messages are passed around internally, so by extending this class we can capture them.

We’ll override the _call method and bind messages to the object so we can inspect them and write assertions in our tests.

# chat.py
from functools import lru_cache

from langchain_community.chat_models.fake import FakeListChatModel
from langchain_core.messages import BaseMessage
from pydantic.fields import Field


class FakeChatModel(FakeListChatModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    responses: list[str] = Field(default_factory=list)

    def _call(self, messages: list[BaseMessage], *args, **kwargs) -> str:
        self.messages.extend(messages)  # Intercept & save messages
        return super()._call(messages, *args, **kwargs)


@lru_cache
def get_chat_model():
    return FakeChatModel()
...

To get a sense of what this looks like in practice, let’s fiddle around in the Python REPL: get the chat model, define a response, and invoke it:

>>> from chat import get_chat_model
>>> chat_model = get_chat_model()
>>> chat_model.responses.append("Hi")
>>> chat_model.invoke("Hello")
AIMessage(
  content='Hi',
  additional_kwargs={},
  response_metadata={},
  id='run-51c43dbe-c72d-44ad-aa3b-9e63f2f4b210-0',
)

Now, let’s take a look at the new messages field we defined:

>>> chat_model.messages
[HumanMessage(content='Hello', additional_kwargs={}, response_metadata={})]

As expected, there’s the message we sent!

Let’s write a new test to assert that the question posted to the endpoint is what’s sent to the LLM:

# tests/test_chat.py
from langchain_core.messages import HumanMessage
...

def test_messages_sent_to_llm(client, fake_chat_model):
    question = "What is the answer to the ultimate question of life?"
    fake_chat_model.responses.append("42")
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )
    assert response.status_code == 200

    expected = [HumanMessage(content=question)]
    result = fake_chat_model.messages

    assert result == expected

Now, this fails because we haven’t actually implemented that behaviour in the view. So, the updated implementation could look like:

# chat.py

from functools import lru_cache

from langchain_community.chat_models.fake import FakeListChatModel
from langchain_core.messages import BaseMessage
from nanodjango import Django
from pydantic.fields import Field

app = Django()


class FakeChatModel(FakeListChatModel):
    messages: list[BaseMessage] = Field(default_factory=list)
    responses: list[str] = Field(default_factory=list)

    def _call(self, messages: list[BaseMessage], *args, **kwargs) -> str:
        self.messages.extend(messages)
        return super()._call(messages, *args, **kwargs)


@lru_cache
def get_chat_model():
    return FakeChatModel()


class RequestData(app.ninja.Schema):
    question: str


@app.api.post("/chat/")
def chat(request, data: RequestData):
    chat_model = get_chat_model()
    ai_message = chat_model.invoke(data.question)
    answer = ai_message.content
    return {"answer": answer}

And the complete test file, for reference:

# tests/test_chat.py
import pytest

from langchain_core.messages import HumanMessage
from chat import get_chat_model


@pytest.fixture
def fake_chat_model():
    chat_model = get_chat_model()
    yield chat_model
    get_chat_model.cache_clear()


def test_valid_post_returns_answer_from_llm(client, fake_chat_model):
    answer = "What gives your life meaning?"
    fake_chat_model.responses.append(answer)

    question = "What is the meaning of life?"
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )
    assert response.status_code == 200

    result = response.json()
    expected = {"answer": answer}

    assert result == expected


def test_messages_sent_to_llm(client, fake_chat_model):
    question = "What is the answer to the ultimate question of life?"
    fake_chat_model.responses.append("42")
    response = client.post(
        "/api/chat/",
        {"question": question},
        content_type="application/json",
    )
    assert response.status_code == 200

    expected = [HumanMessage(content=question)]
    result = fake_chat_model.messages

    assert result == expected

See the complete example.

Our experience at Lincoln Loop

This testing approach proved its worth for us at Lincoln Loop. The Retrieval-Augmented Generation (RAG) pipeline we began developing for a client in 2023 relied on an API that has since been deprecated. Using a custom LLM backend for our tests, like the one shown here, produced a test suite that is loosely coupled to the implementation. We were able to rewrite the RAG pipeline, a core part of the application, without having to change any of the hundreds of related tests. Our test suite supported the rewrite rather than getting in its way.

Closing thoughts

Using a fake LLM backend that lets us set responses and inspect messages was key to building a robust and reliable LLM application. As shown, this approach:

  1. Avoids the pitfalls of using mocks
  2. Makes refactoring possible

While this strategy is rooted in the idea of keeping tests loosely coupled to implementations, it is still tied to LangChain: replacing the library would require changes to the test suite. A library-agnostic solution could be achieved by stubbing the underlying network layer, but if the application uses multiple providers, that can quickly become impractical. For us, depending on LangChain has been an acceptable trade-off.
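
For a sense of what the network-layer approach looks like, here’s a rough sketch using the respx library to stub httpx calls to OpenAI’s chat completions endpoint. It assumes the view uses a real ChatOpenAI model, that an API key is configured, and that the minimal payload below is enough for the client to parse; none of this is part of the example project.

# tests/test_chat_network_stub.py (hypothetical -- not part of the example project)
import httpx
import respx

FAKE_COMPLETION = {
    "id": "chatcmpl-fake",
    "object": "chat.completion",
    "created": 0,
    "model": "gpt-4o-mini",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": "42"},
        }
    ],
}


@respx.mock
def test_chat_over_stubbed_network(client):
    # Every POST to the chat completions endpoint returns the canned payload.
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json=FAKE_COMPLETION)
    )
    response = client.post(
        "/api/chat/",
        {"question": "What is the answer to the ultimate question of life?"},
        content_type="application/json",
    )
    assert response.json() == {"answer": "42"}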

What do you think?

Happy testing!



About the author

Marc Gibbons

Marc caught the programming bug as a child when the internet was still text-based and accessed by a 9600 baud modem. His career path is unique; he initially studied music and played the oboe professionally with …
