Confident AI QuickStart
Are you following best LLM evaluation practices? Without a serious evaluation workflow, your testing results aren't really valid, and you might be wasting a lot of time iterating on the wrong things.
Confident AI is the LLM evaluation platform for DeepEval. It is native to DeepEval and was designed for teams building LLM applications to maximize their performance and safeguard against unsatisfactory LLM outputs. While DeepEval's open-source metrics are great for running evaluations, there is much more to building a robust LLM evaluation workflow than collecting metric scores.
If you're serious about LLM evaluation, Confident AI is for you.
Apart from running the actual evaluations, you'll need a way to:
- Curate a robust testing dataset
- Perform LLM benchmark analysis
- Tailor evaluation metrics to your opinions
- Improve your testing dataset over time
Confident AI enables this by offering an opinionated, centralized platform to manage everything mentioned above. For you, this means more accurate, informative, and faster insights, and a clear way to identify performance gaps and figure out how to improve your LLM system.
Why Confident AI?
If your team has ever tried building its own LLM evaluation pipeline, here is the list of problems it has likely encountered (and it's a long one):
Dataset Curation Is Fragmented And Annoying
- Your team often juggles tools like Google Sheets or Notion to curate and update datasets, leading to constant back-and-forth between engineers and domain-expert annotators.
- There is no "source of truth", since datasets aren't kept in sync with the codebase you run evaluations from.
Evaluation Results Are (Still) Vibe Checks Rather Than Experimentation
- You basically just look at failing test cases, but they don't provide actionable insights, and sharing them across your team is hard.
- It’s impossible to compare benchmarks side-by-side to understand how changes impact performance for each unit test, making it more guesswork than experimentation.
Testing Data Are Static With No Easy Way To Keep Them Updated
- Your LLM application's needs and priorities evolve in production, but your datasets don't.
- Figuring out how to query and incorporate real-world interactions into evaluation datasets is tedious and error-prone.
Building A/B Testing Infrastructure Is Hard And Current Tools Don't Cut It
- Setting up A/B testing for prompts/models to route traffic between versions is easy, but figuring out which version performed better and on what areas is hard.
- Tools like PostHog or Mixpanel give user-level analytics, while other LLM observability tools focus too much on cost and latency, none of which tell you anything about the end output quality.
Human Feedback Doesn't Lead to Improvements
- Teams spend time collecting feedback from end-users or internal reviewers, but there’s no clear path to integrate it back into datasets.
- A lot of manual effort is needed to make good use of feedback, and without the right tooling that effort ends up wasting everyone's time.
There's No End To Manual Human Intervention
- Teams rely on human reviewers to gatekeep LLM outputs before they reach users in production, but the process is ad hoc, unstructured, and never-ending.
- No automation to focus reviewers on high-risk areas or repetitive tasks.
Confident AI solves all of your LLM evaluation problems so you can stop going around in circles. Here's a diagram outlining how Confident AI works:
Installation
Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:
python3 -m venv venv
source venv/bin/activate
In your newly created virtual environment, run:
pip install -U deepeval
We always recommend keeping deepeval updated to its latest version to use Confident AI.
Login to Confident AI
Everything in deepeval is already automatically integrated with Confident AI, including any custom metrics you've built on deepeval. To start using Confident AI with deepeval, simply log in via the CLI:
deepeval login
Follow the instructions displayed on the CLI (to create an account, get your Confident API key, paste it in the CLI), and you're good to go.
You can also log in directly in Python if you already have a Confident API key:
import deepeval
deepeval.login_with_confident_api_key("your-confident-api-key")
Or, via the CLI:
deepeval login --confident-api-key "your-confident-api-key"
Set Up Your Evaluation Model
You can also use ANY custom LLM of your choice, although we DON'T recommend it for this quickstart guide, since some custom models are especially prone to failing to output valid JSON.
You'll need to set your OPENAI_API_KEY as an environment variable before running an evaluation, since the metrics we'll be using are LLM-evaluated metrics.
export OPENAI_API_KEY=<your-openai-api-key>
Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:
%env OPENAI_API_KEY=<your-openai-api-key>
Please do not include quotation marks when setting your OPENAI_API_KEY in a notebook environment, as that is invalid syntax.
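If you prefer to set the key from within Python itself (for example at the top of your script), here's a minimal sketch; it simply uses the standard library to set the variable for the current process, and the placeholder value is yours to replace:
import os

# Set the key for the current process only; replace the placeholder with your
# actual OpenAI API key before running an evaluation.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"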
You can also run evaluations on Confident AI using our models, but that's a more advanced topic for later on in this documentation.
Run Your First Evaluation
Now that you're logged in, create a Python file, for example experiment_llm.py. We're going to be evaluating a medical chatbot for this quickstart guide, but it can be any other LLM system you're building.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
# See above for contents of fake data
fake_data = [...]
# Create a list of LLMTestCase
test_cases = []
for fake_datum in fake_data:
    test_case = LLMTestCase(
        input=fake_datum["input"],
        actual_output=fake_datum["actual_output"],
        retrieval_context=fake_datum["retrieval_context"]
    )
    test_cases.append(test_case)
# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)
# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness])
In reality, you'll want to generate actual_outputs and retrieval_contexts at evaluation time. This is because, by keeping a static dataset of inputs as a list of LLMTestCases, you'll be able to compare which of two versions of your LLM application performs better (see the sketch below).
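As a rough illustration, here's a minimal sketch of evaluation-time generation. The query_llm_app() helper is a hypothetical stand-in for your own generation and retrieval logic (it is not part of deepeval), and the example inputs are made up:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Hypothetical helper standing in for your LLM application; replace this with
# your own generation and retrieval logic.
def query_llm_app(user_input: str) -> tuple[str, list[str]]:
    actual_output = "..."        # the answer your LLM system generates
    retrieval_context = ["..."]  # the chunks it retrieved from your knowledge base
    return actual_output, retrieval_context

# Static inputs you want to benchmark against (illustrative only)
inputs = [
    "Do I need to see a doctor for a mild fever?",
    "What are common side effects of ibuprofen?"
]

# Build test cases at evaluation time, so each run reflects your latest prompts and models
test_cases = []
for user_input in inputs:
    actual_output, retrieval_context = query_llm_app(user_input)
    test_cases.append(
        LLMTestCase(
            input=user_input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )

evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(threshold=0.5), FaithfulnessMetric(threshold=0.5)]
)
This way, each test run on Confident AI reflects the version of your LLM application that produced the outputs, which is what makes side-by-side comparisons meaningful.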
Finally, run experiment_llm.py to have Confident AI benchmark your LLM application for you:
python experiment_llm.py
Congratulations 🎉! You just ran your first evaluation, which created a test run on Confident AI. Before diving into the platform, let's break down what happened.
- We looped through the fake_data dataset and created a list of LLMTestCases.
- The LLMTestCase variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
- The LLMTestCase variable retrieval_context contains the retrieved context from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) and FaithfulnessMetric(threshold=0.5) are default metrics provided by deepeval for you to evaluate your LLM output's relevancy based on the provided retrieval context.
- All metric scores range from 0 to 1, and the threshold=0.5 threshold ultimately determines whether each metric has passed or not. A test case only passes if all of its metrics pass.
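If you'd like to see how a single metric scores a test case outside of evaluate(), here's a minimal sketch using deepeval's standalone metric execution; the medical chatbot data below is made up for illustration:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Illustrative test case (made-up medical chatbot data)
test_case = LLMTestCase(
    input="What should I do for a mild headache?",
    actual_output="For a mild headache, rest, stay hydrated, and consider an over-the-counter pain reliever such as ibuprofen.",
    retrieval_context=["Ibuprofen and acetaminophen are common over-the-counter options for mild headaches."]
)

metric = AnswerRelevancyMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score)            # a score between 0 and 1
print(metric.is_successful())  # True only if the score meets the 0.5 threshold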
🚨 But unfortunately, not all test cases have passed. Can you click on View Test Case Details on the failing test case to figure out why?
Failing test cases represent areas of improvement. By improving your LLM application through various means, such as writing better prompts, using better tool calls, or even fine-tuning a custom model, you'll be able to turn failing test cases into passing ones.
In the next section, we'll learn how to create and use a dataset on Confident AI so we can move away from fake_data.