To read about how to use the Evaluations feature, see the Evaluations User Guide.
UI-First Evaluations
PromptLayer’s evaluation system is UI-first. We have all the features to run evals via code, but we built the system from the ground up to be fully featured and functional in the UI. Why? We want evals to be a collaborative experience with multiple stakeholders, technical and non-technical. This fits with our product ethos: interacting with LLMs is a new form of work that needs to be collaborative, transparent, and inclusive of multiple stakeholders.

How It Works
An evaluation is a pipeline of steps (columns) that you run on a series of data. It lets you run a repeatable set of steps on multiple rows and combine the results into a score. We’ve spent a lot of time enabling all types of evaluations, from super simple to super sophisticated, to be successful on PromptLayer. You can read more about how it works and how to create your own evals in the Evaluations User Guide.

Ad-Hoc vs Templated Evals
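The pipeline-of-columns model can be sketched in plain Python. This is not PromptLayer’s API; the names (`run_eval`, `word_count`, `has_greeting`) are hypothetical, chosen only to illustrate how column steps run over rows and combine into a score.

```python
# Hypothetical sketch of an eval pipeline: columns (steps) applied to
# every row, with the per-row results combined into a single score.

def word_count(row):
    # Step: count the words in this row's output.
    return len(row["output"].split())

def has_greeting(row):
    # Step: does the output open with a greeting?
    return row["output"].lower().startswith("hello")

def run_eval(rows, columns, score):
    """Apply each column to every row, then combine rows into one score."""
    results = []
    for row in rows:
        computed = {name: fn(row) for name, fn in columns.items()}
        results.append({**row, **computed})
    return results, score(results)

rows = [
    {"output": "Hello, thanks for reaching out!"},
    {"output": "Your ticket has been escalated."},
]
columns = {"word_count": word_count, "has_greeting": has_greeting}

# Score = fraction of rows whose output opens with a greeting.
results, overall = run_eval(
    rows, columns,
    score=lambda rs: sum(r["has_greeting"] for r in rs) / len(rs),
)
```

Because every column is just a function of a row, the same pipeline re-runs unchanged on any new batch of rows, which is what makes the steps repeatable.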
While you can use evals for any use case, we have a few mental models we like to share.

Ad-hoc evals answer one specific question. Maybe you want to analyze some production data or try out a new model. These are one-off. We have also found customers running ad-hoc evals that are not really evals but more like data analysis or data exploration: for example, taking a subset of production data and running a prompt on it to categorize the type of query. This is a one-off experiment that leverages PromptLayer’s easy-to-use evals platform and the fact that we have access to your production data.

Templated evals follow a specific rubric that your team re-runs over and over, either nightly or as part of a big product or feature push. These are more structured and defined, with specific metrics.

Our Take: Use Assertions
Whatever type of eval you create, we’ve found the best rubrics are a series of true/false questions. Our LLM Assertion column is perfect for this, but you can create your own rubric too. Basically, your team should come up with a series of yes/no questions to ask of your data. For example:

- Does the output have at least one header?
- Does it reference the correct customer’s name?
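An assertion-style rubric like the one above can be sketched as a set of boolean checks. This is an illustrative sketch, not PromptLayer’s LLM Assertion column; the helper names (`has_header`, `references_name`, `score_output`) are assumptions for the example.

```python
# Hypothetical assertion rubric: each check answers one yes/no question
# about a single output, and the score is the fraction of checks passing.

def has_header(output: str) -> bool:
    # "Does the output have at least one header?" (markdown-style '#')
    return any(line.startswith("#") for line in output.splitlines())

def references_name(output: str, customer_name: str) -> bool:
    # "Does it reference the correct customer's name?"
    return customer_name.lower() in output.lower()

def score_output(output: str, customer_name: str) -> float:
    checks = [has_header(output), references_name(output, customer_name)]
    return sum(checks) / len(checks)

output = "# Order update\nHi Dana, your refund was processed."
score = score_output(output, "Dana")  # both assertions pass here
```

Keeping every question strictly yes/no is what makes the rubric easy to re-run and compare across batches: each row's score is just a pass rate, with no subjective scale to calibrate.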

