To read about how to use the Evaluations feature, see the Evaluations User Guide.

UI-First Evaluations

PromptLayer’s evaluation system is UI-first. We have all the features to run evals via code, but we built the system from the ground up to be fully featured and functional in the UI. Why? We want evals to be a collaborative experience with multiple stakeholders—technical and non-technical. This fits with our product ethos: interacting with LLMs is a new form of work that needs to be collaborative, transparent, and inclusive of multiple stakeholders.

How It Works

An evaluation is a pipeline of steps (columns) that you run on a series of data. It lets you run a repeatable set of steps on multiple rows and combine them to create a score. We’ve spent a lot of time enabling all types of evaluations—both super simple and super sophisticated—to be successful on PromptLayer. You can read more about how it works and how to create your own evals in the Evaluations User Guide.
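To make the pipeline model concrete, here is a minimal sketch in plain Python, not the PromptLayer API: a dataset is a list of rows, each column is a step that can read earlier columns, and a final check column is combined into a score. The step names and the header check are hypothetical placeholders.

```python
# Illustrative sketch of "an evaluation is a pipeline of column steps over rows".
# This is NOT the PromptLayer API; it only models the concept described above.
from statistics import mean

def run_prompt(row: dict) -> str:
    # Placeholder for the step that calls your prompt or model on a row.
    return f"response to: {row['query']}"

def has_header(row: dict) -> bool:
    # Example check column: does the output contain a markdown header?
    return any(line.startswith("#") for line in row["output"].splitlines())

def run_eval(dataset: list[dict]) -> float:
    # Each (name, step) pair is one column; later columns can read earlier ones.
    columns = [("output", run_prompt), ("has_header", has_header)]
    for row in dataset:
        for name, step in columns:
            row[name] = step(row)
    # Combine the final check column into a single score for the run.
    return mean(1.0 if row["has_header"] else 0.0 for row in dataset)

score = run_eval([{"query": "Summarize our Q3 meeting notes"}])
print(f"Score: {score:.0%}")
```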

Ad-Hoc vs Templated Evals

While you can use evals for any use case, we have a few mental models we like to share.

Ad-hoc evals are used to answer one specific question. Maybe you want to analyze some production data or try out a new model. These are one-off. We have also found customers running ad-hoc evals that are not really evals but more like data analysis or data exploration. For example, taking a subset of production data and running a prompt on it to categorize the type of query (a sketch of this pattern follows below). This is a one-off experiment that leverages the easy-to-use evals platform on PromptLayer and the fact that we have access to your production data.

Templated evals follow a specific rubric that your team will re-run over and over, either nightly or as part of a big product or feature push. These are more structured and defined, with specific metrics.
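As a rough illustration of the ad-hoc categorization pattern mentioned above, the sketch below runs a classification prompt over a small sample of production queries. It uses the OpenAI Python client as a stand-in for whichever model you call; the model name, category list, and prompt wording are placeholder assumptions, not part of PromptLayer.

```python
# Hypothetical sketch of ad-hoc "data exploration": classify a sample of
# production queries with a prompt. Model, categories, and prompt are placeholders.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["billing", "technical support", "feature request", "other"]

def categorize(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Classify this user query as one of {CATEGORIES}. "
                       f"Reply with the category only.\n\nQuery: {query}",
        }],
    )
    return response.choices[0].message.content.strip()

production_sample = ["How do I reset my password?", "Why was I charged twice?"]
for query in production_sample:
    print(query, "->", categorize(query))
```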

Our Take: Use Assertions

Whatever type of eval you create, we’ve found the best rubrics are a series of true/false questions. Our LLM Assertion column is perfect for this, but you can create your own rubric too. Basically, your team should come up with a series of yes/no questions to ask of your data. For example:
  • Does the output have at least one header?
  • Does it reference the correct customer’s name?
One advanced trick: earlier in your eval, generate a series of assertions based on information in that row, then use those to judge the output. You can import the dynamically generated assertions into the LLM Assertion column. That said, use whatever eval type fits your needs. Assertions are just what we’ve found easiest to understand and build out.
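Here is a minimal sketch of what an assertion-style rubric boils down to, written as plain Python checks rather than the LLM Assertion column itself. The function names and row fields (customer_name, output) are hypothetical; the second assertion is generated from the row’s own data, mirroring the advanced trick above.

```python
# Minimal sketch of an assertion-style rubric: each check is a yes/no question
# about an output. Not the LLM Assertion column; field names are hypothetical.

def has_header(output: str) -> bool:
    # Does the output have at least one markdown header?
    return any(line.startswith("#") for line in output.splitlines())

def mentions_customer(output: str, customer_name: str) -> bool:
    # Does it reference the correct customer's name?
    return customer_name.lower() in output.lower()

def run_assertions(row: dict) -> dict[str, bool]:
    # One static assertion plus one generated from information in the row.
    return {
        "has_header": has_header(row["output"]),
        f"mentions {row['customer_name']}": mentions_customer(
            row["output"], row["customer_name"]
        ),
    }

row = {"customer_name": "Acme Corp",
       "output": "# Renewal summary\nAcme Corp's contract renews in March."}
print(run_assertions(row))  # {'has_header': True, 'mentions Acme Corp': True}
```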

Use Production Data

Another strong recommendation is to use production data. We believe the best data is production data. A lot of teams come to us without a dataset or an evaluation, but they have tons of production data. That is a perfect place to start. We’ve built a fully featured dataset builder that lets you build sophisticated datasets from your production data. If you’re not using it, you’re missing one of the most powerful features of PromptLayer.