Custom Scores via API/SDKs

Langfuse gives you full flexibility to ingest custom scores via the Langfuse SDKs or API. If scores must follow a specific schema, you can define a score configuration (config) in the Langfuse UI or via our API and reference it when ingesting scores. This scoring workflow lets you run custom quality checks on the output of your workflows at runtime, or run custom human evaluation workflows.

Example use cases:

Deterministic rules at runtime: e.g. check whether the output contains a certain keyword, adheres to a specified structure or format, or exceeds a certain length.
Custom internal workflow tooling: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into Langfuse, optionally following your custom schema by referencing a config.
Automated data pipeline: continuously monitor quality by fetching traces from Langfuse, running custom evaluations, and ingesting the resulting scores back into Langfuse.
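The deterministic runtime rules mentioned above can be sketched as plain Python checks. This is a minimal illustration, not a Langfuse feature: the helper names, the keyword `refund`, and the 500-character limit are hypothetical. Each check returns `0` or `1` so that its result could be ingested as a `BOOLEAN` score.

```python
import json


def keyword_score(output: str, keyword: str) -> int:
    """1 if the keyword appears in the output (case-insensitive), else 0."""
    return 1 if keyword.lower() in output.lower() else 0


def valid_json_score(output: str) -> int:
    """1 if the output parses as JSON (a structure/format check), else 0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0
    return 1


def length_score(output: str, max_chars: int) -> int:
    """1 if the output stays within max_chars characters, else 0."""
    return 1 if len(output) <= max_chars else 0


def run_checks(output: str) -> dict[str, int]:
    """Run all hypothetical rules; each entry maps a score name to a 0/1 value."""
    return {
        "mentions_refund": keyword_score(output, "refund"),
        "valid_json": valid_json_score(output),
        "within_500_chars": length_score(output, 500),
    }
```

Each entry could then be sent with the SDK call shown later on this page, e.g. `langfuse.score(trace_id=..., name=name, value=value, data_type="BOOLEAN")`.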

How to add scores

You can add scores via the Langfuse SDKs or API. When ingesting scores, you can set their data type to numeric, categorical, or boolean. If you'd like to ensure that your scores follow a specific schema, you can define a score config in the Langfuse UI or via our API.

When ingesting scores, you may provide:

Value: required, of type string or float
Data Type: optional, of type NUMERIC, CATEGORICAL or BOOLEAN
Config Id: optional, of type string

Certain score properties can be inferred from your input. If you don't provide a data type, it is always inferred; see the tables below for details. For boolean and categorical scores, the score value is provided in both numeric and string format where possible. The representation you did not provide as input (the translated value) is called the inferred value in the tables below. On read, boolean scores return both the numeric and the string representation of the value, e.g. both 1 and True. Categorical scores always return the string representation; a numeric mapping of the category is produced only if a score config was provided.
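The read-side behavior described above can be modeled with a small helper. This is a sketch under stated assumptions, not Langfuse's implementation: `read_representations` is hypothetical, `category_map` stands in for the label-to-number mapping a categorical config would supply, and the returned tuple is `(numeric_value, string_value)`.

```python
from typing import Optional, Union


def read_representations(
    value: Union[float, str],
    data_type: str,
    category_map: Optional[dict[str, float]] = None,
) -> tuple[Optional[float], Optional[str]]:
    """Return (numeric, string) representations as this sketch assumes they are read back."""
    if data_type == "BOOLEAN":
        # Boolean scores are ingested as 0 or 1; the string form is inferred.
        return float(value), "True" if value == 1 else "False"
    if data_type == "CATEGORICAL":
        # The label is always available; a numeric mapping exists only with a config.
        numeric = category_map.get(value) if category_map else None
        return numeric, str(value)
    # NUMERIC: only the numeric representation exists.
    return float(value), None
```

For example, `read_representations(1, "BOOLEAN")` gives `(1.0, "True")`, while a categorical `"depth"` only gains a number when a mapping such as `{"depth": 4}` is passed.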

Score ingestion: provide value only

Value	Data Type	Config Id	Description	Inferred Data Type	Inferred Value representation	Valid
`1`	`Null`	`Null`	Data type is inferred	`NUMERIC`	`Null`	Yes
`depth`	`Null`	`Null`	Data type is inferred	`CATEGORICAL`	`Null`	Yes
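The value-only inference in the table above can be sketched as follows, assuming a JSON number arrives as a Python `int`/`float` and a JSON string as `str`. `infer_data_type` is a hypothetical helper, not part of the Langfuse SDK; matching the two rows above, a value alone never yields `BOOLEAN` here.

```python
from typing import Union


def infer_data_type(value: Union[float, str]) -> str:
    """Infer a score data type from the value alone (sketch of the table above)."""
    # bool is a subclass of int in Python; booleans must be sent as 0/1 with data_type="BOOLEAN".
    if isinstance(value, bool):
        raise TypeError("pass boolean scores as 0 or 1 with data_type='BOOLEAN'")
    if isinstance(value, (int, float)):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"unsupported score value type: {type(value).__name__}")
```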

Score ingestion: provide data type and no config

Value	Data Type	Config Id	Description	Inferred Value representation	Valid
`1`	`NUMERIC`	`Null`	No properties inferred	`Null`	Yes
`depth`	`CATEGORICAL`	`Null`	No properties inferred	`Null`	Yes
`1`	`BOOLEAN`	`Null`	String value representation inferred	`True`	Yes
`depth`	`NUMERIC`	`Null`	Error: data type of value does not match provided data type		No
`1`	`CATEGORICAL`	`Null`	Error: data type of value does not match provided data type		No
`true`	`BOOLEAN`	`Null`	Error: boolean data type expects numeric input value		No
`3`	`BOOLEAN`	`Null`	Error: boolean data type expects `0` or `1` as input value		No
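The validation rules in this table can be expressed as one function that returns the table's error message, or `None` for a valid score. This is a hypothetical client-side pre-check that mirrors the rows above; Langfuse performs its own validation on ingestion, and its exact error strings may differ.

```python
from typing import Any, Optional

MISMATCH = "data type of value does not match provided data type"


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_value(value: Any, data_type: str) -> Optional[str]:
    """Return an error message mirroring the table above, or None if the score is valid."""
    if data_type == "NUMERIC":
        return None if _is_number(value) else MISMATCH
    if data_type == "CATEGORICAL":
        return None if isinstance(value, str) else MISMATCH
    if data_type == "BOOLEAN":
        if not _is_number(value):
            return "boolean data type expects numeric input value"
        if value not in (0, 1):
            return "boolean data type expects 0 or 1 as input value"
        return None
    return f"unknown data type: {data_type}"
```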

Score ingestion: provide config

Whenever you provide a config, the score data will be validated against the config. The following rules apply:

Score Name: Must equal the config's name
Score Data Type: When provided, must match the config's data type
Score Value: Must match the config's data type and be within the config's value range:
- Numeric: Value must be within the min and max values defined in the config. Both bounds are optional; an unset min or max defaults to -∞ or +∞, respectively
- Categorical: Value must map to one of the categories defined in the config
- Boolean: Value must equal 0 or 1

Value	Data Type	Config Id	Description	Inferred Data Type	Inferred Value representation	Valid
`1`	`Null`	`78545`	Data type inferred	`NUMERIC`	`Null`	Conditional on config validation
`depth`	`Null`	`12345`	Data type and numeric value inferred	`CATEGORICAL`	`4` numeric config category mapping	Conditional on config validation
`1`	`NUMERIC`	`78545`	No properties inferred		`Null`	Conditional on config validation
`depth`	`CATEGORICAL`	`12345`	Numeric value inferred		`4` numeric config category mapping	Conditional on config validation
`1`	`BOOLEAN`	`93547`	String value inferred		`True`	Conditional on config validation
`depth`	`NUMERIC`	`78545`	Error: data type of value does not match provided data type			No
`1`	`CATEGORICAL`	`12345`	Error: data type of value does not match provided data type			No
`true`	`BOOLEAN`	`93547`	Error: boolean data type expects numeric input value			No
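The config rules listed above can be sketched as a pre-check against a config dictionary that uses the attribute names from the score config table (`name`, `dataType`, `minValue`, `maxValue`, `categories`). `validate_against_config` is hypothetical, and treating the min/max bounds as inclusive is an assumption of this sketch.

```python
import math
from typing import Any, Optional


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_against_config(
    name: str, value: Any, config: dict, data_type: Optional[str] = None
) -> Optional[str]:
    """Return an error message, or None if the score satisfies the config."""
    if name != config["name"]:
        return "score name must equal the config's name"
    if data_type is not None and data_type != config["dataType"]:
        return "score data type must match the config's data type"
    if config["dataType"] == "NUMERIC":
        if not _is_number(value):
            return "value must be numeric"
        # Unset bounds default to -inf / +inf; bounds are treated as inclusive (assumption).
        low = config.get("minValue")
        high = config.get("maxValue")
        low = -math.inf if low is None else low
        high = math.inf if high is None else high
        if not (low <= value <= high):
            return "value is outside the config's min/max range"
        return None
    if config["dataType"] == "CATEGORICAL":
        labels = {c["label"] for c in config.get("categories") or []}
        if not isinstance(value, str) or value not in labels:
            return "value must map to one of the config's categories"
        return None
    if config["dataType"] == "BOOLEAN":
        if not _is_number(value) or value not in (0, 1):
            return "boolean value must equal 0 or 1"
        return None
    return f"unknown config data type: {config['dataType']}"
```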

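```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment

# `message` stands for your application object that holds the trace and generation IDs
langfuse.score(
    trace_id=message.trace_id,
    observation_id=message.generation_id,  # optional
    name="depth",
    value="Good",  # required, can be passed as string or float
    comment="Factually correct",  # optional
    id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="12345-6565-3453654-43543",  # optional, to ensure that the score follows a specific schema
    data_type="CATEGORICAL",  # optional, possibly inferred
)
```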

→ More details in Python SDK docs

Creating a score config in Langfuse

A score config includes the desired score name, data type, and constraints on the score value range, such as min and max values for numeric data types and custom categories for categorical data types. See the API reference for details on the POST/GET score config endpoints. Configs ensure that scores comply with a specific schema, which standardizes them for future analysis.

Attribute	Type	Description
`id`	string	Unique identifier of the score config.
`name`	string	Name of the score config, e.g. user_feedback, hallucination_eval
`dataType`	string	Can be either `NUMERIC`, `CATEGORICAL` or `BOOLEAN`
`isArchived`	boolean	Whether the score config is archived. Defaults to false
`minValue`	number	Optional: Sets minimum value for numerical scores. If not set, the minimum value defaults to -∞
`maxValue`	number	Optional: Sets maximum value for numerical scores. If not set, the maximum value defaults to +∞
`categories`	list	Optional: Defines categories for categorical scores. List of objects with `label` and `value` pairs
`description`	string	Optional: Provides further description of the score configuration
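A request body for creating a config can be assembled from the attributes in the table above. This is a minimal sketch: `build_score_config` is a hypothetical helper, and it omits `id` and `isArchived` on the assumption that the server assigns or defaults them. It only builds the dictionary and does not send anything.

```python
from typing import Optional

DATA_TYPES = {"NUMERIC", "CATEGORICAL", "BOOLEAN"}


def build_score_config(
    name: str,
    data_type: str,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    categories: Optional[dict[str, float]] = None,
    description: Optional[str] = None,
) -> dict:
    """Build a JSON body using the attribute names from the config table."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unsupported data type: {data_type}")
    payload: dict = {"name": name, "dataType": data_type}
    if data_type == "NUMERIC":
        if min_value is not None and max_value is not None and min_value > max_value:
            raise ValueError("minValue must not exceed maxValue")
        # Unset bounds are omitted, leaving them at -inf / +inf.
        if min_value is not None:
            payload["minValue"] = min_value
        if max_value is not None:
            payload["maxValue"] = max_value
    if data_type == "CATEGORICAL" and categories:
        # Categories are sent as a list of label/value objects.
        payload["categories"] = [
            {"label": label, "value": value} for label, value in categories.items()
        ]
    if description is not None:
        payload["description"] = description
    return payload
```

The resulting dictionary would then be POSTed as JSON to the score configs endpoint (assumed here to be `/api/public/score-configs`, authenticated with your public and secret key); check the exact path and required fields against the API reference.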

Data pipeline example

You can run custom evaluations on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as scores back to the traces in Langfuse.
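The fetch, evaluate, and score-back loop can be sketched with the Langfuse calls injected as parameters, so that the evaluation logic stays testable without network access. `TraceRecord`, `evaluate_output`, `run_pipeline`, and the score name `non_empty_output` are hypothetical; the only assumption about fetched traces is that they expose `id` and `output` attributes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class TraceRecord:
    """Hypothetical minimal view of a fetched trace: only the fields this sketch reads."""
    id: str
    output: Any


def evaluate_output(output: Any) -> int:
    """Hypothetical evaluation: 1 if the trace produced a non-empty output, else 0."""
    if isinstance(output, str):
        return 1 if output.strip() else 0
    return 1 if output else 0


def run_pipeline(
    traces: Iterable[Any], score: Callable[..., Any], name: str = "non_empty_output"
) -> int:
    """Evaluate each trace and write the result back as a boolean score; return the count sent."""
    sent = 0
    for trace in traces:
        score(
            trace_id=trace.id,
            name=name,
            value=evaluate_output(trace.output),
            data_type="BOOLEAN",
        )
        sent += 1
    return sent
```

With the Python SDK, you might pass `langfuse.score` as `score` and, assuming your SDK version exposes `fetch_traces`, use `langfuse.fetch_traces(limit=50).data` as `traces`; verify both against the SDK reference for your version.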