A summary of some AI application experience, followed by an introduction to LangSmith, Langfuse, and Dify.
Preface
A new AI project has been kicked off, and based on the pain points of last year's AI project we are looking for a set of tools and services we can adopt quickly to build the product. The guiding principle is open source first: partly to save cost (both money and people's time), and partly so the team gets a feel for the underlying processes and can better evaluate and purchase commercial solutions later on.
By the way, we are a Chinese team, so complying with Chinese regulations is important.
In last year's AI group-buying assistant (which turned free-form WeChat group-buy messages into structured orders), we implemented the entire prompt logic chain and the evaluation standard in pure code. The pain points we ran into include:
Datasets were maintained in Feishu documents
- Keeping product, operations, and R&D in sync was done verbally.
- Labeling the dataset and updating fields was time-consuming; fortunately the total volume was only a few hundred entries, yet even that tied up a significant amount of manpower.
Version management of prompts and models
- Versioning of prompts and models relied mainly on discipline in the code. If R&D forgot to update a version, it simply stayed wrong; for example, a prompt was modified but the test results ended up containing two different versions both labeled 3.2.2.
- Our logs were plain code output, and reviewing test results and comparing prompts was hard on the eyes. In other words, visualization is of great value for review.
Switching between large models was inconvenient, so nobody was keen to do it
- Every model we wanted to try required R&D to integrate it, which was fairly labor-intensive. As a result, people would first watch what others in the community were saying and only run a model after someone had praised it.
- In practice this was not a big problem last year, because the commercial GPT-3.5 was unquestionably the best value on the market. This year the situation has changed completely: more and more teams are switching to Claude, and, given our need for domestic compliance approval, we also have to compare several domestic models.
Automated testing of evaluation standards
- Each batch of effect-evaluation tests (50 entries) takes about ten minutes per model, and running the large test set (200 entries) takes about half an hour. This means prompt engineers who tweak a few cases either sit in an awkward asynchronous state while waiting, or simply hold off until the results come back before continuing.
- Logs have the same problem: although we produce a summary output, carefully comparing cases still means going back to the original dataset text. For example, after one version change 17 rows of data were wrong and at the same time 18 rows were wrong, and at first we could not tell which type of error had increased with this version and which had decreased. We got there gradually by defining error types and adding error-classification statistics to the automated tests. That part used to be hard-coded; today I would hand the classification to an AI. In short, turning this into tooling is very necessary.
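To make the last point concrete, here is a minimal sketch of handing error classification over to an AI, assuming an OpenAI-compatible SDK; the model name, error taxonomy, and helper functions are placeholders I made up for illustration, not our actual pipeline.

```python
# Sketch: classify evaluation failures with an LLM instead of hard-coded rules.
# Assumes an OpenAI-compatible SDK; the model name, error taxonomy, and helpers
# below are hypothetical placeholders, not our actual pipeline.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ERROR_TYPES = ["missing field", "wrong quantity", "wrong price", "formatting", "other"]

def classify_error(expected: str, actual: str) -> str:
    """Ask the model which predefined error type best describes the mismatch."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the difference between the expected and actual order "
                        f"output into exactly one of: {', '.join(ERROR_TYPES)}. "
                        "Reply with the label only."},
            {"role": "user", "content": f"Expected:\n{expected}\n\nActual:\n{actual}"},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in ERROR_TYPES else "other"

def summarize_failures(failures: list[tuple[str, str]]) -> Counter:
    """failures: (expected, actual) pairs that did not match; returns counts per error type."""
    return Counter(classify_error(exp, act) for exp, act in failures)
```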
Tracking and evaluating response latency and cost
- We did not have this before. Previously, because Azure's service was slow and intermediate results were unreliable, the product flow, worked out together with the technical side, broke the multi-step chain into separate user interfaces; in other words, the intermediate results were exposed to users for correction. Even so, each step often took more than 2 seconds. Users in that scenario were still willing to use it, but we did not think it was good enough. Now that the technical workflow has matured, tracking and comparing the time and cost of each step is very important, especially in to-C scenarios where users have no patience; both the technology and the interaction need to optimize for this.
- Cost tracking is definitely needed, since a commercial project always has to calculate its profit margins.
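As a rough illustration of the kind of per-step tracking we mean, the sketch below records latency and token cost for each step of a chain; the price table and step names are made-up placeholders.

```python
# Sketch: record latency and token cost for each step of a multi-step chain.
# The price table, step name, and token numbers are illustrative placeholders.
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

PRICE_PER_1K_TOKENS = {"input": 0.0005, "output": 0.0015}  # placeholder pricing

@dataclass
class StepRecord:
    name: str
    latency_s: float
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        return (self.input_tokens * PRICE_PER_1K_TOKENS["input"]
                + self.output_tokens * PRICE_PER_1K_TOKENS["output"]) / 1000

@dataclass
class TraceLog:
    steps: list[StepRecord] = field(default_factory=list)

    @contextmanager
    def step(self, name: str):
        """Time one step; the caller fills in token usage via the yielded dict."""
        usage = {"input_tokens": 0, "output_tokens": 0}
        start = time.perf_counter()
        try:
            yield usage
        finally:
            self.steps.append(StepRecord(name, time.perf_counter() - start,
                                         usage["input_tokens"], usage["output_tokens"]))

# Usage:
# log = TraceLog()
# with log.step("extract_items") as usage:
#     ...call the model, then set usage["input_tokens"] / usage["output_tokens"]...
# for s in log.steps:
#     print(f"{s.name}: {s.latency_s:.2f}s, ${s.cost:.4f}")
```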
Other new potential needs include:
- Low-code creation and management of AI applications, empowering other teams within the company.
- Support for different RAG solutions.
- Richness and extensibility of tools.
LangSmith
It is primarily a platform for debugging, testing, and monitoring large language model applications. It helps developers take LLM applications from the prototype stage into production, especially for enterprise-level applications that require high reliability and performance.
- Provides the ability to quickly debug new chains, agents, or tool sets.
- Allows users to evaluate the effects of different prompts and language models.
- Supports running a given chain multiple times on a dataset to ensure quality standards.
- Captures usage traces to generate insights.
- Provides tracking capabilities for executing complex reasoning tasks.
- Supports batch data testing and evaluation.
- Includes LangSmith Hub for sharing and discovering excellent prompt templates.
License: not open source.
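To give a feel for the developer experience, below is a minimal tracing sketch using the LangSmith Python SDK; the function, prompt, and model name are placeholders, and the exact environment variables should be checked against the current LangSmith docs.

```python
# Sketch: trace one function with the LangSmith Python SDK.
# Tracing is switched on via environment variables, e.g. LANGCHAIN_TRACING_V2=true,
# LANGCHAIN_API_KEY=<key>, LANGCHAIN_PROJECT=<project>; check the current docs.
from langsmith import traceable
from openai import OpenAI

client = OpenAI()  # placeholder: any LLM client would work here

@traceable(name="organize_order")  # each call shows up as a run in the LangSmith project
def organize_order(raw_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Turn the group-buy message into a structured order."},
            {"role": "user", "content": raw_text},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(organize_order("2 boxes of strawberries for Alice, 1 for Bob"))
```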
Langfuse
https://github.com/langfuse/langfuse
It is primarily an observability and analytics platform for LLM applications, offering tracing, analytics, real-time monitoring, and optimization capabilities that help development and operations teams manage and maintain their applications more efficiently. Its core features include:
- Core Tracing: Detailed examination and analysis of each step of a chained application.
- Cost Tracking: Understand the application's cost and token usage, broken down by user, feature, and so on.
- Dashboard: Provides an overview of the application’s changes over time, including statistics and charts.
- Evaluation: Configurable evaluation of newly ingested entries (traces).
- Datasets: Add traces to datasets so they can later be tested against expected ("golden") outputs.
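For reference, here is a minimal sketch of recording a trace and a generation with the Langfuse Python SDK; it assumes the v2-style low-level API, and the model name, input, and token counts are placeholders.

```python
# Sketch: manually record a trace and a generation with the Langfuse Python SDK
# (v2-style low-level API; the SDK has changed over versions, so verify the calls).
# Credentials come from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST,
# where LANGFUSE_HOST can point at a self-hosted instance.
from langfuse import Langfuse

langfuse = Langfuse()

trace = langfuse.trace(name="group-buy-order", user_id="user-123")  # one user request

generation = trace.generation(
    name="organize_order",
    model="gpt-4o-mini",  # placeholder model name
    input="2 boxes of strawberries for Alice, 1 for Bob",
)

# ...call the model here, then report output and token usage so the dashboard can
# aggregate latency and cost per user, per feature, and so on...
generation.end(
    output="Alice: strawberries x2; Bob: strawberries x1",
    usage={"input": 42, "output": 17},  # token counts reported by the model
)

langfuse.flush()  # make sure queued events are sent before the process exits
```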
Dify
https://github.com/langgenius/dify
It is primarily a low-code development platform for large language model (LLM) applications that combines the ideas of Backend as a Service (BaaS) and LLMOps, allowing developers to quickly build production-grade generative AI applications; it can also serve as an enterprise-level LLM gateway for centralized management.
- Visual Workflow: Quickly create AI applications through a drag-and-drop interface.
- Model Support: Supports hundreds of models, including GPT, Mistral, Llama3, etc.
- Prompt IDE: An intuitive interface for creating prompts, comparing model performance, and using additional features such as text-to-speech to enhance applications.
- Retrieval-Augmented Generation (RAG) Engine: Covers everything from document extraction to retrieval, supporting text extraction from various document formats.
- AI Agent Framework: Use LLM function calls or ReAct to define AI agents and integrate pre-built or custom tools.
- Backend as a Service: Provides corresponding APIs for all features, making it easy to integrate into existing business logic.
- Cloud services and self-hosting options: Provides zero-setup cloud services and also supports quick setup in any environment.
License: the Apache-2.0-based license allows commercial use as a product backend, but the platform may not be offered as a SaaS.
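To show what the Backend-as-a-Service side looks like, here is a minimal sketch of calling a Dify chat application over HTTP; the base URL and API key are placeholders, and the endpoint and payload should be verified against the API reference of your Dify version.

```python
# Sketch: call a Dify application through its Backend-as-a-Service API.
# The base URL and API key are placeholders; the /chat-messages endpoint and payload
# follow Dify's documented chat API, but verify them against your deployed version.
import requests

DIFY_BASE_URL = "http://localhost/v1"  # self-hosted instance (placeholder)
DIFY_APP_API_KEY = "app-xxxxxxxx"      # per-app API key from the Dify console (placeholder)

def ask(query: str, user_id: str) -> str:
    resp = requests.post(
        f"{DIFY_BASE_URL}/chat-messages",
        headers={"Authorization": f"Bearer {DIFY_APP_API_KEY}"},
        json={
            "inputs": {},                # app-level input variables, if the app defines any
            "query": query,
            "response_mode": "blocking", # or "streaming" for server-sent events
            "user": user_id,             # lets Dify attribute usage to an end user
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["answer"]

if __name__ == "__main__":
    print(ask("Turn this group-buy message into an order: 2 boxes of strawberries", "demo-user"))
```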
Similar tools include FastGPT and Flowise.