No LLM is an island

A successful LLM is part of a tech team

Steve Jones
5 min read · Apr 2, 2024

Remember those heady days when all you thought you needed was an LLM and a dream to solve all your problems? Today we’ve come to face the reality that while prompt tuning is important, it is only one of many important techniques and tools, and even the tuning itself is something we increasingly use tools to do. One of the major challenges for companies is building up the right technologies around the LLM that it needs to be successful.

Diagram: the LLM at the center, supported by eight surrounding components, each pointing to it: “Prompt Optimization,” “Guardrails,” “FinOps,” “Model Evaluation,” “GPU Optimization,” “Provisioning,” “Fine Tuning,” and “Vector Store/RAG”.

RAGs to Riches

Picking the right RAG (Retrieval Augmented Generation) technology and approach is pretty important, and equally important is making sure that if someone has already created a RAG over a set of data, you know about it. Part of picking the “right” technology is having the experience required to develop, support and maintain the technology and the solutions built on it. Having 98 different RAG technologies because each one is ‘perfect’ for a specific use case rapidly becomes a nightmare of information and technology integration and support.

For instance, securing data within RAG so information access is properly linked back to corporate policies requires a consistency and industrialization that will not happen if everyone is rolling their own. You also want to make sure that as the business turns its documents into information via RAG, that information is accessible across the organization.

An example of vectors and a graph on a document

This means that, unlike traditional application-specific databases, we should think of RAG as being context specific, which is not the same thing. If, for instance, the tech support team has created a RAG over the various support documents, then we want the customer service people to be able to use that in the context of providing support. This means that as we create RAG databases we need to be clear not just on their construction, but also able to clearly share the context of that construction. The same raw information might be used within multiple different contexts, each with its own RAG for that context. This means having a consistency of RAG approaches so they can be used consistently across the business, combining multiple contexts into a coherent whole.
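As a rough illustration of what “sharing the context of construction” could look like, here is a minimal sketch of a shared RAG registry in Python. The names (`RagCollection`, `RagRegistry`), the fields and the example values are assumptions for illustration, not any specific product’s API; a real version would sit over your vector store and identity provider.

```python
from dataclasses import dataclass, field


@dataclass
class RagCollection:
    name: str                 # e.g. "support-docs"
    source: str               # where the raw documents come from
    context: str              # the business context this RAG was built for
    allowed_roles: set[str]   # access policy linked back to corporate roles
    embedding_model: str      # recorded so consumers know how it was built


@dataclass
class RagRegistry:
    collections: dict[str, RagCollection] = field(default_factory=dict)

    def register(self, collection: RagCollection) -> None:
        if collection.name in self.collections:
            raise ValueError(f"{collection.name} already exists - reuse it instead")
        self.collections[collection.name] = collection

    def find_for_role(self, role: str) -> list[RagCollection]:
        # Discovery: which existing RAGs can this team legitimately reuse?
        return [c for c in self.collections.values() if role in c.allowed_roles]


registry = RagRegistry()
registry.register(RagCollection(
    name="support-docs",
    source="confluence://tech-support",          # hypothetical source location
    context="resolving customer support tickets",
    allowed_roles={"tech-support", "customer-service"},
    embedding_model="example-embedding-v1",      # hypothetical model name
))

# Customer service discovers and reuses the tech-support RAG in its own context.
for collection in registry.find_for_role("customer-service"):
    print(collection.name, "-", collection.context)
```

The point is that another team can discover an existing RAG, see the context it was built for, and check whether its access policy allows reuse, rather than building a 99th pipeline of their own.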

Quis custodiet ipsos custodes? — Guardrails at scale

The next part of the team is the guardrails: the things that make sure that both the inputs to and outputs from the LLM are validated, and that ensure the LLM operates within the bounds that it should. Some of these guardrails will be specific to a given task, but some will be common across tasks, and even potentially common across the organization. This means two things:

  1. We need a consistent way to apply guardrails
  2. We need a library of our guardrails

Image: two shields, labeled “Input Guards” and “Output Guards”, flanking the LLM core.

The first ensures that we don’t have the same guardrail implemented differently, and thus producing different results. The second enables us not only to reuse guardrails across solutions, but also to audit and verify their use. By being able to determine which guardrails are assigned to a given solution, we can add processes that automatically verify whether the guardrails are sufficient, whether inputs are being validated properly, and whether outputs are being protected and filtered.

Allowing every team across the business to bury guardrails within their point solutions is a significant risk to any organization. So centralizing and governing this across all implementations becomes essential.
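Here is a minimal sketch of what a central guardrail library could look like, assuming hypothetical names (`Guardrail`, `GuardrailRegistry`) and deliberately toy checks. The point is that every solution declares, by name, which shared guardrails it uses, which is what makes auditing possible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Guardrail:
    name: str
    applies_to: str                  # "input" or "output"
    check: Callable[[str], bool]     # returns True if the text passes


class GuardrailRegistry:
    def __init__(self) -> None:
        self._guardrails: dict[str, Guardrail] = {}

    def register(self, guardrail: Guardrail) -> None:
        # One definition per name, so the same guardrail can't drift between teams.
        self._guardrails[guardrail.name] = guardrail

    def assigned(self, names: list[str]) -> list[Guardrail]:
        # A solution declares its guardrails by name, which makes them auditable.
        return [self._guardrails[n] for n in names]


def run_guardrails(text: str, stage: str, guardrails: list[Guardrail]) -> list[str]:
    """Return the names of guardrails that fail for this stage ("input" or "output")."""
    return [g.name for g in guardrails if g.applies_to == stage and not g.check(text)]


registry = GuardrailRegistry()
registry.register(Guardrail("max-input-length", "input", lambda text: len(text) < 4000))
registry.register(Guardrail("no-internal-codenames", "output",
                            lambda text: "PROJECT-X" not in text.upper()))

# A point solution pulls its guardrails from the shared library rather than re-implementing them.
assigned = registry.assigned(["max-input-length", "no-internal-codenames"])
print(run_guardrails("What is our refund policy?", "input", assigned))   # -> []
```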

Let me translate for you — Prompt Optimization

When you ask ChatGPT to create an image and it gives you one, then ask it “what prompt did you use to create that image?”. You’ll find that your initial request was re-written in a way that, in the opinion of ChatGPT’s creators, was more likely to get the correct result. Prompt Optimization takes many forms, from reducing the number of input tokens while retaining the meaning, and reducing the number of output tokens to get a more concise answer (both of which help manage cost), through to tuning the prompt for a specific question or LLM to improve the likelihood of a valid answer.

Prompt Optimization is therefore a combination of skill, experience and technology. There are tools being created specifically to improve LLM outputs based on inputs, and specific techniques for doing so. Additionally, as models change, those optimizations must keep pace.

Prompt Optimization is part of ensuring that solutions across the organization are working efficiently and effectively, and sharing information across multiple implementations is essential to keep ahead of changes.
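As a toy illustration of the cost side of this, here is a sketch of a single optimization pass. The whitespace word handling and the `rewrite_prompt` helper are assumptions for illustration; real prompt optimizers use the model’s own tokenizer and far more sophisticated, often learned, rewrites.

```python
# Deliberately naive: this only shows the two cost levers mentioned above,
# trimming the input and constraining the length of the output.
FILLER_WORDS = {"please", "kindly", "basically", "really", "just"}


def rewrite_prompt(user_request: str, max_output_words: int = 120) -> str:
    # Lever 1: trim input tokens while retaining the meaning.
    trimmed = " ".join(
        w for w in user_request.split()
        if w.lower().strip(",.") not in FILLER_WORDS
    )
    # Lever 2: constrain output tokens by asking explicitly for a concise answer.
    return f"{trimmed}\n\nAnswer in at most {max_output_words} words."


original = "Please could you basically explain, really simply, how our returns process works?"
print(rewrite_prompt(original))
```

A few extra instruction tokens usually buy a much shorter, cheaper response; the broader point is that this is a tooling problem as much as a wordsmithing one.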

Time for your LLM performance appraisal

As more LLMs are used in more solutions, the need to understand their performance in context becomes fundamental to future improvement. If the metrics and measurements are isolated within each solution and have no common baseline, there will be no way to understand whether an LLM is performing better or worse than expected in a given scenario, whether the prompt optimizations are working as expected, or whether you could switch out the current LLM for a cheaper one without impacting performance.

Image: in a dimly lit, futuristic office, a disheartened humanoid robot receives a performance appraisal from a human supervisor pointing at a holographic display of metrics.

LLMs therefore need to be measured consistently, with common baselines as well as task-specific metrics, and all of this needs to be available across the business to help tune and improve LLMs wherever they are used.
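A minimal sketch of what a common baseline could look like in practice, assuming hypothetical names (`COMMON_METRICS`, `EvalResult`) and illustrative numbers; a real setup would push these records into a shared metrics store rather than a local list.

```python
from dataclasses import dataclass
from datetime import date

# Metrics every solution must report, so models are comparable across the business.
COMMON_METRICS = ("answer_accuracy", "latency_ms", "cost_per_1k_requests")


@dataclass
class EvalResult:
    model: str
    solution: str               # which business solution ran the evaluation
    metrics: dict[str, float]   # common baseline plus any task-specific metrics
    run_date: date

    def __post_init__(self) -> None:
        missing = [m for m in COMMON_METRICS if m not in self.metrics]
        if missing:
            raise ValueError(f"missing common baseline metrics: {missing}")


results = [
    EvalResult("model-a", "support-bot",
               {"answer_accuracy": 0.91, "latency_ms": 820.0,
                "cost_per_1k_requests": 4.10, "ticket_deflection": 0.37},
               date(2024, 3, 28)),
    EvalResult("model-b", "support-bot",
               {"answer_accuracy": 0.89, "latency_ms": 410.0,
                "cost_per_1k_requests": 1.20, "ticket_deflection": 0.35},
               date(2024, 3, 28)),
]

# With a shared baseline, "could we switch to a cheaper model?" becomes a query, not a guess.
cheapest_good_enough = min(
    (r for r in results if r.metrics["answer_accuracy"] >= 0.88),
    key=lambda r: r.metrics["cost_per_1k_requests"],
)
print(cheapest_good_enough.model)   # -> model-b
```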

RAISEing a team

This need to assemble the team around the LLM is another reason that RAISE was built: not just for the task of authorizing, managing and selecting LLMs, but also for the task of ensuring that those LLMs are surrounded by the rest of the technical team they need, and that this team is managed in a consistent way across the business. RAISE was developed because LLMs are not a standalone solution, and the future of AI isn’t about single point models operating in isolation.

Image: the RAISE services library, grouped into Security, Optimization, Agents, and Core, with services such as ‘Prompt Security Classification’ under Security and ‘LLM cost optimization’ under Optimization, among others.

Being in control of LLMs means ensuring you always put the best team around the LLM, for that LLM, in that context. This is why RAISE was built.
