Keeping up with LLM Operations

How do you keep up with model evolution and cost optimization?

Steve Jones
9 min read · Apr 24, 2024

It is hard to remember a technology that has evolved as far and as fast as LLMs, and this creates unique cost-control challenges for companies in two interesting directions. The first is the prices of the models themselves, and the second is the way in which you optimize your use of those models and their costs. All of this is complex, but great, when you are considering new solutions; once solutions have gone live, however, we are simply not used to adaptation challenges arriving at quite this rate.

These new models and cost optimizations also mean considering new approaches for guardrails, keeping your RAG and LLM aligned and making sure that your operational processes can handle the new and interesting ways that LLM solutions can fail with each new generation.

The day ends in Y, there is probably a new model

The pace of model creation is not slowing down, the number of models is huge and growing, and we are seeing the rise of “SLMs” and purpose-specific models. As a company you absolutely do not need to investigate every model that is released, but you do need three clear things (there is a small code sketch of what this might look like just after the list):

  1. A way to evaluate models on your use cases
  2. A managed way to release models for use
  3. A mechanism to add and decommission models
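
To make that concrete, here is a minimal sketch of such a registry in Python. Everything in it is an assumption for illustration (the fields, the status values, the scoring) rather than any specific product, but it captures the lifecycle: models arrive in evaluation, get released in a managed way, and get decommissioned.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """One entry in a hypothetical model registry; fields are illustrative."""
    name: str                      # e.g. "claude-3-haiku"
    provider: str                  # e.g. "anthropic"
    status: str = "evaluation"     # evaluation -> approved -> retired
    cost_per_1k_input: float = 0.0
    cost_per_1k_output: float = 0.0
    eval_scores: dict = field(default_factory=dict)   # use case -> score


class ModelRegistry:
    """Tracks which models are under evaluation, released for use, or retired."""

    def __init__(self) -> None:
        self._models: dict[str, ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._models[record.name] = record        # every model starts in evaluation

    def approve(self, name: str) -> None:
        self._models[name].status = "approved"    # the managed release step

    def decommission(self, name: str) -> None:
        self._models[name].status = "retired"     # removed from use

    def approved_for(self, use_case: str, min_score: float) -> list[ModelRecord]:
        # Only released models that passed evaluation on this specific use case.
        return [m for m in self._models.values()
                if m.status == "approved"
                and m.eval_scores.get(use_case, 0.0) >= min_score]
```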

I’ve talked about it before and this challenge really isn’t going away. You need to start thinking about your digital workforce and how you are going to manage it.

Models keep dropping, in and out

Since the start of 2024 there have been lots of movements in model availability. OpenAI decommissioned some of their older models, which they had announced last year they would be decommissioning. Anthropic announced three new Claude models, which not only offer different speeds and capabilities but operate at radically different cost points. Meta have announced Llama 3 with three different models (and cost/performance points), Mistral have announced Mistral Large, and that is just four of the players where we have seen significant changes.

And we aren’t even out of April.

So if you deployed a solution only four months ago, the odds are that the model you deployed on has either been replaced, exceeded in function, or matched in function at significantly lower cost. Taking that last piece alone, if you were running on Claude 2.1 and could use the new Claude 3 Haiku model, you are looking at over a 96% cost reduction for your solution. Being able to move from Llama 2 13B to Llama 3 8B would save you around 50%.

To use the Claude example, if you expected to make $4m in additional revenue on a $3m model cost, you can now make that same $4m for a $108,000 model cost. The business case has gone from $1m a year to roughly $3.9m a year.
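
A quick back-of-the-envelope check of those numbers; the 96.4% blended price reduction below is an assumption chosen to match the quoted $108,000 figure.

```python
# Rough check of the Claude 2.1 -> Claude 3 Haiku business case above.
revenue = 4_000_000
old_cost = 3_000_000
new_cost = old_cost * (1 - 0.964)       # assumed blended reduction, ~108,000

print(revenue - old_cost)               # 1,000,000: the original business case
print(round(revenue - new_cost))        # ~3,892,000: the new business case
```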

This is the optimization available in a single quarter.

To do this, though, you need to be able to verify that the model will perform to the same level, be able to update the solution to use the new model, and have a continuity plan in place for the cases where it doesn’t work as intended, and all of this in just three months. So while you might have a nice single-API infrastructure, like AWS Bedrock, you still need to do all the other work to decide when, where and how to bring these new models into your existing solutions.
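
As a sketch of what that decision gate might look like, assuming you already have an evaluation harness that scores a model against your own use cases: the `run_eval_suite` callable, the model identifiers and the 95% threshold below are all illustrative assumptions, not a Bedrock feature.

```python
# Hypothetical promotion gate for swapping models behind a single gateway.
CURRENT = "anthropic.claude-v2:1"
CANDIDATE = "anthropic.claude-3-haiku-20240307-v1:0"
THRESHOLD = 0.95   # candidate must hit at least 95% of the incumbent's score


def maybe_promote(run_eval_suite, route_config: dict) -> dict:
    baseline = run_eval_suite(CURRENT)     # evaluation on *your* use cases
    candidate = run_eval_suite(CANDIDATE)
    if candidate >= baseline * THRESHOLD:
        # Keep the old model configured as a fallback: that is the continuity plan.
        route_config.update(primary=CANDIDATE, fallback=CURRENT)
    return route_config
```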

New ways to optimize your digital workforce

So back when I wrote that, and the cost-saving post associated with it, a whole (checks calendar) five months ago, the thinking was to minimize the use of the LLM, to put intelligent caching in place, and, when you do use an LLM, to use a cascade approach to drive cost savings. But at the end of that article I talked, very briefly, about a new technique where you would use a purpose-based approach to route requests, for instance maths, to specific models or approaches.

Well, since then we’ve seen quite the evolution, and a big shout out to WeiWei Feng and Marek Sowa, who have been talking me through two different approaches to routing. I’ll attempt to describe them simply here, so apologies to those two if I get something wrong. First off, a summary of the ‘old’ approach:

LLM Cascade

The idea of an LLM cascade is ‘simple’: the biggest models cost the most money, but many requests can be answered by simpler models that cost a lot less. Therefore, if you cascade requests from the cheapest model up to the most expensive, stopping when you get a valid answer, you can get significant cost savings (up to 80%).
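
A minimal sketch of that loop, where `call_model` and `is_valid` are placeholders for your own model client and answer-validation logic:

```python
# Try the cheapest model first and only escalate when the answer fails validation.
def cascade(prompt: str, models: list[str], call_model, is_valid):
    answer = ""
    for model in models:                 # ordered cheapest -> most expensive
        answer = call_model(model, prompt)
        if is_valid(prompt, answer):     # stop at the first acceptable answer
            return model, answer
    return models[-1], answer            # fall through to the biggest model's answer
```

Called with something like `cascade(question, ["haiku", "sonnet", "opus"], call_model, is_valid)`, most traffic never reaches the expensive end of the list, which is where the savings come from.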

There are other clever bits in this particular approach, like caching and prompt optimization, which made it the best way to cost optimize five months ago. It remains a very effective way of cost optimizing in some scenarios and is one of the easiest to implement, if you’ve got the right framework, and it has the real advantage that adding new models into the cascade, and removing them from it, is pretty much a one-step job. So for some scenarios you are still going to want to stick with this.

Purpose Specific Routing

The next approach I’m going to call Purpose-Specific Routing: we build an intelligent agent which intercepts requests, works out their purpose, and directs them to task-specific models. The easiest example is someone asking for the answer to “15 * 48 + 3”; that is something an LLM could do, but should not, as it is a massive waste of cost and energy. The idea in purpose-specific routing is that I have a system of systems, and for some tasks I want to direct requests to specific LLM, or non-LLM, solutions. This is one flavour of the growing family of agentic LLM solutions, but it has the advantage of not adding the massive unhandled cyberthreat that can come with those approaches.

So in this approach I need to be able to configure the router to understand the different tasks and distribute the work to its ‘team’. At this stage you are cost optimizing and outcome optimizing based on your explicit direction, although you could chain this approach with a cascade, so that calling the ‘full-fat’ model might actually mean calling a cascade. An advantage of this approach is that you can add new models and decommission old ones pretty quickly, as you just need information from experimentation and development to alter your routing configuration.
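
A sketch of that router, where the purpose classifier, the route table and the maths short-circuit are all assumptions for illustration:

```python
import re

# Illustrative route table: purposes mapped to the models (or non-LLM tools)
# that should handle them. The "general" entry could itself be a cascade.
ROUTES = {
    "maths": "calculator",
    "code": "code-specific-model",
    "summarise": "small-cheap-model",
    "general": "full-fat-model",
}


def route(prompt: str, classify_purpose, call_model) -> str:
    purpose = classify_purpose(prompt)            # a small classifier or rules engine
    target = ROUTES.get(purpose, ROUTES["general"])
    if target == "calculator" and re.fullmatch(r"[\d\s+\-*/().]+", prompt.strip()):
        return str(eval(prompt))                  # "15 * 48 + 3" never needs an LLM
    return call_model(target, prompt)             # hand off to the chosen model
```

The explicit table is the point: you, not the model, decide which member of the ‘team’ handles which purpose, which keeps the attack surface much smaller than a fully agentic loop.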

Experience-based routing

The next technique requires consistent evaluation metrics for your models against your use cases, across large numbers of runs, which you then use to train a ‘traditional’ machine learning model that recommends the right model to use for a given request. Again, this approach can be used in conjunction with a cascade, but the objective is to start with the model that is most likely to return a successful result at the lowest price.

This doesn’t simply mean “lowest price”; it is also optimizing for the outcome, but where two models have a similar likelihood of success you get both the performance optimization and the cost saving.

This approach can be thought of as applying a recommendation engine to the routing challenge. In the same way that we use large volumes of customer transactions to optimize recommended products, here we use the evaluation metrics of model executions to create the recommendation, and just as with a recommendation engine you can optimize for conversion (the performance), the profit (the cost), or a balance of both.

Now an advantage of this approach is that it can constantly update, taking the latest set of requests and results and systematically refreshing the routing model. The challenge is that it needs that request and result data, so if you are introducing new models you will need a pipeline in place that replays historical requests and generates results before deploying them into this sort of optimization.
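
As a very stripped-down sketch of the idea, here is a routing recommender built from logged runs. The log fields and the `alpha` cost/quality trade-off knob are assumptions, and a real implementation would train a proper model rather than take per-task averages:

```python
from collections import defaultdict


def build_router(run_log, alpha=0.5):
    """Recommend a model per task from logged outcomes (illustrative only)."""
    stats = defaultdict(lambda: {"runs": 0, "wins": 0, "cost": 0.0})
    for run in run_log:          # each run: {"task", "model", "success", "cost"}
        s = stats[(run["task"], run["model"])]
        s["runs"] += 1
        s["wins"] += int(run["success"])
        s["cost"] += run["cost"]

    def recommend(task: str) -> str:
        candidates = {m: s for (t, m), s in stats.items() if t == task}

        def score(model: str) -> float:
            s = candidates[model]
            # Success rate minus a cost penalty: alpha balances "conversion"
            # (quality) against "profit" (cost), just like a recommendation engine.
            return s["wins"] / s["runs"] - alpha * s["cost"] / s["runs"]

        return max(candidates, key=score)

    return recommend
```

Rebuilding the recommender on a schedule, with the latest runs appended to the log, is what gives this approach its constant-update property.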

New features, new risks

Another ‘great’ thing about the ever-increasing power of models is that they can do more and more things that you absolutely don’t care about as a business. Llama 3 apparently plays chess just as well as GPT-4, which sounds cool until you think “hang on, someone could phone the call center and play chess for hours with me paying for it?”, or how about an essay on the issues and challenges of the 17th President of the United States? These are all ‘features’ that are available in models today, and the more of these ‘features’ models gain, the greater the opportunity for cyber-risk and wasted cycles.

The other ‘fun’ thing about models is that these new features are often discovered long after release, because there is no complete list of everything a model can be asked and will respond to. This means you need to constantly keep up to date with which features you don’t want turned on.

This means that in operations you need to constantly monitor what models are being used for and identify where there are potential risks, either in terms of cost burn or in terms of cyberthreat.
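
A sketch of that operations-side check, where the purpose classifier, the allow-list and the alerting hook are all assumptions:

```python
# Flag requests whose detected purpose falls outside what the solution was
# actually deployed to do.
ALLOWED_PURPOSES = {"order-status", "billing", "product-question"}


def monitor(prompt: str, classify_purpose, alert) -> None:
    purpose = classify_purpose(prompt)
    if purpose not in ALLOWED_PURPOSES:
        # Chess games with the call-centre bot burn tokens; stranger requests may
        # be probing for a weakness. Either way, operations should know about it.
        alert(purpose=purpose, prompt=prompt)
```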

LLM operations is an entire business challenge

The point of this is that driving LLM operations, and getting tens and then hundreds of LLMs into production, requires a level of industrialization and adaptation that most businesses are not only not used to, but are woefully unprepared for. While there are literally billions in funding going into new models, new features and new approaches to execution, very little of it is actually aligned to the dull challenge of managing lots of these solutions across a company. This is where I love my job: I get to speak to super smart people from AWS and Anthropic, and work with super smart people like Mark, Bikash, WeiWei and Marek (and many others), and together we look at how to solve that problem.

This is where RAISE, which was soft-launched last year and hard-launched last month, is developing at an incredible pace. At the soft launch there was ‘just’ the cascade approach to support multi-model cost optimization, with prompt optimization and caching to support both single- and multi-model optimization, and the construction of the whole “team” around the LLM that makes it successful.

This focus on the operationalization of LLM solutions is really important, and the fact that there is a way of adding in new LLM infrastructure capabilities to existing solutions is the only way that companies can hope to manage the increasing number of deployed solutions in a way that has any degree of operational control.

This is something that any company that wants to actually scale GenAI needs to think about: not the shiny models (there are literally billions invested in those already) but how you as a company manage those models and handle the unique operational challenges of GenAI and the insane pace of evolution in this market.



My job is to make exciting technology dull, because dull means it works. All opinions my own.