Building an LLM Digital Turk

Making the business case means one size doesn’t fit all

Steve Jones
7 min read · Jan 8, 2024

You wouldn’t ask Michelangelo to emulsion a ceiling, Tesla to change a light bulb, or Turing to do your taxes. You’d also not choose a Ferrari to help you move house, or a Ming vase to bash in a nail. Expensive doesn’t mean it’s the right tool for the job, and the current and next generation of massive-scale LLMs pose a very real cost/benefit question as they aim to become ever more powerful generalists.

This means that they cost more and more, but might not actually be adding facilities that you need as a business. As the Economist has said, in 2024 we will see lots of smaller, faster and more specialized models, and already today an awful lot of business use cases do not require a generalist LLM.

So the challenge is: how do you get the benefits quickly, but without the costs?

Prototype on a full-fat model, then tune

Premature optimization is the root of all evil

This famous saying by Sir Tony Hoare, popularised by Donald Knuth, remains as true today as ever. Trying to find the optimal LLM at the same time as trying to find the best solution to your business problem is a route to madness. So don’t.

Prototype using a full-fat mega-model LLM (HLM — Humongous Language Model?), prove out the value case, and then worry about the business case.

The value case says it’s worth investing in a business case

The value case is the piece that says “if we do X then we can make the company $Y” or “If we do A then we can save $Z”. Whether it is worth doing is down to the business case, and the business case, in simple terms, is whether the cost to do X or A is less than the savings.

[Table: Azure OpenAI prices per 1,000 tokens by model and context size. GPT-3.5 with a 4K context is $0.002 per 1,000 completion tokens, while GPT-4 with a 32K context is $0.12 per 1,000 completion tokens.]

So while you’ve prototyped on GPT-4, if a 4K context is enough and GPT-3.5 is good enough, then it’s less than 2% of the cost. If the value case is $1m per year, and your PoC suggests you will need 10bn completion tokens and 1bn input tokens per year to achieve it, then you are looking at around -$250k as a business case for GPT-4 32K; on GPT-3.5 it would cost only around $21,600, so you keep nearly the full $1m a year.
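To make the arithmetic explicit, here is a minimal sketch of that calculation in Python. The completion prices are the ones quoted in the table above; the input prices ($0.06 per 1,000 tokens for GPT-4 32K and $0.0015 for GPT-3.5 4K) are assumptions based on the Azure list prices of the time, and the volumes are the illustrative annual figures from the PoC.

```python
# Rough business-case arithmetic for the example above.
VALUE_CASE = 1_000_000               # $1m of value per year
COMPLETION_TOKENS = 10_000_000_000   # 10bn completion tokens per year
INPUT_TOKENS = 1_000_000_000         # 1bn input tokens per year

def annual_cost(input_price_per_1k, completion_price_per_1k):
    """Annual spend given per-1,000-token prices."""
    return (INPUT_TOKENS / 1000) * input_price_per_1k \
         + (COMPLETION_TOKENS / 1000) * completion_price_per_1k

gpt4_32k = annual_cost(0.06, 0.12)      # ~$1.26m, so the business case goes negative
gpt35_4k = annual_cost(0.0015, 0.002)   # ~$21.5k, so nearly the full $1m is kept

print(f"GPT-4 32K:  cost ${gpt4_32k:,.0f}, business case ${VALUE_CASE - gpt4_32k:,.0f}")
print(f"GPT-3.5 4K: cost ${gpt35_4k:,.0f}, business case ${VALUE_CASE - gpt35_4k:,.0f}")
```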

So a first simple check is whether a cheaper model can meet the value case. If it is a binary question (one model can handle the whole solution, yes or no) then you are done.

Business challenges aren’t always binary

The problem is that, within a single solution, some parts (specific prompts) are best done against one engine, and others against another. You could spend a lot of time and money trying to work out those nuances, only to then find out there are edge cases you hadn’t considered.

So what you really need is the ability to dynamically match the LLM to the task in order to manage the cost of what it does.

You need a Digital Turk

What you need is something that can do three things:

  1. Don’t hit the LLM at all if you can avoid it
  2. Don’t send long meandering and vague requests
  3. Optimize the call based on cost

Think of your available, and growing, set of LLMs (and other models) as a workforce: each one has skills, some more than others, and each one comes with a rate card. Your objective is to offer the work out to the LLMs in a way that helps you optimize the price.

The original “Mechanical Turk” was a fraudulent machine that played chess. The fraud? A person was sitting inside it. Amazon took this concept to create the Amazon Mechanical Turk, where you can put work out digitally for humans to complete. What we need today is that sort of approach, but with a growing army of LLMs competing for the work.

Don’t hit the LLM if you can help it

The first part is an age-old technical cost optimizer — caching — with a minor nuance. Because we are dealing in natural language we want to key the cache on equivalent questions, not exact string matches. Fortunately there are decent NLP models out there that cost almost nothing to run and can do this sort of thing. So we can determine (from our “GPT understands nothing” test)

  1. Can you tell me what the fastest flying mammal is please?
  2. what is the fastest flying mammal
  3. fast flying mammal

That each of these means the same thing. So if my smart LLM cache can parse the prompt and determine we already have that answer, I can pull the response directly from the cache and save a hit on even the cheapest LLM. This is particularly important where the cached result came from the most expensive LLM.
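A minimal sketch of that kind of equivalence cache, assuming the sentence-transformers library for embeddings (any cheap sentence-embedding model would do); the similarity threshold and the `call_llm` wrapper are illustrative assumptions, not a definitive implementation.

```python
from sentence_transformers import SentenceTransformer, util

# A cheap local embedding model is enough to spot equivalent questions.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

class SemanticCache:
    def __init__(self, threshold=0.85):   # threshold is a tunable assumption
        self.threshold = threshold
        self.entries = []                  # list of (embedding, response)

    def lookup(self, prompt):
        query = embedder.encode(prompt, convert_to_tensor=True)
        for embedding, response in self.entries:
            if util.cos_sim(query, embedding).item() >= self.threshold:
                return response            # an equivalent question was already answered
        return None

    def store(self, prompt, response):
        self.entries.append((embedder.encode(prompt, convert_to_tensor=True), response))

cache = SemanticCache()

def answer(prompt, call_llm):              # call_llm is whatever model wrapper you use
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached                      # no LLM hit at all
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response
```

In production the linear scan would be a vector index rather than a Python list, but the principle is the same: equivalent questions hit the cache, not the model.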

Describe in your own words

The next challenge is to turn the prompt into something that is more likely to get the right result, and to do so with fewer tokens. DALL-E 3 actually does this automatically for you, except for the token optimization.

If we ask:

“Create an image of Michelangelo and Leonardo da Vinci being painting the kitchen ceiling of a suburban kitchen”

We get:

[Image: DALL-E 3’s result, three Renaissance painters painting a suburban kitchen ceiling; there are clearly three people in the picture, not the two asked for.]

And when we ask for the prompt that DALL-E 3 used it gives us:

A humorous scene in a suburban kitchen with Michelangelo and Leonardo da Vinci, both dressed in Renaissance attire, collaboratively painting the ceiling. Michelangelo is on a ladder, painting in his signature style reminiscent of the Sistine Chapel, while Leonardo is mixing paints, looking contemplative. The kitchen is modern, with stainless steel appliances and a central island. Light streams in from a window, casting a warm glow on the scene.

(no idea who the third guy is BTW)

So prompt re-writing should be part of your LLM workflow anyway, and here it’s a way to turn general human statements into specific ones, and to reduce embellished human phrases into simple ones. As a simple calculation, my request was 19 tokens while the internal prompt was 85; compressing an embellished prompt like that one down to its essentials is a saving of nearly 80% on the input tokens.
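One way to get that compression is to let a cheap model rewrite the embellished request before the expensive model ever sees it. A sketch, assuming a hypothetical `call_cheap_llm` wrapper and that the rewrite reliably preserves intent:

```python
REWRITE_INSTRUCTION = (
    "Rewrite the user's request as a short, specific instruction. "
    "Keep every constraint and fact, drop pleasantries and filler, "
    "and reply with the rewritten request only."
)

def compress_prompt(user_request, call_cheap_llm):
    # A cheap model does the rewriting; the expensive model only sees the result.
    return call_cheap_llm(f"{REWRITE_INSTRUCTION}\n\nRequest: {user_request}")

# e.g. "Can you tell me what the fastest flying mammal is please?"
#   -> "What is the fastest flying mammal?"
```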

The other element is to limit the response to what the purpose needs. If I’m asking the LLM to give me a part number then I just want the part number that is referenced in a document. I don’t need the LLM to say “Part number XVY01248345 is for the reverse flange widget with the extended foobars and has been in production since 1848 with modifications in 1953 to deal with the Kaskerperneky order that required a modifiable doohickey, it is referenced in the document as part of an expedited order that was delivered on July 6th, July was named after the Roman Emperor Julius Caesar”, I just need “XVY01248345”.
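The same applies on the way out: tell the model exactly the form you want and cap the completion length. Another sketch, again assuming a generic `call_llm` wrapper that accepts a `max_tokens` limit (most chat APIs offer something equivalent):

```python
EXTRACT_INSTRUCTION = (
    "Extract the part number referenced in the document below. "
    "Reply with the part number only, no explanation."
)

def extract_part_number(document, call_llm):
    prompt = f"{EXTRACT_INSTRUCTION}\n\n{document}"
    # A hard cap on completion tokens stops the model paying for its own digressions.
    return call_llm(prompt, max_tokens=16).strip()
```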

Do I have a volunteer?

There are two approaches for the final stage — choosing the right LLM. The first is to build a sophisticated engine that can interpret the prompts and dynamically work out the single right engine. I dare say these sorts of things will happen, and some will be total snake-oil, but there is a much simpler economic model that you can apply.

If we look at the pricing table above, using just the lowest and highest priced models, the lowest model would only need to succeed on about 1 call in 59 for it to save money to try it first and call the higher model only when it fails. This means for testing I can run an LLM cascade to see what the TCO would be across all my use cases if I run a dual-model set-up, and I can of course extend that cascade to many LLMs, which combined provide a strong financial calculation for how the cascade will operate in production. This then becomes something I can monitor operationally, and something I can add additional cost-management constraints around as required.
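A sketch of that cascade with the rate card and the break-even arithmetic built in; `call_model` and `is_good_enough` are stand-ins for whatever model wrappers and quality checks you actually use, and input-token cost is omitted for brevity.

```python
# Models in ascending price order, with completion price per 1,000 tokens from the rate card.
CASCADE = [
    ("gpt-3.5-turbo-4k", 0.002),
    ("gpt-4-32k", 0.12),
]

def cascaded_answer(prompt, call_model, is_good_enough):
    """Try the cheapest model first, escalate only on failure, and track spend."""
    total_cost = 0.0
    answer = None
    for model_name, price_per_1k in CASCADE:
        answer, completion_tokens = call_model(model_name, prompt)
        total_cost += (completion_tokens / 1000) * price_per_1k
        if is_good_enough(prompt, answer):
            break                      # good enough, no need to escalate further
    return answer, total_cost

# Break-even check for the two-model case: the cheap model only needs to succeed on
# more than 0.002 / 0.12 of calls (roughly 1 in 60) for the cascade to beat always
# calling GPT-4 32K, which is where the "1 in 59" figure above comes from.
```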

[Flowchart: an LLM optimizer that checks a prompt/response cache first, then escalates through a series of increasingly capable models to generate a high-quality answer, storing results for future use to conserve resources.]

Sometimes the answer isn’t an LLM

The final point to raise on optimizing LLMs is to be clear that you actually need an LLM: a SQL lookup, a standard calculator, or a specialized deep-learning model can be a better, more efficient and cheaper solution for a specific challenge than an LLM. It is worth asking that question before you start. It doesn’t mean a technology solution, including AI, isn’t the right answer, just that an LLM isn’t a universal tool. Sometimes in that prompt adaptation you might process the request and be able to say “This is a basic maths equation, I’m not even going to use an LLM, because computers are very good and very cheap at basic maths” and then just call a basic calculator.
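A sketch of that short-circuit: spot a request that is just arithmetic and evaluate it directly rather than paying for a model at all. The detection rule here is deliberately crude and is an assumption for illustration, as is the `call_llm` fallback.

```python
import ast
import operator
import re

# Only plain arithmetic expressions get the short-circuit.
ARITHMETIC = re.compile(r"^[\d\s\.\+\-\*\/\(\)]+$")
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.USub: operator.neg}

def _eval(node):
    """Evaluate a parsed expression tree containing only basic maths."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp):
        return OPS[type(node.op)](_eval(node.operand))
    raise ValueError("not basic maths")

def route(prompt, call_llm):
    if ARITHMETIC.match(prompt.strip()):
        try:
            # Basic maths: computers are very good and very cheap at this.
            return _eval(ast.parse(prompt, mode="eval").body)
        except (ValueError, KeyError, SyntaxError, ZeroDivisionError):
            pass
    return call_llm(prompt)   # everything else still goes to a model
```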


My job is to make exciting technology dull, because dull means it works. All opinions my own.