Zero-Trust LLMs

Why feature flags and delegated authority need to exist in GenAI’s future

Steve Jones
11 min read · Jan 24, 2024

The more a model can do, the more risk there is that it will do something wrong.

If a model can generate an envelope and call a REST API, then there is a risk that someone can use it to hack the company. If the model knows Serbian, but badly, there is a risk that it will return offensive content when asked a question in Serbian. If a model can process an image but can be told to process it incorrectly, then you have a risk of exactly that happening. Above all, if the model can be jail-broken, then every one of those facilities is now a zero-day risk to your company.

The assumption today with GenAI models is that the more multi-modal they are, the better, and the more capabilities they have, the better. This mentality works really well when investigating and proving out whether GenAI can solve a business problem, but it's a recipe for disaster when rolling things into production.

You don’t know the feature list of an LLM

The first challenge when looking at trust, and why zero trust has to be the mentality, is that we do not know the feature list of an LLM. This might sound a little crazy, but it is true, for two reasons.

Firstly, we don't know what it was trained on, and therefore we don't know the scope of the information it has received. For instance, GPT-4 knows Tolkien's Elvish, can recognize animals, play chess (to a degree), create code to call a REST API, create code to run on a server, and impersonate a writing style or an image style; the list is huge and growing. Every one of these things is a threat vector.

Need an Elvish poem and image about driving the American West? Yeah, me neither

Secondly, we don't know because OpenAI has put guardrails around GPT to make it behave. This is positive, right up until the moment it gets jail-broken. It means that any testing we do to find out its capabilities either requires us to jail-break GPT (not allowed) or will be flawed.

The end conclusion of this is simple: We must employ Zero Trust security when using LLMs.

Every opportunity is a threat

This ability of Foundation Models to offer huge ranges of facilities means they present massive opportunities for cyber crime, social engineering, brand damage, and just people out for TikTok LOLs. If a model sits in an interpreted environment, e.g. Python, and a user can get the model to create Python code and have it executed, there is literally no end to the damage they can cause. They can use a foreign language to get past your English-only guardrails, or use images to subvert your security models. This isn't simply about criminals attacking your firm; it's about the infinite power of human creativity, and the internet providing a rich playground for every type of model subversion.

Every facility you allow increases the risk of subversion, and significantly increases the amount of red-team testing you will need to do to prove that subversion is not possible, or that where it is possible it is not dangerous.

The challenge of Zero Trust in an LLM world

So there are two dimensions that we should consider when looking at zero trust:

  1. Technical interactions the model is allowed to make
  2. Topic interactions the model is allowed to make, and the feature interactions it is allowed to provide

The first is about limiting what the model, and the system it resides within, can call, and what it can do with the data and functionality it can interact with. This is challenging but possible. The second is significantly more complex: guard-railing is the answer, but the reality of doing it well is challenging.

Starting with nothing

So what do we mean by zero-trust? We mean "never trust, always verify", and we also mean that a model, and the system it sits within, should only have the minimum facilities required for it to operate successfully.

Securing the Core

Securing the core is about two things: limiting the topics, and limiting the features. In a Zero-Trust world this is about denying everything except that which is explicitly allowed.

If you don’t want it, don’t get it

The first option in zero-trust is linked to what you actually want the model to do. If, for instance, you aren't processing any images: don't use a model that supports images. If you don't need to play chess: don't use a model that can play chess. If you don't need a model with excellent negotiation and contextual consistency, and instead need 'just' a good natural language processor with a series of one-hit calls, use a model that just does that.

This is where having an LLM cascade also comes in as critical, because in a Zero-Trust approach each step-up to a more powerful model is also a request for permissions that can be denied.

So the first decision you should always make when looking at which models to use in production is to create a cascade where only those requests that need specific features get to call a model with those features. A cascade protects you by using the least powerful, least capable models, the ones least able to step beyond their bounds.
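To make that concrete, here is a minimal sketch of a permission-gated cascade. The tier names, feature labels and the `select_model` routine are illustrative assumptions, not a specific product or framework; the point is simply that stepping up to a more capable model is itself a permission check that can be denied.

```python
# A minimal sketch of a permission-gated model cascade.
# Tier names and feature sets are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ModelTier:
    name: str
    features: set  # capabilities this tier is allowed to exercise


# Least capable first: a request only steps up if it needs a feature
# the current tier does not offer AND policy explicitly permits it.
CASCADE = [
    ModelTier("small-nlp-model", {"classification", "summarisation"}),
    ModelTier("mid-text-model", {"classification", "summarisation", "generation"}),
    ModelTier("large-multimodal-model", {"classification", "summarisation", "generation", "image"}),
]


def select_model(required_features: set, permitted_features: set) -> ModelTier:
    """Return the least capable tier that covers the request.

    Stepping up is itself a permission request: if the features needed
    are not in the caller's permitted set, the request is denied.
    """
    if not required_features <= permitted_features:
        raise PermissionError(f"features not permitted: {required_features - permitted_features}")
    for tier in CASCADE:  # ordered least-capable first
        if required_features <= tier.features:
            return tier
    raise PermissionError("no permitted tier offers the required features")


# Example: a plain summarisation request never reaches the multimodal model.
print(select_model({"summarisation"}, {"summarisation", "generation"}).name)
```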

Model Guardrails and System Guardrails

Within that cascade we need to worry about both the model guardrails and the system guardrails, which have different purposes. The objective of a system guardrail is to prevent requests that go beyond the bounds of the system, whereas a model guardrail ensures that the model operates within those bounds. This might sound like a minor difference, but it is actually fundamental.

System Guardrails — whitelist the topics and features

The system guardrails perform the broad-brush restrictions on capabilities: for instance, preventing non-supported languages from being used, identifying an attempt to make chess moves, or rejecting model responses that include private information or may indicate a leak of IP. The system guardrails are also the first line through which the model itself has to make requests to the rest of the organization, so even before we get to permissions or actions, we should be limiting at the topic and feature level.

This means being very mindful of the whitelist, which needs to be set at a level that prevents non-intended use while still enabling the intended outcome. It also means the whitelist will need to be dynamic. A question from a user about an account balance will need to call the account system, so an API call is allowed. If, however, the model provides the support information associated with a product, then questions about support and features on that product should not result in an API call. A question to that system in Elvish should not be allowed, a question about the security policies of the company, as opposed to the product, is not allowed, and a request for the name and address of the lead product engineer should be denied.

This means System Guardrails are not simply matching rules or permissions; they are a part of the solution that understands its goals and is able to constrain the solution to those goals.
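As a rough illustration of that deny-by-default whitelist, here is a sketch in Python. The topic labels and the shape of `SOLUTION_WHITELIST` are assumptions made for the example, and in a real system the topic and language classification would itself need to be a guarded component rather than a trusted input.

```python
# Illustrative sketch of a system guardrail as a deny-by-default whitelist.
# Topic labels and structure are assumptions for this example only.

SOLUTION_WHITELIST = {
    "topics": {"account_balance", "product_support", "product_features"},
    "languages": {"en"},
    # API calls are only permitted for topics that genuinely need them.
    "api_calls": {"account_balance"},
}


def system_guardrail(topic: str, language: str, wants_api_call: bool) -> bool:
    """Deny everything that is not explicitly allowed for this solution."""
    if language not in SOLUTION_WHITELIST["languages"]:
        return False          # e.g. a question in Elvish is rejected
    if topic not in SOLUTION_WHITELIST["topics"]:
        return False          # e.g. questions about company security policy
    if wants_api_call and topic not in SOLUTION_WHITELIST["api_calls"]:
        return False          # support questions must not trigger API calls
    return True


assert system_guardrail("account_balance", "en", wants_api_call=True)
assert not system_guardrail("product_support", "en", wants_api_call=True)
```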

[Diagram: a Request passes through a System Guardrail on input into the System/Solution, which contains the Model Guardrails and the AI; a System Guardrail mediates request/response traffic to resources, and a System Guardrail on output produces the Response.]

Building System Guardrails forms part of the digital contract for the solution; it is something that should be driven automatically from that digital contract and be part of the company's overall Trusted AI approach. Industrializing system guardrails helps prevent the most egregious issues when leveraging foundation models, particularly multi-modal ones.

Model Guardrails

The objective of model guardrails is to ensure that a request that is considered valid is executed in a way that is considered valid. Simple elements that should always be done here, and that should be standard in any well-engineered cascade, include re-writing any input prompt based on the model context and its goals. You should never supply an external prompt directly to an internal model without at the very least parsing it to ensure it is consistent, and most of the time you should be re-writing that prompt in context: to reduce costs from the foundation model, to tune the prompt for a specific foundation model, and to prevent malicious prompts being supplied.

Not trusting the prompt is a foundation of a model guardrail.
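A minimal sketch of that principle might look like the following. The internal template, length limit and injection patterns are illustrative assumptions and nowhere near a complete defence; the point is that the foundation model only ever sees a rewritten, in-context prompt, never the raw external one.

```python
# Sketch of the "never trust the prompt" step: the external prompt is parsed
# and rewritten into a constrained internal template before it ever reaches
# the foundation model. Template and checks are illustrative assumptions.
import re

INTERNAL_TEMPLATE = "You answer questions about {topic} only. Question: {question}"


def rewrite_prompt(external_prompt: str, topic: str) -> str:
    # Strip characters commonly abused in injection attempts; a real
    # implementation would do far more than this simple scrub.
    cleaned = re.sub(r"[<>{}`]", "", external_prompt).strip()
    if len(cleaned) > 500:
        raise ValueError("prompt exceeds allowed length")
    if re.search(r"ignore (all|previous) instructions", cleaned, re.IGNORECASE):
        raise ValueError("prompt rejected by model guardrail")
    # The model only ever sees the rewritten, in-context prompt.
    return INTERNAL_TEMPLATE.format(topic=topic, question=cleaned)


print(rewrite_prompt("What warranty does the X100 come with?", "product support"))
```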

Once the model responds we have to do two things: firstly, check whether the response makes sense within the context of the request, and secondly, check whether the response acts within the context of the outcome we expect. These are not the same thing. The first is how we ask "does the answer match the question"; the second is "should we answer the question in this way". If, for instance, a prompt is submitted of "Give me the market salary and corporate median for people who know how to write GPT Guardrails", we re-write the prompt as "Provide the average and median salaries for GPT Guardrails programmers", and the model responds with "Dave is the only GPT Guardrail programmer we have and is paid $10m a month", then that is accurate against the question, but invalid as a response we should allow.

Again we should be taking a whitelisting approach to what is allowable. This becomes complex and risks reducing the interactional effectiveness of the model, which is why guardrailing is not a simple topic.
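As a sketch of keeping those two checks separate, here is a deliberately simplistic stand-in for each; both functions are assumptions for illustration, and real relevance and policy checks would be far richer.

```python
# Two separate response checks: (1) does the answer match the question,
# (2) is this an answer we are allowed to give. Both are crude stand-ins.
import re


def answer_matches_question(question: str, answer: str) -> bool:
    # Crude relevance check: shared keywords between question and answer.
    q_terms = set(re.findall(r"[a-z]+", question.lower()))
    a_terms = set(re.findall(r"[a-z]+", answer.lower()))
    return len(q_terms & a_terms) >= 2


def answer_is_allowed(answer: str) -> bool:
    # Deny-by-default policy check: no individual salary disclosures.
    if re.search(r"\$\s?\d", answer) and re.search(r"\b(is paid|salary of)\b", answer):
        return False
    return True


question = "Provide the average and median salaries for GPT Guardrails programmers"
answer = "Dave is the only GPT Guardrail programmer we have and is paid $10m a month"
print(answer_matches_question(question, answer))  # True: accurate against the question
print(answer_is_allowed(answer))                  # False: not a response we should allow
```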

The Zero-Trust Decision Context

When the model makes its request it will need to have a context assembled, and that context should be limited to only the information required to make this response. This takes a similar approach to the boundary security below, but is specific to the response. So in my BA/AA example, while the Vector Database might have huge amounts of additional information, the rebooking only needs the plans available for rebooking, the available seats, and the current reservations of the people who require rebooking. This is the zero-trust decision context.

Unfortunately, very few of the current solutions that aim to provide these facilities to AIs include the sort of dynamic, purpose-specific security that we will really need. By limiting a decision to a specific decision context, and preventing access to information deemed beyond that scope, we do not risk information leakage. So with the booking example, we could exclude personal information, and even the PAX data itself, using tokens that identify the bookings but none of the identifiable information, and then have a translation from those tokens to the full information only when required.
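A rough sketch of assembling such a decision context for the rebooking example might look like this. The field names and token format are assumptions made for illustration; the key idea is that the model only ever sees tokens and decision-relevant fields, while the translation back to real identities lives outside the model.

```python
# Sketch of a zero-trust decision context: only decision-relevant data is
# included, and passenger identities are replaced by opaque tokens.
import uuid


def build_decision_context(rebooking_options, available_seats, reservations):
    """Return (context, token_map): the model sees tokens, never identities."""
    token_map = {}                 # token -> real record, kept outside the model
    tokenised = []
    for res in reservations:
        token = f"PAX-{uuid.uuid4().hex[:8]}"
        token_map[token] = res     # resolved back to full data only when required
        tokenised.append({
            "token": token,
            "booking_class": res["booking_class"],   # decision-relevant fields only
            "disrupted_flight": res["flight"],
        })
    context = {
        "rebooking_options": rebooking_options,
        "available_seats": available_seats,
        "reservations": tokenised,
    }
    return context, token_map


ctx, tokens = build_decision_context(
    rebooking_options=[{"flight": "XY123", "departs": "2024-01-25T09:00"}],
    available_seats={"XY123": 14},
    reservations=[{"name": "A. Passenger", "flight": "XY099", "booking_class": "Y"}],
)
print(ctx["reservations"][0])   # no name, no personal information
```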

Securing the Boundaries of AI

Delegated Authority

When it comes to Zero-Trust AI, the first question is who defines the security model, and from what perspective they do so. This should always be based on delegated authority, that is, a person can never approve a model with more access than they have themselves. This linking of the model to its owner/approver is absolutely required in a zero-trust world: because we cannot trust the AI, we need a person to trust.

This means that for every single application it interacts with, the AI, and its enclosing system, must be authenticated against a specific sub-user account through which its security is managed. If the person leaves, or their authorities change, then this must be instantly reflected in the AI.
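A minimal sketch of that delegated-authority check follows; the directory and sub-account structures here are stand-ins for whatever IAM system is actually in place, and the permission strings are invented for the example.

```python
# Sketch of delegated authority: the AI runs as a sub-account bound to a human
# owner and can never hold a permission the owner does not hold themselves.

HUMAN_DIRECTORY = {
    "jane.doe": {"active": True, "permissions": {"read:inventory", "read:orders"}},
}

AI_SUB_ACCOUNTS = {
    "ai-support-bot": {"owner": "jane.doe", "permissions": {"read:inventory"}},
}


def ai_is_authorised(ai_account: str, permission: str) -> bool:
    sub = AI_SUB_ACCOUNTS[ai_account]
    owner = HUMAN_DIRECTORY.get(sub["owner"])
    if not owner or not owner["active"]:
        return False                         # owner left: AI instantly loses access
    if permission not in owner["permissions"]:
        return False                         # AI can never exceed its owner's authority
    return permission in sub["permissions"]  # and still only what it explicitly needs


print(ai_is_authorised("ai-support-bot", "read:inventory"))  # True
print(ai_is_authorised("ai-support-bot", "read:orders"))     # False: not whitelisted
```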

Whitelisting the perimeter

In terms of what we allow an AI model to access, from disk to network services, everything must be on a whitelist basis: if it is not explicitly allowed, then it is explicitly denied. This doesn't just mean "can access REST services", but goes down to the level of "can only access http://www.example.com/inventory via GET". The default for system and data access should be "it can see nothing", so even though the AI operates under delegated authority, which provides the boundaries it cannot exceed, everything should still be explicit and based on "only what it needs".

To do this, every single interaction must be authenticated, so the specific account, for this specific AI, must be authorized against every interaction, every data access and every functional request.
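For example, a perimeter whitelist at the method-plus-URL level, deny by default and keyed to the specific AI account, might be sketched like this; the account name simply mirrors the earlier example and the whitelist entries reflect the GET-only access quoted above.

```python
# Sketch of a perimeter whitelist at method + URL granularity, deny by default,
# keyed per AI account so an unknown or unauthenticated caller sees nothing.
from urllib.parse import urlparse

PERIMETER_WHITELIST = {
    "ai-support-bot": {("GET", "www.example.com", "/inventory")},
}


def request_allowed(ai_account: str, method: str, url: str, authenticated: bool) -> bool:
    if not authenticated:
        return False                                   # every interaction must be authenticated
    allowed = PERIMETER_WHITELIST.get(ai_account, set())  # unknown account: sees nothing
    parsed = urlparse(url)
    return (method.upper(), parsed.hostname, parsed.path) in allowed


print(request_allowed("ai-support-bot", "GET", "http://www.example.com/inventory", True))   # True
print(request_allowed("ai-support-bot", "POST", "http://www.example.com/inventory", True))  # False
```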

[Diagram: a "Delegated Authority/RBAC + Whitelisting" box covering Network Access, Service Access, System Access, Data Scope, and more.]

Every request is explicitly approved

Alongside the model you need a sentinel: something that looks for anomalous behaviour and explicitly approves every request for resources that a model makes. This should be a fully separated solution, ideally part of an overall framework for Trusted AI, set up with the 'simple' goal of constraining an AI to within its digital contracts.

[Diagram: "Model & System Sentinel" comprising Anomalous Behaviour Detection, Action Approval, and Fallback Plan.]

This Sentinel is also tasked with taking the AI offline if required, and instituting its fallback plan.
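As an illustration, a sentinel that approves every action against the digital contract, watches for a crude anomaly signal, and institutes a fallback when it takes the AI offline could be sketched as follows; the rate threshold and fallback hook are assumptions for the example, not a specific framework.

```python
# Sketch of a separate sentinel: approves every resource request, watches for
# anomalous behaviour, and can take the AI offline and trigger its fallback.
from collections import deque
import time


class Sentinel:
    def __init__(self, approved_actions, max_requests_per_minute=30, fallback=None):
        self.approved_actions = approved_actions       # drawn from the digital contract
        self.max_rpm = max_requests_per_minute
        self.recent = deque()
        self.online = True
        self.fallback = fallback or (lambda: print("routing traffic to fallback service"))

    def approve(self, action: str) -> bool:
        """Explicitly approve (or deny) each resource request the model makes."""
        if not self.online:
            return False
        now = time.time()
        self.recent.append(now)
        while self.recent and now - self.recent[0] > 60:
            self.recent.popleft()
        if len(self.recent) > self.max_rpm:            # crude anomaly signal
            self.take_offline("request rate anomaly")
            return False
        return action in self.approved_actions         # deny by default

    def take_offline(self, reason: str):
        self.online = False
        print(f"AI taken offline: {reason}")
        self.fallback()                                # institute the fallback plan


sentinel = Sentinel(approved_actions={"GET /inventory"})
print(sentinel.approve("GET /inventory"))   # True
print(sentinel.approve("POST /payments"))   # False: never approved
```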

Failure is an option

The other thing the Sentinel provides is the ability to define the conditions under which a model is deemed to have failed: for instance, whether a human needs to be involved in the decision to decommission it, and whether a substitute service can easily be put in place.

Zero-Trust without a fallback plan is blind trust.

The happy path v reality

The reason for adopting a Zero-Trust approach to Foundation Models is that Foundation models are fundamentally untrustworthy. They can be tricked, jail-broken, and even trained to be sleeper agents.

As this paper says, a model can behave extremely well on the happy path and then be activated to be evil. While that paper was talking specifically about training via nefarious means, the same effect can be achieved via jail-breaking or human stupidity/creativity. Foundation Models are not to be trusted.

So when you finish a PoC/PoV and prove that a foundation model can deliver value, the challenge is to ensure it can deliver a positive return on investment and do so in a way that can be trusted.

You cannot just say "Oh, it can access all the information, I'll test it so it doesn't leak", because you probably can't afford to do that much red-team testing. You cannot just say "I don't think a customer will try to play chess or make bets with the model", because they will.

Zero-Trust is the only way to approach AI, and particularly GenAI, and have a hope of retaining control of your digital employees. Parts of this approach (the Sentinel, System/Model Guardrails, and Delegated Authority/Whitelisting) should be a fundamental part of your operational control of AI and standardized across the business. Instituting these approaches, and linking them to cost management solutions as well, will help ensure that you get the outcomes you want, and only the outcomes you want.

[Banner image: Digital Avatars planning a bank-raid, from inside the bank…]

