Why LLMs mean bad code has never been easier

Buy Cloud Shares, because compute is going to go through the roof

Steve Jones
7 min read · Jan 13, 2023

I saw a thread today that gave me pause. It was positing that, thanks to LLMs, people won't need to understand SQL.

Now I think this is interesting for three reasons:

  1. Not the first time this has been said
  2. The technical challenge of efficiency
  3. The challenge of translating language into outcomes

Not the first time, won’t be the last

At Uni, so we’re talking EARLY 90s here because I’m ancient, I took a course called “Natural Language Interfaces to Knowledge Based Systems”, which was a course in how to understand natural language, how to understand “knowledge”, and how to enable the former to be used to access the latter. The professor who took the course was also the one who had taken our database course the previous year, and in his words:

“Some of you are here because you took my database course, that course was in technical language ‘a piece of piss’, this course is ‘an absolute bastard’”

And so it proved. Since then we’ve seen people embark on the quest of enabling people to efficiently interact with data without needing to understand the technical aspects. There are two parts to this challenge, the first is the query part, and the second is the representation or explanation part. Tools like Toucan Toco help with the second part, and most modern BI tools like Power BI have natural language extensions.

So this is an age-old journey, one where there has been significant progress, and one where, for general, lightweight, workload-managed querying, it can work today. Therefore it’s worth looking at the two biggest issues with this really replacing technical skills for all usage.

Answers are easy, efficient answers are hard

Most examples in this space are like those that Data Virtualization companies like to use: small, limited schemas with very clear join criteria via keys. That, however, isn’t the reality of the challenge, which is where the technical aspects really come in, and why I’m not a fan of Data Virtualization.

Generation technologies suffer the same fate: they can create code that runs, but that doesn’t mean it’s going to be efficient. One of the largest parts of this is adding temporary tables, caching, and materialized views, and of course recognizing when a business process or schema needs to be changed to make the analytical side work. Even technical SQL-to-SQL translators, when moving from one engine to another, can suffer this fate, where the engine of one database vendor optimizes differently to another’s, meaning that a straight SQL transformation just doesn’t work in a feasible amount of time. Theoretically it is possible for an LLM to be trained on lots of different “styles” of databases and thus write queries that are more aligned to each one, but a lot of those optimizations can be low-level specifics of the database design.
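To make the efficiency point concrete, here is a minimal sketch using SQLite and a hypothetical orders table (the table, column and index names are my own, not from any real schema): the exact same query is logically correct with or without an index, but the engine executes it completely differently, and a generated query rarely arrives with its supporting indexes.

```python
# A sketch of "correct but inefficient", using SQLite's query planner
# on a hypothetical orders table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT)")

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index, the plan is a full table scan of every row.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before)

# The index is exactly the kind of database-design knowledge a
# generated statement doesn't bring with it.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after)
```

Same SQL, same answer; only the second plan searches via the index instead of scanning the table.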

The creation of SQL code from natural language, or pseudo-code, is not a new challenge, and while great strides have been made, it’s still an awesome way to burn massive amounts of cloud CPUs, because a generated statement just doesn’t fly even if technically it is going to produce the right answer. So today what we find is that things like metrics and simple queries can be generated, but when it gets complex then it’s only the cloud companies who benefit, which leads us on to the problem of people and complex queries.

Humans can be confidently wrong

So let’s run through an example with ChatGPT:

A relatively simple database schema around customers, orders and shipments
ChatGPT giving Inner and non-inner join options for something that should be an inner join

So the first piece is good: we’ve got an inner or normal join (they’re the same thing) and this is all great. Next up we want to find all people who have open (non-delivered) orders.
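Since the schema only appears as a screenshot, here is a minimal, hypothetical stand-in for it — customers, orders and shipments joined on their keys (all names are assumptions) — showing the inner-join behaviour being discussed:

```python
# A hypothetical, cut-down version of the customers/orders/shipments
# schema, and the kind of inner join ChatGPT produced.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);

INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO orders    VALUES (10, 1), (11, 2);
INSERT INTO shipments VALUES (100, 10, 'Delivered');
""")

# An INNER JOIN (the default JOIN) only returns rows with a match on
# both sides -- order 11 has no shipment row, so it disappears.
rows = conn.execute("""
    SELECT c.name, o.order_id, s.status
    FROM customers c
    JOIN orders    o ON o.customer_id = c.customer_id
    JOIN shipments s ON s.order_id    = o.order_id
""").fetchall()
print(rows)  # [('Ada', 10, 'Delivered')]
```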

ChatGPT creating a query to find non-delivered orders; there is a problem though

Now here comes the first problem. This query assumes that a shipment has been created for every order, so you just need to filter for non-delivered ones. Of course, if there isn’t a shipment yet, the order won’t appear at all, so I need to tell ChatGPT.
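The bug is easy to reproduce on a hypothetical schema (the actual generated SQL is only in the screenshot, so this is a sketch of the shape of the mistake): filtering an inner join for non-delivered shipments silently drops orders that have no shipment row at all, and the usual fix is a LEFT JOIN that also keeps the NULL side.

```python
# Inner-join-plus-filter vs the LEFT JOIN fix, on a hypothetical schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);

INSERT INTO orders    VALUES (1), (2), (3);
INSERT INTO shipments VALUES (10, 1, 'Delivered'), (11, 2, 'In Transit');
-- order 3 has no shipment yet
""")

# The generated-style query: order 3 never appears.
inner = conn.execute("""
    SELECT o.order_id FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
    WHERE s.status <> 'Delivered'
""").fetchall()

# The fix: LEFT JOIN, and also keep orders with no shipment row.
outer = conn.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN shipments s ON s.order_id = o.order_id
    WHERE s.shipment_id IS NULL OR s.status <> 'Delivered'
""").fetchall()

print(inner)  # [(2,)]
print(outer)  # [(2,), (3,)]
```

A user who can’t read the SQL would get the first result back and have no reason to doubt it.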

After being told of the change, ChatGPT does a good job, but there is another problem

Boom goes the query

Ah, but an order is only delivered if ALL of the items ordered are delivered

ChatGPT does a great job of generating this SQL… but…

Now this is getting impressive, but wrong, because it’s just summing up the shipment totals and not filtering for just those where the shipment has been delivered. Basically, the CTE should have a WHERE clause including only the delivered items.
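Here’s a sketch of that missing WHERE clause on a hypothetical schema (table and column names are assumptions): summing shipped quantities per order without filtering for delivered shipments overcounts, so an order can look fully delivered when part of it is still in transit.

```python
# The effect of omitting WHERE status = 'Delivered' from the CTE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines    (order_id INTEGER, product_id INTEGER, quantity INTEGER);
CREATE TABLE shipments      (shipment_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
CREATE TABLE shipment_lines (shipment_id INTEGER, product_id INTEGER, quantity INTEGER);

-- order 1 wants 10 units; 4 delivered, 6 still in transit
INSERT INTO order_lines    VALUES (1, 100, 10);
INSERT INTO shipments      VALUES (50, 1, 'Delivered'), (51, 1, 'In Transit');
INSERT INTO shipment_lines VALUES (50, 100, 4), (51, 100, 6);
""")

# "Fully delivered orders": per-product shipped quantity must cover
# the ordered quantity. The {where} slot is the clause in question.
sql = """
WITH ShippedQty AS (
    SELECT s.order_id, sl.product_id, SUM(sl.quantity) AS qty
    FROM shipments s
    JOIN shipment_lines sl ON sl.shipment_id = s.shipment_id
    {where}
    GROUP BY s.order_id, sl.product_id
)
SELECT ol.order_id FROM order_lines ol
JOIN ShippedQty sq ON sq.order_id = ol.order_id AND sq.product_id = ol.product_id
GROUP BY ol.order_id
HAVING MIN(sq.qty - ol.quantity) >= 0
"""

wrong = conn.execute(sql.format(where="")).fetchall()
right = conn.execute(sql.format(where="WHERE s.status = 'Delivered'")).fetchall()
print(wrong)  # [(1,)] -- order 1 wrongly reported as fully delivered
print(right)  # []     -- nothing is fully delivered yet
```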

Our intrepid non-SQL person doesn’t spot this so continues onwards.

Hey look, two tracks

ChatGPT automatically adds in a “most recent event” clause to the query… but

This is genuinely impressive: it has actually corrected the issue from before, with the CTE now mentioning the delivery status… but wait, the CTEs reference the delivered status and the subquery doesn’t.

These two queries give different results.

Even the CTE one, though technically correct, is also a mistake. I don’t NEED to add the shipment event information into this query; it doesn’t add anything. And while ChatGPT is spot on to just take the latest event, that could be an absolutely massive query. The query is still going to get the right answer, but we’re funneling money to the cloud provider.
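A sketch of that trade-off, again on a hypothetical schema: the shipment’s own status column already answers the question, but the latest-event approach derives the same answer from what could be an enormous event table.

```python
# Reading the status column vs deriving status from the latest event.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipments       (shipment_id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE shipment_events (shipment_id INTEGER, status TEXT, event_time TEXT);

INSERT INTO shipments VALUES (1, 'Delivered'), (2, 'In Transit');
INSERT INTO shipment_events VALUES
  (1, 'Created',    '2023-01-01'),
  (1, 'In Transit', '2023-01-02'),
  (1, 'Delivered',  '2023-01-03'),
  (2, 'Created',    '2023-01-01'),
  (2, 'In Transit', '2023-01-02');
""")

# Cheap: the status is already on the shipment row.
cheap = conn.execute(
    "SELECT shipment_id FROM shipments WHERE status = 'Delivered'").fetchall()

# Expensive: find each shipment's latest event, then filter on it --
# correct, but it touches the whole event history.
expensive = conn.execute("""
    SELECT e.shipment_id
    FROM shipment_events e
    WHERE e.event_time = (SELECT MAX(e2.event_time)
                          FROM shipment_events e2
                          WHERE e2.shipment_id = e.shipment_id)
      AND e.status = 'Delivered'
""").fetchall()

print(cheap, expensive)  # same answer, very different cost
```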

Now I know this, because I know that the delivered status on the shipment line is enough here. But ChatGPT lost that in the previous query, and has rediscovered it here, but only in one of the options.

Boom goes the query again

Now our non-SQL person wants to find out which of those customers have been waiting more than 7 days for their orders…

Asking ChatGPT, and in the query it gets things wrong

You’ll notice that yet again the “Delivered” status has been removed here. It also doesn’t find who has currently been waiting for 7 days, but everyone who has ever waited 7 days. It has also removed the clause checking that everything has been delivered, so someone might have had a partial delivery in the past 7 days, but not a full delivery, and they won’t see that. It has created the right CTE, but then doesn’t actually use that CTE in the query.
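The “ever waited” vs “currently waiting” mistake is easy to show on a hypothetical orders table (names and dates are assumptions): filtering only on order age catches customers whose orders were delivered long ago, while the intended query also needs the not-yet-delivered condition.

```python
# "Everyone who has ever waited 7 days" vs "everyone currently waiting".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT, status TEXT);

INSERT INTO orders VALUES
  (1, 1, date('now', '-30 days'), 'Delivered'),   -- old, but long since delivered
  (2, 2, date('now', '-10 days'), 'In Transit'),  -- genuinely still waiting
  (3, 3, date('now', '-2 days'),  'In Transit');  -- waiting, but under 7 days
""")

wrong = conn.execute("""
    SELECT customer_id FROM orders
    WHERE order_date < date('now', '-7 days')
""").fetchall()

right = conn.execute("""
    SELECT customer_id FROM orders
    WHERE status <> 'Delivered'
      AND order_date < date('now', '-7 days')
""").fetchall()

print(wrong)  # [(1,), (2,)] -- customer 1 shouldn't be here
print(right)  # [(2,)]
```

Both queries run, both return plausible-looking customer lists; only one answers the question that was asked.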

Now to think like a user and really mess with it.

It can handle the split

So we do the classic trick of asking for two things in a single statement, and ChatGPT remarkably splits them.

Now this is amazing: the Product query has included the Event Status filter correctly, but the customer query hasn’t. The customer query has, however, now added in NOW(), so it is just looking at the last 7 days.

Boom goes the query

So let’s do the customer join and find out which customers are impacted by products that haven’t been 14 days late before.

ChatGPT gets it all wrong

Now here it’s lost the plot entirely. It should be the first CTE where the latest status is NOT delivered, or really where the shipment isn’t delivered (no need to go to the event at all). Instead, this will find all delivered shipments that happened over 7 days ago. Now, the LateProduct join in the final query is right, and the CTE is right for the English used (delivered), but we also know that products which aren’t yet delivered and which are already 14 days late will be late. I prompted on that, but it was still failing to recognize the status error.

Summary

ChatGPT generated some correct code and some incorrect code, and it generated inefficient code. The challenge is that you need to know SQL to know that this code is wrong; if a regular user did this, or it was behind the scenes, they’d see results come back but not know that they’re actually incorrect.

It was spectacularly easy to create this variably correct code. Seriously easy. It took a few minutes of my time, and I had badly performing, incorrect code.


My job is to make exciting technology dull, because dull means it works. All opinions my own.