Why Generative Code needs Regression Testing, not Refactoring

Steve Jones
5 min read · Jul 12, 2023

I just threw away six classes on a project I was working on. I wanted to add a few new features, so I simply deleted the classes and regenerated them from scratch. I’d stored my previous prompts in the headers of the files, so I made some modifications, did a few iterations, and the new version was ready.

There was one thing that was the same: the interfaces for the classes.

I’m using interfaces for two reasons:

  1. I think through the structural design of the application first
  2. I can use my own naming conventions

Well, actually those are my old reasons for using interfaces so heavily. I now have some new reasons that stem from using code generators:

  1. Consumers can just write code against the interface; I’m using a standard factory pattern to get the implementation
  2. My test code stays the same

That last one is the real reason that I’m doing it, and I think that as generative coding becomes more and more common we are going to see a real split between those who write proper unit tests (and particularly system tests) and those who are used to just modifying the code file.
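
To make the interface-plus-factory arrangement concrete, here is a minimal sketch; the names are purely illustrative, not lifted from a real project:

    // The domain type and the interface are hand-written; they are the stable contract.
    record Customer(String id, String name) { }

    interface CustomerRepository {
        Customer findById(String id);
        void save(Customer customer);
    }

    // This class is the one the LLM generates; it can be deleted and regenerated
    // at any time without touching consumers or tests.
    class GeneratedCustomerRepository implements CustomerRepository {
        private final java.util.Map<String, Customer> store = new java.util.HashMap<>();
        public Customer findById(String id) { return store.get(id); }
        public void save(Customer customer) { store.put(customer.id(), customer); }
    }

    // A standard factory means consumers never name the generated class directly.
    final class CustomerRepositoryFactory {
        static CustomerRepository create() {
            return new GeneratedCustomerRepository();
        }
    }

    // Consumer code, and crucially the test code, is written purely against the interface.
    class BillingService {
        private final CustomerRepository customers = CustomerRepositoryFactory.create();

        Customer lookUp(String customerId) {
            return customers.findById(customerId);
        }
    }

When the implementation is thrown away and regenerated, only GeneratedCustomerRepository changes; everything else, including the tests, stays put.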

Generative Code is often best thrown away

Sometimes with generative code you should generate once and then treat it as your own, but quite a lot of the time you should start thinking about generative code in the same way you think about a compiler’s output. You don’t edit the binary; you edit the code and then recompile. I’ve been finding that if you take the interface approach you can do the same thing: change the prompt, recreate the whole class.
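
As a rough illustration of what that looks like in a generated file (the header format here is just a sketch, not a fixed convention), the prompt travels with the class so the whole thing can be regenerated at will:

    /*
     * GENERATED CODE. Do not edit by hand.
     *
     * Prompt (kept here so the class can be regenerated at any time):
     *   "Implement CustomerRepository as an in-memory store backed by a HashMap.
     *    findById returns null when the id is unknown; save overwrites any
     *    existing entry with the same id."
     *
     * To change behaviour: edit the prompt, regenerate the whole class,
     * and re-run the regression tests. Never patch the body directly.
     */
    class GeneratedCustomerRepository implements CustomerRepository {
        // ... generated implementation ...
    }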

There are two reasons that I think this approach is a good one:

  1. The LLMs are going to keep improving
  2. You clearly delineate human code from generated code

The latter point matters: if a human ever modifies a single element within a class, they’re now wholly responsible for it. If the code is purely generated, then someone still needs to be responsible, but for the outcome of the code, not the code itself. Again this is similar to a compiler: you’re accountable for the prompt and the outcome, and you are being clear that the internals should be considered purely generative, with the right level of rigour applied accordingly. That means much more testing, and no fudging it with “Well, Dave made some modifications and I trust Dave, therefore it’s good.” This separates your code base into three elements:

  1. Pure human created code
  2. AI Assisted human code
  3. AI Generated Code

The first two place the same accountability on the developer; the risk is that people treat 2 and 3 the same way, but without the rigour that 3 really requires.

Generative Code demands regression testing

By working against the interfaces I’ve generated a lot of regression tests, plus a number that I’ve written myself. As I add new features, the first thing I do is extend the regression tests, which then fail because the current code doesn’t do what I need.

This then enables me to edit the prompts, regenerate the classes, and re-run the tests until everything passes, meaning the code meets both the previous requirements and the new ones. What I’ve found quite often is that you need to be careful with the prompt, otherwise you get the “and do X” new functionality but lose one of the previous pieces of functionality. Having the regression testing framework makes sure this doesn’t happen.
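
As a hedged example of what “extend the tests first” looks like, assuming JUnit 5 and the illustrative repository from earlier (the new requirement and its test are hypothetical):

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.*;

    class CustomerRepositoryRegressionTest {

        private final CustomerRepository repo = CustomerRepositoryFactory.create();

        // New requirement: save() must reject null. Written first, and it fails
        // until the prompt is changed and the class regenerated.
        @Test
        void saveRejectsNull() {
            assertThrows(IllegalArgumentException.class, () -> repo.save(null));
        }

        // Existing requirements stay in place to catch anything the new prompt loses.
        @Test
        void findByIdReturnsSavedCustomer() {
            repo.save(new Customer("42", "Ada"));
            assertEquals("Ada", repo.findById("42").name());
        }

        @Test
        void unknownIdReturnsNull() {
            assertNull(repo.findById("missing"));
        }
    }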

[Figure: my basic workflow. Create new tests, run the tests, modify the prompt, generate new code, then re-run the tests until they pass.]

With my workflow I’m taking time to define the interfaces, then the tests (which I do partly generate), and only then having the LLM generate the class. I’m in control of the prompt, and I do review the generated code before the final submission, because sometimes the generated code passes the tests but doesn’t do it in a smart way.

Generating Regression Tests

Generating tests is something that LLMs do well, particularly if you are using well-described interfaces.
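
By “well described” I mean an interface whose contract is spelled out in the doc comments, which gives the LLM (and the human reviewer) something concrete to test against. The names here are illustrative:

    /** A list of Integers that keeps its contents in ascending order. */
    interface SortedIntegerList {

        /** Inserts the value at its sorted position; duplicates are allowed. */
        void add(int value);

        /** Returns the number of stored values, zero for an empty list. */
        int size();

        /** Returns the contents, smallest first, as an unmodifiable snapshot. */
        java.util.List<Integer> asList();
    }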

[Figure: lots of test cases for a sorted Integer list (ignore the underlines!).]
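
The image isn’t reproduced here, but the tests generated against that interface look roughly like this; these are the kinds of cases an LLM tends to produce, not the exact output from that session, and SortedIntegerListFactory is the same illustrative factory pattern as earlier:

    import org.junit.jupiter.api.Test;
    import java.util.List;
    import static org.junit.jupiter.api.Assertions.*;

    class SortedIntegerListTest {

        // As before, a factory hands back whatever the generated implementation is.
        private final SortedIntegerList list = SortedIntegerListFactory.create();

        @Test
        void emptyListHasSizeZero() {
            assertEquals(0, list.size());
        }

        @Test
        void insertionsComeBackInAscendingOrder() {
            list.add(5); list.add(1); list.add(3);
            assertEquals(List.of(1, 3, 5), list.asList());
        }

        @Test
        void duplicatesAreRetained() {
            list.add(2); list.add(2);
            assertEquals(List.of(2, 2), list.asList());
        }

        @Test
        void boundaryValuesAreHandled() {
            list.add(Integer.MAX_VALUE); list.add(Integer.MIN_VALUE); list.add(0);
            assertEquals(List.of(Integer.MIN_VALUE, 0, Integer.MAX_VALUE), list.asList());
        }
    }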

I’ve really found that LLMs massively help with a TDD approach, whether I’m generating the code or writing it myself. Sometimes the generated tests have had errors, but they have also covered edge conditions that I’d not thought about. By feeding in variations of prompts and getting the LLM to generate data sets for testing, I’m creating a much larger regression testing framework than I’ve historically had.

Regression Testing enables Reinforcement based Generation

A shout out to Tarek Ziadé, who has had the same thought on where this ends up. The next obvious evolution is that the code generator keeps modifying its own prompt dynamically until it creates code that passes all of the tests. So you supply the additional requirement, and a set of tests that prove the requirement has been met, and it then iterates until it has a set of classes that pass those tests.

Putting aside the cost of using generative AI in such a lazy way, and the quality of the code it creates, we can then use our testing framework and an initial prompt to move towards a generative, reinforcement-based approach to code development, driven by testing.

[Figure: reinforcement learning and generative code generation. Similar process to before, but Modify Prompt and Generate New Code are now done by the AI alone; red boxes indicate the AI-controlled part of the process.]

In this approach the failure conditions from the tests would inform the modifications to the prompt, which in turn would generate new code.
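
A rough sketch of that loop is below; CodeGenerator, TestRunner and TestReport are hypothetical stand-ins for whatever LLM client and build tooling you happen to have, not a real API:

    // Hypothetical plumbing: an LLM client and a test runner for generated source.
    interface CodeGenerator {
        String generateCode(String prompt);
        String modifyPrompt(String prompt, String failureSummary);
    }

    interface TestRunner {
        TestReport run(String generatedSource);
    }

    record TestReport(boolean allPassed, String failureSummary) { }

    class ReinforcementLoop {
        String iterate(CodeGenerator llm, TestRunner tests, String initialPrompt, int maxAttempts) {
            String prompt = initialPrompt;
            String code = llm.generateCode(prompt);
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                TestReport report = tests.run(code);   // run the full regression suite
                if (report.allPassed()) {
                    return code;                       // old and new requirements both met
                }
                // The failure conditions drive the prompt modification, which in turn
                // generates new code: the AI-controlled part of the process.
                prompt = llm.modifyPrompt(prompt, report.failureSummary());
                code = llm.generateCode(prompt);
            }
            throw new IllegalStateException("No passing implementation within " + maxAttempts + " attempts");
        }
    }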

Right now I do not think this approach is viable, as it appears to lead to more and more complex prompts and, in turn, extremely complex and specific code. When I’ve looked at using a test failure condition to modify a prompt, it tends to add an additional condition rather than rewording the prompt to emphasize a point. This is to be expected, as it doesn’t have a conceptual model of the solution, just test cases to verify against. The result is increasingly complex code which, while it may meet all the current test conditions, tends to introduce more defects due to its specificity, which requires more test cases and in turn more complexity.

It will be interesting to see whether, in future, such an approach can be used to optimize rather than extend a prompt, achieving a simpler solution that matches all of the test cases.


My job is to make exciting technology dull, because dull means it works. All opinions my own.