Generating 17,000 lines of working test code in less than an hour
Welcome to the future of software development! The agents are coming, and it's a good thing, especially for busy developers who want to automate the less enjoyable tasks, like writing unit tests, and reduce the rippling cost of bugs across the SDLC.
Imagine a world where AI agents handle the tedious parts of coding, allowing you to focus on what you love most - innovating and creating. In this blog post, we explore how Early's AI tool, powered by OpenAI's GPT-4o, can transform unit test generation.
This brings up important questions: How effective can these generated tests truly be? How can we measure their quality? How can we benchmark them against different quality metrics? In the sections that follow, we'll dive into the metrics and methods we used at Early to measure and benchmark the quality of these tests.
Why LLMs for unit test generation?
When it comes to automatically generating volumes of quality, working unit tests for regression testing purposes, large language models (LLMs) can be an exceptionally powerful engine that saves you a lot of time and headaches. LLMs can help automate test creation and ensure more comprehensive tests, at scale.
That being said, it is also important to remember that the impact and value of LLMs change over time, depending on aspects such as how much the model was trained for a particular use case and how it is being used (prompts, data, validation, tuning, etc.).
The GPT-4o showdown: Benchmarking GPT-4o for unit testing
To uncover the strengths and weaknesses of OpenAI's model in the domain of unit test generation, we ran a series of tests with Early's product and technology, our own cutting-edge engine, and evaluated the results. We focused specifically on its capabilities in generating unit tests for a complex open-source project called ts-morph.
Note: we are also running the tests on our own backend, and seeing exceptional results.
Test quality criteria
Before diving into the results, it's important to understand two key metrics we used: code coverage and mutation scores. These metrics are crucial for evaluating test quality, and their interplay provides a comprehensive view of our tests’ effectiveness. Let's start with code coverage.
Code coverage
One of the most widely accepted methods for measuring test quality is code coverage. Code coverage measures what percentage of the code is exercised by tests. If I have zero coverage, clearly my tests either don't exist or are useless, leaving my code with bugs and issues.
However, code coverage is an insufficient measurement on its own. While low code coverage is indicative of poor testing, high coverage does not necessarily indicate high quality. Even with 100% coverage, the quality of the tests might be low, for example if they don't exercise enough different input cases. Just as in the low-coverage example, the code can still have bugs and issues.
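To make this concrete, here is a minimal, hypothetical TypeScript example (the applyDiscount function and its test are invented for illustration, not taken from ts-morph): the single test below executes every line and branch, so coverage reports 100%, yet it never checks the boundary where the behavior is most likely to break.

```typescript
// discount.ts -- a hypothetical module used only to illustrate the point
// Orders of 100 or more get a flat discount of 10.
export function applyDiscount(total: number): number {
  if (total >= 100) {
    return total - 10;
  }
  return total;
}

// discount.test.ts
import { applyDiscount } from "./discount";

test("applies the discount to large orders", () => {
  // Both branches are executed, so line and branch coverage report 100%...
  expect(applyDiscount(200)).toBe(190);
  expect(applyDiscount(50)).toBe(50);
  // ...but the boundary (total === 100) is never asserted, so an off-by-one
  // bug such as ">" instead of ">=" would slip through despite full coverage.
});
```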
Mutation testing
Mutation testing is a software testing technique used to evaluate the quality and effectiveness of the tests themselves. The process involves introducing small changes or "mutations" to a program's source code to create modified versions of the program, known as "mutants." The primary objective of mutation testing is to assess whether the existing test suite can detect and fail these mutants, indicating that the test suite is thorough and robust.
In our benchmark, we used Stryker, a mutation testing framework for JavaScript, TypeScript and more.
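For reference, a minimal Stryker configuration for a Jest + TypeScript project looks roughly like the sketch below. This is a generic example, not the exact configuration we used for ts-morph; it assumes the Stryker Jest runner plugin is installed, and the file name and globs are illustrative.

```typescript
// stryker.config.mjs -- a minimal sketch, assuming Jest and TypeScript sources
export default {
  testRunner: "jest",                            // run the existing Jest suite against each mutant
  mutate: ["src/**/*.ts", "!src/**/*.test.ts"],  // which files to mutate (tests excluded)
  coverageAnalysis: "perTest",                   // only rerun the tests that cover the mutated code
  reporters: ["clear-text", "html"],             // console summary plus a browsable HTML report
};
```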
How mutation testing works:
- Generating mutants: The first step in mutation testing is to create multiple versions of the original program, each with a slight modification. These modified versions are known as mutants. Common types of mutations include:
  - Changing a logical operator (e.g., replacing && with ||).
  - Modifying a mathematical operator (e.g., replacing + with -).
  - Altering a constant value.
  - Changing a conditional boundary.
- Running tests on mutants: Each mutant is tested using the existing test suite. The purpose is to check whether the tests can detect the changes (i.e., cause the tests to fail).
- Analyzing results: After running the tests, the outcomes are analyzed:
  - Killed mutants: If a test fails due to the mutation, the mutant is considered "killed," indicating that the test suite is effective in detecting that type of fault.
  - Survived mutants: If the tests pass despite the mutation, the mutant is considered "survived," indicating that the test suite did not detect the fault.
  - There are other types of "not killed" mutants, like no coverage, timeouts, and errors.
- Calculating the mutation score: The mutation score is the percentage of mutants that the test suite killed: Mutation score = (killed mutants / total mutants) × 100.
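Continuing the hypothetical applyDiscount example from the coverage section (again, invented for illustration rather than taken from the actual Stryker run), here is how a single conditional-boundary mutant plays out:

```typescript
// Original line in discount.ts:   if (total >= 100) { ... }
// Conditional-boundary mutant:    if (total >  100) { ... }
//
// The coverage-only test from earlier (inputs 200 and 50) still passes against
// this mutant, so the mutant SURVIVES. Adding a boundary test kills it:
import { applyDiscount } from "./discount";

test("applies the discount exactly at the threshold", () => {
  // Passes on the original code (100 - 10 = 90) but fails on the mutant,
  // which would return 100 unchanged -- so the mutant is counted as KILLED.
  expect(applyDiscount(100)).toBe(90);
});
```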
The relationship between code coverage and mutation scores
Complementary metrics:
- Code coverage and mutation score complement each other. Code coverage ensures that the tests exercise the code, while mutation score ensures that the tests are effective in detecting faults within the exercised code.
Correlation:
- Generally, higher code coverage can lead to a higher mutation score, as more parts of the code are being tested. However, this is not always the case. It's possible to have high code coverage but a low mutation score if the tests are not thorough or effective in catching faults.
Quality assessment:
- A balanced approach using both metrics provides a more comprehensive assessment of test suite quality. High code coverage with a high mutation score indicates that the tests are both extensive and effective. Conversely, high coverage with a low mutation score suggests that the tests need improvement in fault detection, likely by adding more edge-case tests.
Optimization feedback:
- Code coverage can highlight areas of the code that need more tests, while mutation score can highlight the need for more robust and fault-detecting tests in already covered areas.
OK, let’s get to the results
Setup and testing environment
To conduct our benchmark, we used a popular OSS project, ts-morph, and the latest Early product at the time (EarlyAI extension version 0.4.23).
Test project: ts-morph
ts-morph is an open-source project that provides a powerful and user-friendly API for working with the TypeScript Compiler API. It is designed to simplify the process of interacting with TypeScript code, enabling developers to create, manipulate and analyze TypeScript code programmatically.
- GitHub stars: 4,600
- GitHub forks: 189 (now one more)
- Contributors: 58
- Total commits: 2,297
- License: MIT License
- Language: TypeScript
- Repository: https://github.com/dsherret/ts-morph
- Clone date: June 2024
- Tested package: packages/common/src
- Lines of app code (packages/common/src): 4,937 LoC
Setup:
- The original tests were removed from the project’s code to mimic a clean slate. This also allows evaluating the generated unit tests in isolation from other forms of tests.
- Set up the Jest test framework (ts-morph's default is Mocha); a minimal config sketch appears below.
- Tests were generated only for packages/common/src
- Used Early extension version 0.4.23 from July 11, 2024
- Attempted to generate unit tests for the 210 public methods in this project
The specific model we used for this benchmark was gpt-4o-2024-05-13.
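For context, swapping a TypeScript package over to Jest needs little more than a small config file along the lines of the sketch below. This is a generic ts-jest setup, not the exact configuration used in the benchmark, and the roots path is only indicative of the tested package.

```typescript
// jest.config.ts -- a minimal, generic sketch (not the benchmark's exact config)
import type { Config } from "jest";

const config: Config = {
  preset: "ts-jest",                         // compile TypeScript test files on the fly
  testEnvironment: "node",                   // ts-morph's common package needs no DOM
  roots: ["<rootDir>/packages/common/src"],  // limit test discovery to the tested package
  collectCoverage: true,                     // report code coverage with every run
};

export default config;
```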
KPIs
- General data like the number of green unit tests, red unit tests, lines of code, time to generate and more.
- Code coverage (all files)
- Mutation scores for public methods (3 groups):
  - Mutation score (all methods)
  - Mutation score (only for methods that have unit tests)
  - Mutation score for 100% coverage. We scored all the methods that have unit tests AND whose code is 100% covered by these tests (indicative of quality tests)
- Note: Mutation scores are calculated at the public method level with an in-house tool that we built. We plan to open source that tool when ready
- Scope coverage ratio:
  Definition: Measures how successful we are in generating quality unit tests for the public methods. Quality unit tests are defined as tests that achieve 100% coverage of their respective methods. The scope coverage ratio is defined as:
  Scope coverage ratio = (public methods with quality unit tests / total public methods) × 100
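As an illustrative calculation (the numbers here are hypothetical, chosen only to show how the ratio works): if quality unit tests were produced for 140 of 210 public methods, the scope coverage ratio would be 140 / 210 ≈ 67%.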
Benchmark results
One of the initial KPIs is coverage: how well the code is covered by the tests. As we can see, with GPT-4o the coverage is 60%.
In addition, we can see three levels of the mutation scores:
- Mutation score for all methods (a vanilla run of the Stryker tool) – 44%
- Mutation score only for methods that have generated unit tests – 81%
- Mutation score only for methods with generated unit tests that have 100% code coverage – 92%
We can see that coverage is quite high at 60%, but more importantly, the quality of the tests, when generated, is very high: an 81% mutation score for all methods with unit tests, and 92% for the methods whose unit tests cover 100% of their respective methods.
When generating unit tests automatically, we found a strong correlation between high coverage and high mutation scores. Meaning that when test generation is successful and produces high coverage, the unit tests are of high quality.
To understand how effective these tests are, we looked at how successful we were in generating tests for all the public methods in the project. To do so, we introduced the scope coverage ratio, as defined earlier.
The project had 210 public methods. Here are the scope-coverage calculations:
Let’s look at more supporting data that could explain these results. Specifically, we’ll look at the number of green and red unit tests.
For the purpose of running Stryker we had to skip all red tests, as Stryker requires only green tests to run. We decided to keep these tests, as some of them could be valuable and reveal bugs. We will explore this in a different blog post.
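In practice, a red Jest test can be excluded from the Stryker run simply by marking it as skipped; the sketch below uses an invented test name, but the mechanism is standard Jest.

```typescript
// Marking a red (failing) test as skipped so that only green tests run under Stryker.
// The test body stays in the repo so it can be revisited later.
it.skip("resolves module specifiers with custom compiler options", () => {
  // ...the originally generated assertions remain here unchanged...
});
```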
Code Metrics
Although counting LoC (lines of code) is a questionable metric for quality and success, in this case the code represents mocks, happy paths, edge cases, and complex logic in the generated tests for the 176 methods for which unit tests were generated.
No less impressive is the time it took to generate these tests: only 51 minutes, with an average LLM response time of 31 seconds. Note that we ran the benchmark during weekdays; it would most likely be faster if run over the weekend.
Summary results for the EarlyAI extension using GPT-4o
- 965 green unit tests
- 259 red unit tests (which we had to mark as skipped so Stryker could run)
- 60% coverage
- Mutation scores:
  - 44% - All methods
  - 81% - Methods with unit tests
  - 92% - Methods with unit tests and coverage of 100% for the respective methods
- 68% Scope coverage - meaning 68% of the total methods have unit tests that cover all the code they are testing.
- Run statistics:
  - Total run: 51.6 minutes
  - Average LLM response time: 31 seconds
  - Max response time: 89 seconds
- Generated test code (17,248 LoC):
  - Comment lines: 2,178
  - Test code lines: 15,070
To conclude, Early’s product together with GPT-4o demonstrated significant potential in automating unit test generation, achieving impressive code coverage and high-quality mutation scores. We invite you to explore the full set of tests generated on our cloned ts-morph repo, available on Early's public GitHub.
Install Early’s VSCode Extension and see how AI can revolutionize your development workflow!
Sharon Barr
Co-founder and CEO