Context:
We manage a monorepo to which multiple teams contribute code. To ensure that one team's commits do not break another team's functionality, we use a combination of unit tests and E2E tests (written in TestCafe), with TypeScript as our primary language. The CI/CD pipeline is triggered whenever someone pushes a commit to an MR/PR, and it includes steps such as:
Basic checks on MR titles and simple sanity checks.
Running the build job.
Executing all E2E and unit test cases.
In our E2E tests, we validate the end-to-end flow: creating a user through test environment APIs, logging in, programmatically performing the steps an end user would perform, and verifying that everything works as expected. (Note - Every step is done through the UI only; we don't hit the backend directly.)
Problem We're Facing:
The challenge we're currently facing is the time it takes for MRs/PRs to become merge-ready. With around 7-8 E2E test cases running simultaneously (on average; the number of tests varies with the files an MR touches), the pipeline takes a long time. Any test failure in the pipeline requires additional commits and thus more waiting time.
Moreover, load tests run by multiple teams break our test environment APIs, which causes test case failures and blocks the deployment of changes across various teams.
This not only consumes a significant amount of time but also disrupts project timelines and wastes developers' time.
(Note - we do the entire testing on test environment APIs)
What We've Tried So Far:
In our efforts to address the issue, we've experimented with several strategies:
We attempted to run tests concurrently by executing all 7-8 tests simultaneously on different AWS instances.
Additionally, we implemented a retry mechanism in the pipeline for tests that fail.
I'm seeking advice on the best way to optimize the CI/CD pipeline to improve speed, reduce the failure rate, and ensure stability. Any suggestions, best practices, or tools that can help streamline the process would be highly appreciated.
I think you could introduce a preliminary job for feature branches. Such an improved workflow could look like this:
The feature branch job doesn't run all the tests, only the ones related to the change made by the developer. Personally, I run all unit tests in the module touched by the feature and in all modules that use it, but this is not set in stone. Likewise, only feature-related E2E tests should be run. This is not so easy to map, so I just use a text file saying that touching moduleA should trigger E2Etest1 and E2Etest5, and so on. It is fine for this mapping to be only more or less correct, as this is just a general check before the main pipeline starts. The main pipeline runs all the tests, so if anything is wrong with distant feature dependencies, or the safety net in the preliminary job is simply not dense enough, the defect will not reach production anyway. As I said, I wouldn't run load tests with every commit, only once a day.
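The module-to-test mapping described above can be sketched in TypeScript. This is only a minimal illustration, not the exact setup I use: the `packages/...` path prefixes, the `e2eTestMap` object and the `selectE2eTests` helper are all hypothetical names, and the list of changed files is assumed to come from something like `git diff --name-only` against the target branch.

```typescript
// Hypothetical mapping (assumption): path prefix of a module -> E2E tests
// that the preliminary feature-branch job should run when it is touched.
type TestMap = Record<string, string[]>;

const e2eTestMap: TestMap = {
  "packages/moduleA/": ["E2Etest1", "E2Etest5"],
  "packages/moduleB/": ["E2Etest2"],
};

// Returns the de-duplicated, sorted list of E2E tests mapped to the changed files.
// Files that match no prefix select nothing; the main pipeline still runs everything.
function selectE2eTests(changedFiles: string[], testMap: TestMap): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const prefix of Object.keys(testMap)) {
      if (file.startsWith(prefix)) {
        testMap[prefix].forEach((test) => selected.add(test));
      }
    }
  }
  return Array.from(selected).sort();
}

// Example: the changed-file list is hard-coded here; in CI it would be
// produced by the VCS (assumption).
const changedFiles = ["packages/moduleA/src/login.ts", "README.md"];
console.log(selectE2eTests(changedFiles, e2eTestMap).join(" "));
```

The selected names can then be handed to whatever command starts the E2E runner in the preliminary job. Keeping the map in a plain data structure (or a text/JSON file loaded into one) makes it cheap to adjust when the mapping turns out to be too loose or too strict.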
This brings numerous advantages:
Further improvements are possible like: