Optimizing CI/CD pipeline speed and reducing the failure ratio in a monorepo with multiple E2E tests


Context:

We manage a monorepo where multiple teams contribute code. To ensure that commits do not disrupt the functionality of other teams, we employ a combination of unit tests and E2E tests (written in TestCafe), with TypeScript as our primary language. The CI/CD pipeline is triggered whenever someone pushes a commit to an MR/PR, and it includes various steps such as:

  1. Basic checks on MR titles and simple sanity checks.

  2. Running the build job.

  3. Executing all E2E and unit test cases.

In our E2E tests, we validate the end-to-end flow, covering tasks such as creating a user through the test-environment APIs, logging in, programmatically performing the multiple steps an end user would take, and thoroughly verifying that everything works. (Note: every step is performed through the UI only; we don't hit the backend directly.)

Problem We're Facing:

The challenge we're currently facing is the time it takes for MRs/PRs to become merge-ready. With around 7-8 E2E test cases running simultaneously (on average; the number of tests changes based on the files an MR touches), the pipeline duration is prolonged. Any test failure in the pipeline requires additional commits and thus more waiting time.

Moreover, when multiple teams execute load tests, our test-environment APIs break down, causing test-case failures and blocking the deployment of changes across various teams.

This not only consumes a significant amount of time but also disrupts project timelines and wastes developers' time.

(Note - we do the entire testing on test environment APIs)

What We've Tried So Far:

In our efforts to address the issue, we've experimented with various strategies:

  1. We attempted to run tests concurrently by executing all 7-8 tests simultaneously on different AWS instances.

  2. We also implemented a retry mechanism in the pipeline for tests that fail.
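A retry mechanism like the one in point 2 can be implemented as a small generic wrapper around each test run. This is only a sketch, assuming each run is exposed as an async function; `withRetry` and its parameters are illustrative names, not part of the original pipeline:

```typescript
// Hypothetical retry wrapper: re-runs a failing async task up to maxAttempts
// times, with a linearly growing delay between attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Back off before the next attempt so transient failures
        // (e.g. an overloaded test environment) have time to clear.
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

Note that retries only help against transient failures; genuinely flaky tests should be fixed or quarantined, or the retry just hides the problem while making the pipeline slower.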

I'm seeking advice on the best way to optimize the CI/CD pipeline to improve speed, reduce the failure ratio, and ensure stability. Any suggestions, best practices, or tools that could help streamline the process would be highly appreciated.


There are 2 best solutions below

grzegorzgrzegorz

I think you could introduce a preliminary job for feature branches. The improved workflow could look like this:

  • the dev creates a feature branch
  • every commit starts a feature-branch job with a build and selective tests (or perhaps this job runs only once, as the last prerequisite before merging)
  • after it completes successfully, the code is merged (if not, the dev fixes it)
  • the merge starts the main pipeline (the build and all the tests you describe, except for load tests)
  • load tests run in a separate job, at night only

The feature-branch job doesn't run all the tests, only the ones related to the change made by the developer. My personal approach is to run all unit tests in the module touched by the feature and in every module that uses it, but that is not set in stone. Likewise, only the feature-related E2E tests should run. This mapping is not so easy to build, but I simply use a text file saying that touching moduleA should trigger E2Etest1, E2Etest5, and so on. It's fine if this is only roughly correct, as it is just a general check before the main pipeline starts. The main pipeline still runs all the tests, so if something is wrong with distant feature dependencies, or the safety net of the preliminary job is not dense enough, the defect will not reach production anyway. As I said, I wouldn't run load tests with every commit, but only once a day.
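The text-file mapping described above can be sketched as a small selection step that turns an MR's changed files into the E2E tests to run. The module names, test names, and the `src/<module>/` path convention here are all assumptions for illustration:

```typescript
// Hypothetical mapping, equivalent to a text file saying
// "touching moduleA should trigger E2Etest1 and E2Etest5".
const moduleToTests: Record<string, string[]> = {
  moduleA: ["E2Etest1", "E2Etest5"],
  moduleB: ["E2Etest2"],
};

// Pick the E2E tests to run for a feature branch, given the files it touches.
// Unmapped files select nothing; the main pipeline still runs everything.
function selectTests(changedFiles: string[]): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const [module, tests] of Object.entries(moduleToTests)) {
      if (file.startsWith(`src/${module}/`)) {
        tests.forEach((t) => selected.add(t));
      }
    }
  }
  return [...selected].sort();
}
```

The list of changed files could come from something like `git diff --name-only` against the target branch; the mapping only needs to be roughly right, since the main pipeline remains the real safety net.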

This brings numerous advantages:

  • the preliminary job reduces the number of failures in the main pipeline and improves its stability without compromising QA
  • decreasing the frequency of load tests speeds the main pipeline up

Further improvements are possible like:

  • running unit tests and E2E tests concurrently
  • running most of the unit tests concurrently with each other (my approach: all main modules are tested concurrently)
  • identifying very heavy E2E tests and running them only once a day, just like the load tests
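The concurrency ideas above can be sketched as a simple fan-out, assuming each suite is exposed as an async function (the suite names and the `SuiteResult` shape are made up for illustration):

```typescript
type SuiteResult = { name: string; passed: boolean };

// Start every suite at once and gather pass/fail results.
// A failing suite is recorded rather than aborting the whole run,
// so one red suite doesn't hide the results of the others.
async function runSuitesConcurrently(
  suites: Record<string, () => Promise<void>>,
): Promise<SuiteResult[]> {
  const runs = Object.entries(suites).map(async ([name, run]) => {
    try {
      await run();
      return { name, passed: true };
    } catch {
      return { name, passed: false };
    }
  });
  // Promise.all starts all suites immediately, so total wall time
  // approaches the slowest suite rather than the sum of all of them.
  return Promise.all(runs);
}
```

In a real pipeline the same shape applies at the job level: independent jobs (unit tests, E2E tests) declared without dependencies on each other run in parallel, and the pipeline is only as slow as the longest one.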
OPSM

@grzegorzgrzegorz has made some great points. I can add my 2 cents here, having experienced similar situations.

When test runs become unwieldy, I tend to look more at the scope and context of the tests themselves. As your systems grow, this issue will only get worse, so it needs to be addressed ASAP. In my experience, I would try some of the following:

  1. Consider that your E2E tests are your limiting factor. Look at whether every scenario really needs to be tested in thorough detail in an E2E format. I would always recommend E2E for the happy path and main user flows, plus some critical negative-path scenarios if there are resources for it. That is all.

  2. Your E2E tests should share state across tests. For example, creating a user and logging in is not the test goal, so you can abstract that into a fixture/setup step that is shared across tests, meaning it's done only once per run. Playwright is one tool I recommend for this type of thing. Your login can be tested elsewhere in other tests, or as a standalone entity.

  3. Invest time in creating more functional-style tests that work in a microservice-esque way, covering the granular, scenario-based cases. Those cases are not a good fit for E2E tests when resources are constrained; E2E tests are the most costly tests in terms of time.

  4. I know this was already mentioned, but I must reiterate it: you should not run every test every time. That will never scale in a large system. Each team works on a certain area of code, so only the tests that deal with that area need to run. This can be tracked very easily through test annotations, perhaps even with pipelines that run different test groupings. There is never a need to run all E2E tests every time, as the same coverage is reached elsewhere.

  5. If your load tests break everything, you need to resolve that ASAP. Either break them out into separate pipelines that run staggered over a time when the APIs are otherwise unused, such as early morning, or increase the resources of your test APIs. Testability is a cornerstone of the SDLC; if you are testing against brittle APIs that break every time, you are not getting any coverage or benefit from this. So why keep it in the current state?
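The shared-state idea in point 2 can be sketched as a cached, run-level setup step, so expensive work like user creation happens once per run instead of once per test. `createTestUser` here is a hypothetical stand-in for a call to your test-environment API:

```typescript
type TestUser = { id: string; email: string };

// Run-level cache for the shared test user.
let cachedUser: Promise<TestUser> | null = null;

// Returns the shared user, creating it only on the first call.
// We cache the *promise*, not the resolved value, so even tests that
// start concurrently share a single API call instead of racing.
async function getSharedUser(
  createTestUser: () => Promise<TestUser>,
): Promise<TestUser> {
  if (!cachedUser) cachedUser = createTestUser();
  return cachedUser;
}
```

In TestCafe or Playwright the same pattern usually lives in a fixture-level setup hook; the point is simply that user creation and login stop being part of every test body.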

I have many more points to add, but to save writing a novel, I will leave it here. Feel free to comment if you want more! :)
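The annotation-based selection from point 4 could look like the sketch below: each test declares the area it covers, and the pipeline runs only the areas an MR touches. The area names and the `AnnotatedTest` record are illustrative assumptions, not an existing API:

```typescript
// A test annotated with the product areas it covers.
type AnnotatedTest = { name: string; areas: string[] };

// Keep only the tests whose annotations intersect the touched areas.
function testsForAreas(
  tests: AnnotatedTest[],
  touchedAreas: string[],
): AnnotatedTest[] {
  const touched = new Set(touchedAreas);
  return tests.filter((t) => t.areas.some((a) => touched.has(a)));
}
```

Test runners typically support this natively, e.g. metadata/tag filters on the command line, so the filtering itself rarely needs to be hand-rolled; the valuable part is keeping the annotations accurate.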