Concurrent Terraform installation issue using tfenv with TFENV_AUTO_INSTALL in a Terragrunt environments repository

Issue:

When the TFENV_AUTO_INSTALL environment variable is enabled in a Terragrunt repository, concurrent installations of several different Terraform versions trigger a race condition.

tfenv tries to install many Terraform versions at the same time in parallel pipeline jobs, which leads to permission denied errors.

My code repo:

dev-account01
├── eu-west-1
│   ├── iam_roles
│   │    ├──  .terraform-version
│   │    ├──  main.tf
│   ├── networking
│   │    ├── .terraform-version
│   │    ├── main.tf

Each module pins a different Terraform version, here 1.6.2 and 1.5.5.

PS: in my actual setup I have many more regions and more modules and more accounts.

Error Message:

/home/user/.tfenv/lib/tfenv-exec.sh: line 43:  /home/user/.tfenv/versions/1.6.2/terraform: Permission denied
/home/user/.tfenv/lib/tfenv-exec.sh: line 43: exec: /home/user/.tfenv/versions/1.6.2/terraform: cannot execute: Permission denied

Reproducible Scenario:

  1. Enable TFENV_AUTO_INSTALL in a Terragrunt repo.
  2. Trigger pipeline with multiple jobs/plans that attempt to install many versions of Terraform not previously used.

Expected Behavior:

TFENV_AUTO_INSTALL should handle concurrent installations gracefully or sequentially, avoiding race conditions and permission denied errors.

Or is there any way to serialize the installations of the different terraform versions present in my terraform module in each account?

EDIT:

Example of a solution:

#!/bin/bash

LOCK_FILE="/tmp/tfenv-wrapper.lock"
MAX_CONCURRENT_PROCESSES=1

# Function to acquire a lock
function acquire_lock() {
  while true; do
    exec 202>"$LOCK_FILE"
    flock -n 202 && break
    echo "Another instance of the script is already running. Waiting for it to complete."
    sleep 5
  done
}

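# Function to release the lock
# (the lock file is deliberately left in place: deleting it would let a waiting
# process and a newly started process lock two different files at the same time)
function release_lock() {
  flock -u 202
}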

# Function to check the number of running processes matching the pattern
# (-x makes grep drop only this script's own PID, not every PID that contains it)
function check_tfenv_processes() {
  pgrep -f "tfenv install" | grep -vx "$$" | wc -l
}

# Infinite loop to keep the script running
while true; do
  # Acquire the lock
  acquire_lock

  # Check the number of running processes
  num_processes=$(check_tfenv_processes)

  # If the number of running processes exceeds the limit, wait
  while [ "$num_processes" -ge "$MAX_CONCURRENT_PROCESSES" ]; do
    echo "Maximum number of concurrent 'tfenv install' processes reached. Waiting for processes to complete."
    sleep 5
    num_processes=$(check_tfenv_processes)
  done

  # Your script logic goes here

  # Simulate some work
  echo "Script is running..."

  # Release the lock
  release_lock
done

Current workspaces:

atlantis-git-test-0:/$ ls -l /atlantis-data/repos/orga/infra-test/4
total 24
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 default
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 environments_eks-dev-1_09_eks
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 environments_eks-dev-1_11_r53_zones
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 environments_eks-dev-1_13_irsa
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 environments_eks-dev-1_15_vault
drwx--S---    5 atlantis atlantis      4096 Jan  8 10:00 environments_eks-staging-1_11_r53_zones

Best answer:

You might consider a wrapper script that serializes the Terraform version installations, so that only one tfenv install runs at a time. That avoids the race condition and the resulting permission issues.

Run this script (tfenv_serial_install.sh) before executing Terragrunt commands in your pipeline.

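#!/bin/bash

# Wrapper script for tfenv to install Terraform versions serially

# Lock file to synchronize tfenv installations
LOCK_FILE="/tmp/tfenv-install.lock"

# Function to install one Terraform version while holding an exclusive lock
install_tf_version() {
    local version=$1
    (
        # Block until no other process holds the lock on file descriptor 200
        flock -x 200
        tfenv install "$version"
    ) 200>"$LOCK_FILE"
}

# Extract Terraform versions from .terraform-version files and install them serially
# (-print0 with read -d '' keeps paths that contain spaces intact; the file list
# is read from descriptor 3 so that tfenv cannot consume it through stdin)
while IFS= read -r -d '' -u 3 tf_version_file; do
    version=$(cat "$tf_version_file")
    install_tf_version "$version"
done 3< <(find . -type f -name ".terraform-version" -print0)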
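
When many modules pin the same version, the loop above calls tfenv install once per .terraform-version file. A minimal sketch that collects the distinct versions first, assuming each .terraform-version file holds a single version string on its first line. The function name list_pinned_versions is hypothetical and not part of tfenv:

```shell
# Hypothetical helper: print each distinct version pinned below a directory.
# Assumes the first line of every .terraform-version file is the version string.
list_pinned_versions() {
    find "${1:-.}" -type f -name ".terraform-version" \
        -exec sh -c 'for f; do head -n 1 "$f"; echo; done' sh {} + \
        | tr -d ' \t\r' \
        | grep -v '^$' \
        | sort -u
}

# Possible use with the wrapper above (not executed here):
#   list_pinned_versions . | while IFS= read -r v; do install_tf_version "$v"; done
```

The extra echo after head guards against files without a trailing newline, and the empty lines it produces are filtered out before sort -u removes duplicates.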

"... because whenever Atlantis creates many workspaces, it will trigger many installations at the same time."

In that case, you could include a check for existing Terraform versions before attempting installation (for example, tfenv list | grep -qwF "$version", which matches the version as a whole word and a fixed string rather than as a substring). That should prevent redundant installations and reduce the likelihood of concurrent installation attempts.

tfenv_serial_install.sh would be:

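#!/bin/bash

# Wrapper script for tfenv to install Terraform versions serially and efficiently

# Lock file to synchronize tfenv installations
LOCK_FILE="/tmp/tfenv-install.lock"

# Function to install Terraform version if not already installed
install_tf_version_if_needed() {
    local version=$1
    (
        flock -x 200
        # Check while holding the lock, so a version that another job is still
        # installing under the same lock is not mistaken for a finished one.
        # -w and -F match the version as a whole word and as a fixed string.
        if ! tfenv list | grep -qwF "$version"; then
            tfenv install "$version"
        fi
    ) 200>"$LOCK_FILE"
}

# Extract Terraform versions from .terraform-version files and install them serially if needed
# (the file list is read from descriptor 3 so that tfenv cannot consume it through stdin)
while IFS= read -r -d '' -u 3 tf_version_file; do
    version=$(cat "$tf_version_file")
    install_tf_version_if_needed "$version"
done 3< <(find /atlantis-data/repos -type f -name ".terraform-version" -print0)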
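
A plain grep -q "$version" treats the version as a regular expression and matches substrings, so 1.5.5 would also match an installed 1.5.50. A minimal sketch of a stricter check, assuming the directory layout visible in the error message in the question (<tfenv dir>/versions/<version>/terraform). The function name is_version_installed and its optional second argument are hypothetical, introduced only for this illustration:

```shell
# Hypothetical helper: succeed only if exactly this version has an executable binary.
# Assumes the layout seen in the error message: <tfenv dir>/versions/<version>/terraform.
is_version_installed() {
    local version="$1"
    local tfenv_dir="${2:-$HOME/.tfenv}"
    [ -n "$version" ] && [ -x "$tfenv_dir/versions/$version/terraform" ]
}

# Possible use inside the locked section (not executed here):
#   is_version_installed "$version" || tfenv install "$version"
```

Because the test uses -x, a binary that exists but is not executable is treated as not installed.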

"Can I lock the tfenv process/syscall, and thereby drop the pre-install of all versions and install versions on demand?

This would avoid a long list of versions to install (we have a big setup with approximately 20 versions, and that number can grow over time) and ensure that versions get installed when needed."

Yes, you can modify the approach to lock the tfenv process or system call, enabling on-demand installation of Terraform versions while preventing race conditions.

"I am looking for all versions with the find command and installing them in the pre_workflow_hooks."

Instead of pre-installing all versions, you can modify the approach to only lock and install a specific Terraform version when it is actually required by a job. That way, the installations are truly on-demand, without pre-installing versions that might not be needed.

Adjust the locking script to be used directly within each job that requires a specific Terraform version. The script will check if the required version is already installed and, if not, install it with a lock to prevent race conditions.

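#!/bin/bash

# tfenv_install_with_lock.sh - Ensures on-demand, serialized installation of Terraform versions

LOCK_FILE="/tmp/tfenv-install.lock"
VERSION_TO_INSTALL=$1

if [ -z "$VERSION_TO_INSTALL" ]; then
    echo "Usage: $0 <terraform-version>" >&2
    exit 1
fi

# Function to install the requested version while holding an exclusive lock
install_with_lock() {
    (
        flock -x 200
        # Check after taking the lock: a job that waited for another job's
        # installation of the same version finds it present and skips the install
        if ! tfenv list | grep -qwF "$VERSION_TO_INSTALL"; then
            tfenv install "$VERSION_TO_INSTALL"
        fi
    ) 200>"$LOCK_FILE"
}

# Install the requested version with locking, only if it is missing
install_with_lock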

Modify your Atlantis configuration or pipeline scripts to call this script at the beginning of each job. The script should receive the required Terraform version as a parameter. That makes sure the version is installed only if it is not already available, right before it is needed.

For instance, in an Atlantis job, you would call the script like this:

repos:
  - id: github.com/org/aws-infra
    workflow: terragrunt

workflows:
  terragrunt:
    plan:
      steps:
        # Install the pinned Terraform version (under the lock) before planning
        - run: /path/to/tfenv_install_with_lock.sh "$(cat .terraform-version)"
        # Rest of your configuration

That would avoid the need to scan for all versions beforehand.


"Can you please explain one thing to me: does this take into consideration that the Atlantis + Terragrunt setup can generate many workspaces, i.e. parallel executions?"

The tfenv_install_with_lock.sh script uses a file lock (tfenv-install.lock) to serialize the installation of Terraform versions. When multiple Atlantis workspaces are generated, each attempting to execute Terraform commands, you would get:

  • Lock acquisition: Each workspace/job that needs to install a Terraform version will execute the tfenv_install_with_lock.sh script. The script attempts to acquire a lock on the tfenv-install.lock file.

  • Serialized installation:

    • If the lock is available (meaning no other process is installing Terraform at that moment), the script acquires the lock and proceeds to check if the required Terraform version is already installed.
    • If the required version is not installed, the script performs the installation, then releases the lock.
    • If the lock is not available (another workspace/job is already performing an installation), the script will wait (due to flock -x) until the lock becomes available.
  • Parallel execution management: The lock ensures that, even when multiple workspaces are executed in parallel, any installation of Terraform versions is done sequentially. That prevents the race conditions that could occur if multiple installations were attempted simultaneously.

In addition, each workspace checks for the required Terraform version and only attempts installation if it is not already present. That reduces redundant installations and keeps the process efficient.
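
The behaviour described above can be tried without tfenv. A minimal sketch, assuming flock from util-linux is available. DEMO_DIR, fake_install and ensure_installed are hypothetical stand-ins, and fake_install only simulates a slow installation by creating a marker file. The check runs after the lock is taken, so workers that were waiting see the finished install and skip it:

```shell
#!/bin/bash
DEMO_DIR=$(mktemp -d)
LOCK_FILE="$DEMO_DIR/install.lock"

# Stand-in for "tfenv install": logs the call, sleeps, then marks the version as installed.
fake_install() {
    echo "install $1" >> "$DEMO_DIR/install.log"
    sleep 1
    touch "$DEMO_DIR/$1.installed"
}

# Take the exclusive lock first, then check, then install only if still missing.
ensure_installed() {
    (
        flock -x 9
        [ -e "$DEMO_DIR/$1.installed" ] || fake_install "$1"
    ) 9>"$LOCK_FILE"
}

# Five parallel "workspaces" asking for the same version.
for i in 1 2 3 4 5; do
    ensure_installed 1.6.2 &
done
wait

# Count how many times the installer actually ran.
install_count=$(( $(wc -l < "$DEMO_DIR/install.log") ))
echo "installations performed: $install_count"
```

With the check inside the locked section, only the first worker runs the installer. The other four block on flock, find the marker once they get the lock, and return without installing.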

The script is designed to be integrated into the Atlantis workflow (pre_plan or similar stages).
The setup assumes that each workspace/job can independently execute the script as part of its initialization or planning phase.


"My main concern is whether this definitely does one installation per plan; if so, I am OK with it. To me, it looks like one installation per workspace."

The script tfenv_install_with_lock.sh is designed to manage Terraform version installations in a way that avoids conflicts when multiple workspaces are operating concurrently.
But it is important to understand the distinction between installations per workspace and installations per plan:

  • One installation per plan: The ideal scenario is to have Terraform versions installed only once per plan execution, regardless of the number of workspaces. That ensures efficiency and reduces redundant installations.

  • One installation per workspace: That scenario implies that each workspace, when initiating a plan, might attempt to install the Terraform version it requires. While the locking mechanism prevents simultaneous installations, it does not inherently reduce the total number of installations if each workspace separately determines the need for installation.

Given your setup and concern, the key is to make sure Terraform versions are installed only as needed for each plan, not redundantly across workspaces.
You might consider:

  • Centralized version management: Implement a mechanism to manage Terraform versions centrally before workspaces initiate their plans. That can be a script or process that runs once at the start of your pipeline and ensures all required Terraform versions are installed. That approach ensures one installation per plan.

  • Refined workspace-level installation: Modify the tfenv_install_with_lock.sh script to better track which versions have been installed during the current pipeline run. That could involve creating a record of installed versions and checking against this record before attempting an installation. That approach aims to reduce redundant installations at the workspace level.
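
The record-based idea in the last bullet could look like the following. This is a minimal sketch under stated assumptions, not a tested Atlantis integration: RECORD_FILE, INSTALL_CMD and install_once_per_run are hypothetical names, flock from util-linux is assumed, and the record file is assumed to be deleted by a step that runs once at the start of each pipeline run.

```shell
# Hypothetical per-run record: one line per version already handled in this pipeline run.
RECORD_FILE=${RECORD_FILE:-/tmp/tfenv-installed-versions.txt}
LOCK_FILE=${LOCK_FILE:-/tmp/tfenv-install.lock}
# Left unquoted on use so that it can hold a command plus arguments.
INSTALL_CMD=${INSTALL_CMD:-tfenv install}

install_once_per_run() {
    version=$1
    (
        flock -x 9
        # -x matches the whole line, -F treats the version as a fixed string.
        if ! grep -qxF "$version" "$RECORD_FILE" 2>/dev/null; then
            # Record the version only when the install command succeeded.
            $INSTALL_CMD "$version" && echo "$version" >> "$RECORD_FILE"
        fi
    ) 9>"$LOCK_FILE"
}
```

Each workspace would call install_once_per_run with its pinned version. Both the lookup and the append happen under the lock, so two workspaces that need the same version cannot both decide to install it.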