Deleting empty files in tar.gz file in bash


I have a tar.gz file that contains .yang files, some of which are empty. I want to delete only those empty files from inside the tar.gz file. Currently I am using:

for f in *.tar.gz
 do
    echo "Processing file $f"
    gzip -d "$f"
    find $PWD -size  0 -print -delete
    gzip -9 "${f%.*}"
    echo "******************************************"
done

But this is not working, perhaps because I am working on a tar.gz file rather than inside a directory.

Is there another way to do this?

Renaud Pacalet

Your find command doesn't do anything useful to your tarballs because it searches and deletes in the current directory, not inside the tarballs.

So we first need to unpack the tarball (tar -xf), delete the empty files (find), and repack (tar -czf). As a safety measure we will work in temporary directories (mktemp -d) and create new tarballs (*.tar.gz.new) instead of overwriting the old ones. Since you want to delete only the empty .yang files, we will also use a few more find options. The following is for GNU tar; adapt it to your own tar version (or install GNU tar). Before using it, read what comes next, just in case...

for f in *.tar.gz; do
    echo "Processing file $f"
    d="$(mktemp -d)"
    tar -xf "$f" -C "$d"
    find "$d" -type f -name '*.yang' -size 0 -print -delete
    tar -C "$d" -czf "$f.new" .
    rm -rf "$d"
    echo "******************************************"
done

But what you want is more complex than it seems, because your tarballs could contain files with metadata (owner, permissions...) that you are not allowed to set. If you run the preceding script as a regular user, tar will silently change the ownership and permissions of such files and directories when unpacking, so the repacked tarball will carry modified metadata. If this is a problem and you absolutely want to preserve the metadata, there are basically two options:

  1. Pretend you are root with fakeroot or an equivalent.
  2. Delete the files inside the tarballs without unpacking.
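
Whether this matters depends on what the tarballs actually record. As a rough check, assuming GNU tar's verbose listing format (owner/group in the second column), the hypothetical helper archive_owners below prints the distinct numeric owner/group pairs found in a tarball. If the only pair is your own (id -u and id -g), unpacking as a regular user should at least not change ownership.

```shell
# Hypothetical helper: print the distinct uid/gid pairs recorded in a
# gzip-compressed tarball. Assumes GNU tar, where field 2 of "tar -tv"
# is owner/group and --numeric-owner shows numeric ids instead of names.
archive_owners() {
    tar --numeric-owner -tvzf "$1" | awk '{print $2}' | sort -u
}

# Usage (foo.tar.gz is a placeholder name):
# archive_owners foo.tar.gz
# echo "me: $(id -u)/$(id -g)"
```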

To use fakeroot just run the above bash script inside a fakeroot environment:

$ fakeroot
# for f in *.tar.gz; do
# ...
# done
# exit
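
The same thing can be done without an interactive session. The following is only a sketch: with_fakeroot is a hypothetical wrapper, and clean-tarballs.sh is an assumed name for a file containing the loop. It passes the command to fakeroot when fakeroot is installed and otherwise warns and runs it directly.

```shell
# Hypothetical wrapper: run a command under fakeroot when it is installed,
# otherwise warn and run it directly (ownership is then not preserved).
with_fakeroot() {
    if command -v fakeroot >/dev/null 2>&1; then
        fakeroot -- "$@"
    else
        echo "fakeroot not found, running without it" >&2
        "$@"
    fi
}

# Usage, assuming the loop is saved as clean-tarballs.sh (hypothetical name):
# with_fakeroot bash clean-tarballs.sh
```

Unpacking and repacking must happen inside the same fakeroot invocation, because the faked ownership is forgotten when fakeroot exits.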

The second solution (in-place tarball editing) uses GNU tar and GNU awk:

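for f in *.tar.gz; do
    echo "Processing file $f"
    t="${f%.*}"              # foo.tar.gz -> foo.tar
    gzip -cd "$f" > "$t"     # tar --delete needs an uncompressed archive
    # Keep empty regular .yang members, print their names NUL-separated,
    # and delete them one at a time (-r: do nothing if there are none).
    tar -tvf "$t" | awk -v ORS='\0' '/^-.*\.yang$/ && $3==0 {
      match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
      xargs -0 -r -n1 tar -f "$t" --delete
    gzip -c9 "$t" > "$f.new"
    echo "******************************************"
done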

Explanations:

We use the GNU tar --delete option to delete files directly inside the tarball, without unpacking it, which is probably more elegant (even if it is also probably slower than a fakeroot-based solution).
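
As a minimal illustration of --delete on its own, here is a sketch that builds a throwaway tarball in a temporary directory, removes one member, and lists what is left. It assumes GNU tar; demo_delete and the file names are made up for the example.

```shell
# Hypothetical demo: create a small uncompressed tarball, delete one member
# with GNU tar's --delete, and list the remaining members.
demo_delete() (
    d=$(mktemp -d) || exit 1
    cd "$d" || exit 1
    : > empty.yang                 # empty file
    echo data > nonempty.yang      # non-empty file
    tar -cf demo.tar ./empty.yang ./nonempty.yang
    tar -f demo.tar --delete ./empty.yang
    tar -tf demo.tar               # lists ./nonempty.yang only
    cd / && rm -rf "$d"
)
```

The member name given to --delete has to match the name stored in the tarball, including the leading ./ when there is one.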

Let's first find all empty files in the tarball:

$ tar -tvf foo.tar
drwx------ john/users        0 2021-10-18 14:26 ./
drwx------ john/users        0 2021-10-18 16:34 ./
-rw------- john/users        0 2021-10-18 16:34 ./nonyang
drwx------ john/users        0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users        0 2021-10-18 16:01 ./empty.yang
-rw------- john/users        7 2021-10-18 15:22 ./nonempty.yang
-rw------- john/users        0 2021-10-18 16:01 ./filename with spaces.yang

As you can see, the size is in the third column. Directory names have a leading d and a trailing /. Symbolic links have a leading l. So by keeping only lines that start with - and end with .yang we eliminate both. GNU awk can apply this filter and the size test at once:

$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {print}'
-rw------- john/users        0 2021-10-18 16:01 ./empty.yang
-rw------- john/users        0 2021-10-18 16:01 ./filename with spaces.yang

This is more than what we want, so let's print only the name part. We first measure the length of the first 5 fields, including the spaces, with the match function (which sets a variable named RLENGTH) and then remove them with substr:

$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {
    match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}'
./empty.yang
./filename with spaces.yang

We could try to optimize a bit by calling match only on the first line but I am not 100% sure that all output lines are perfectly aligned, so let's call it on each line.

We are almost done: just pass these names to tar -f foo.tar --delete <filename>, one name at a time. xargs can do this for us, but there is one last trick: as file names can contain spaces, we must use another separator, something that cannot appear in file names, like the NUL character (ASCII code 0). Fortunately GNU awk can use NUL as its Output Record Separator (ORS), and xargs has the -0 option to use it as input separator. So, let's put all this together:

$ tar -tvf foo.tar | awk -v ORS='\0' '/^-.*\.yang$/ && $3==0 {
    match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
    xargs -0 -r -n1 tar -f foo.tar --delete
$ tar -tvf foo.tar
drwx------ john/users        0 2021-10-18 16:34 ./
-rw------- john/users        0 2021-10-18 16:34 ./nonyang
drwx------ john/users        0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users        7 2021-10-18 15:22 ./nonempty.yang
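
To see why the NUL separator matters, here is a minimal sketch that feeds the two sample names to xargs both ways. The function names are hypothetical, and printf stands in for the tar --delete call so that nothing is modified.

```shell
# Names separated by NUL and read with -0: each name arrives intact.
names_nul() {
    printf '%s\0' './empty.yang' './filename with spaces.yang' |
        xargs -0 -n1 printf '[%s]\n'
}

# Names separated by newlines and read without -0: xargs also splits on spaces.
names_plain() {
    printf '%s\n' './empty.yang' './filename with spaces.yang' |
        xargs -n1 printf '[%s]\n'
}
```

names_nul prints two bracketed names, while names_plain breaks the second name into three pieces, none of which exists in the tarball.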

Inside your for loop:

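for f in *.tar.gz; do
    echo "Processing file $f"
    t="${f%.*}"              # foo.tar.gz -> foo.tar
    gzip -cd "$f" > "$t"     # tar --delete needs an uncompressed archive
    # Keep empty regular .yang members, print their names NUL-separated,
    # and delete them one at a time (-r: do nothing if there are none).
    tar -tvf "$t" | awk -v ORS='\0' '/^-.*\.yang$/ && $3==0 {
      match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
      xargs -0 -r -n1 tar -f "$t" --delete
    gzip -c9 "$t" > "$f.new"
    echo "******************************************"
done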

Note that we must decompress the tarballs before editing them because GNU tar cannot edit compressed tarballs.
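
Both loops leave the original tarballs untouched and write *.tar.gz.new files. Before replacing anything, it may be worth checking what was actually removed. A minimal sketch, where removed_members is a hypothetical helper that compares the member lists of the two tarballs:

```shell
# Hypothetical helper: print the members present in the original tarball ($1)
# but missing from the new one ($2). Runs in a subshell so the temporary
# variables do not leak into the caller.
removed_members() (
    old_list=$(mktemp) && new_list=$(mktemp) || exit 1
    tar -tzf "$1" | sort > "$old_list"
    tar -tzf "$2" | sort > "$new_list"
    comm -23 "$old_list" "$new_list"   # lines only in the first list
    rm -f "$old_list" "$new_list"
)

# Usage (foo.tar.gz is a placeholder name):
# removed_members foo.tar.gz foo.tar.gz.new
```

If the output lists only the empty .yang files you expected, mv foo.tar.gz.new foo.tar.gz replaces the original.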