Extract files from a zip without keeping the top-level folder with Python zipfile

I'm using the following code to extract the files from a zip file while keeping the directory structure:

zip_file = zipfile.ZipFile('archive.zip', 'r')
zip_file.extractall('/dir/to/extract/files/')
zip_file.close()

Here is a structure for an example zip file:

/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg

At the end I want this:

/dir/to/extract/file.jpg
/dir/to/extract/file1.jpg
/dir/to/extract/file2.jpg

But the top-level folder should be dropped only if the zip file has a single top-level folder with all files inside it. So when I extract a zip with this structure:

/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg
/dir2/file.txt
/file.mp3

It should stay like this:

/dir/to/extract/dir1/file.jpg
/dir/to/extract/dir1/file1.jpg
/dir/to/extract/dir1/file2.jpg
/dir/to/extract/dir2/file.txt
/dir/to/extract/file.mp3

Any ideas?

5

There are 5 best solutions below

0
On

Read the entries returned by ZipFile.namelist() to see if they're in the same directory, and then open/read each entry and write it to a file opened with open().
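
A minimal sketch of that idea, under a few assumptions: the function name `extract_flattening_single_root` is made up for illustration, entry names use `/` as separator (as the zip format specifies), and the archive is trusted, since this manual copy does not sanitize entry names the way `extractall()` does.

```python
import os
import shutil
import zipfile


def extract_flattening_single_root(zip_path, dest_dir):
    """Extract zip_path into dest_dir, dropping a lone top-level folder.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # First path segment of every entry, e.g. "dir1" for "dir1/file.jpg".
        first_segments = {name.split("/", 1)[0] for name in names}
        # Strip only when every entry lives inside the same top-level folder.
        strip_root = len(first_segments) == 1 and all("/" in name for name in names)
        for info in archive.infolist():
            relative = info.filename.split("/", 1)[1] if strip_root else info.filename
            if not relative:
                continue  # this entry was the top-level folder itself
            target = os.path.join(dest_dir, *relative.split("/"))
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Copy the entry's bytes into a regular file opened with open().
            with archive.open(info) as source, open(target, "wb") as sink:
                shutil.copyfileobj(source, sink)
```

Called as `extract_flattening_single_root('archive.zip', '/dir/to/extract/files/')`, this should leave the second example layout from the question untouched, because its entries do not all share one top-level folder.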

0
On

Based on @ekhumoro's answer, I came up with a simpler function that extracts everything onto the same level. It is not exactly what you are asking for, but I think it can help someone.

    import os
    from zipfile import ZipFile

    def _basename_members(zip_file: ZipFile):
        for zipinfo in zip_file.infolist():
            # Skip directory entries; their basename would be empty.
            if zipinfo.is_dir():
                continue
            # Drop every directory component so all files land on one level.
            zipinfo.filename = os.path.basename(zipinfo.filename)
            yield zipinfo

    from_zip = "some.zip"
    to_folder = "some_destination/"
    with ZipFile(file=from_zip, mode="r") as zip_file:
        os.makedirs(to_folder, exist_ok=True)
        zip_infos = _basename_members(zip_file)
        zip_file.extractall(path=to_folder, members=zip_infos)
0
On

Basically you need to do two things:

  1. Identify the root directory in the zip.
  2. Remove the root directory from the paths of other items in the zip.

The following should retain the overall structure of the zip while removing the root directory:

import typing, zipfile

def _is_root(info: zipfile.ZipInfo) -> bool:
    if info.is_dir():
        parts = info.filename.split("/")
        # Handle directory names with and without trailing slashes.
        if len(parts) == 1 or (len(parts) == 2 and parts[1] == ""):
            return True
    return False

def _members_without_root(archive: zipfile.ZipFile, root_filename: str) -> typing.Generator:
    for info in archive.infolist():
        # Strip the root only where it is a leading prefix, so a nested directory
        # with the same name is left alone. The root entry itself is skipped,
        # and entries outside the root are not yielded.
        if info.filename.startswith(root_filename) and info.filename != root_filename:
            info.filename = info.filename[len(root_filename):]
            yield info

with zipfile.ZipFile("archive.zip", mode="r") as archive:
    # We will use the first directory with no more than one path segment as the root.
    root = next((info for info in archive.infolist() if _is_root(info)), None)
    if root:
        archive.extractall(path="/dir/to/extract/", members=_members_without_root(archive, root.filename))
    else:
        print("No root directory found in zip.")
0
On

This might be a problem with the zip archive itself. In a Python prompt, try this to see whether the files are in the correct directories inside the zip file.

import zipfile

zf = zipfile.ZipFile("my_file.zip",'r')
first_file = zf.filelist[0]
print(first_file.filename)

This should print something like "dir1". Repeat the steps above, substituting an index of 1 into filelist, like so: first_file = zf.filelist[1]. This time the output should look like 'dir1/file1.jpg'. If this is not the case, then the zip file does not contain directories and everything will be unzipped into one single directory.

2
On

If I understand your question correctly, you want to strip any common prefix directories from the items in the zip before extracting them.

If so, then the following script should do what you want:

import sys, os
from zipfile import ZipFile

def get_members(zip):
    parts = []
    # get all the path prefixes
    for name in zip.namelist():
        # only check files (not directories)
        if not name.endswith('/'):
            # keep list of path elements (minus filename)
            parts.append(name.split('/')[:-1])
    # now find the common path prefix (if any)
    prefix = os.path.commonprefix(parts)
    if prefix:
        # re-join the path elements
        prefix = '/'.join(prefix) + '/'
    # get the length of the common prefix
    offset = len(prefix)
    # now re-set the filenames
    for zipinfo in zip.infolist():
        name = zipinfo.filename
        # skip entries no longer than the prefix, such as the prefix directory itself
        if len(name) > offset:
            # remove the common prefix
            zipinfo.filename = name[offset:]
            yield zipinfo

args = sys.argv[1:]

if len(args):
    zip = ZipFile(args[0])
    path = args[1] if len(args) > 1 else '.'
    zip.extractall(path, get_members(zip))
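
To see how the common-prefix step behaves on the two layouts from the question, here is a small check. The helper name `common_dir_prefix` is made up for illustration; it is meant to mirror only the first half of `get_members` above and works on plain name lists, so no archive is needed.

```python
import os


def common_dir_prefix(names):
    """Return the directory prefix shared by every file entry, or ''.

    Hypothetical helper, not part of zipfile.
    """
    # Directory parts of each file entry; directory entries end with '/'.
    parts = [name.split("/")[:-1] for name in names if not name.endswith("/")]
    # commonprefix also accepts lists, comparing them element by element.
    prefix = os.path.commonprefix(parts)
    return "/".join(prefix) + "/" if prefix else ""


single_root = ["dir1/file.jpg", "dir1/file1.jpg", "dir1/file2.jpg"]
mixed = single_root + ["dir2/file.txt", "file.mp3"]

print(repr(common_dir_prefix(single_root)))  # 'dir1/'
print(repr(common_dir_prefix(mixed)))        # ''
```

Note that this approach strips every shared directory level, not just the first one: names such as `a/b/x` and `a/b/y` give the prefix `a/b/`. If only a single top-level folder should be dropped, the prefix would need to be cut down to its first segment.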