Images extracted from a PDF look rotated and color-inverted

Quick question: are there any major errors in my code, apart from it being messy? Why do the images extracted from a PDF with PyMuPDF look color-inverted and upside down? I tried correcting the extracted images by rotating them, but the colors are still off, even after inverting them.

import fitz  # PyMuPDF
from PIL import Image
from io import BytesIO
import os
import cv2
import numpy as np

pdf_path = '/content/drive/MyDrive/Wettbewerb Aktuell/1803_AusgGesa-pages-2.pdf'

document = fitz.open(pdf_path)

# Specify the range of pages you want to extract images from (e.g., first 2 pages)
start_page = 0
end_page = min(1, document.page_count - 1)  # Assuming a 0-based index

# Create a directory to save images
output_directory = '/content/drive/MyDrive/Wettbewerb Aktuell/images/'
os.makedirs(output_directory, exist_ok=True)

for page_number in range(start_page, end_page + 1):
    # Create a subfolder for each page
    page_folder = os.path.join(output_directory, f"page_{page_number}/")
    os.makedirs(page_folder, exist_ok=True)

    page = document[page_number]
    images = page.get_images(full=True)

    for img_index in range(len(images)):
        img = images[img_index]
        image = img[0]
        base_image = document.extract_image(image)
        image_bytes = base_image["image"]

        # Decode image bytes using OpenCV
        image_np = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)

        # Perform any image processing tasks using OpenCV here
        # For example, you can save the decoded image
        cv2.imwrite(os.path.join(page_folder, f"decoded_image_{img_index}.jpg"), image_np)

        # Print a message indicating that the image has been saved
        print(f"Decoded image saved: {os.path.join(page_folder, f'decoded_image_{img_index}.jpg')}")

document.close()

I want to extract the images from the PDF in the correct orientation and color. Could this be a problem with the PDF itself? If I open the file in Illustrator and extract the images manually, I have no problems. Thanks in advance!

There is 1 best solution below

Gautam Yadav On

I have edited the code to convert the Pixmap to a NumPy array; the folder structure is unchanged. Working with Pixmaps is tricky, and going through the raw image bytes does not always produce a correct image. See the PyMuPDF documentation; other implementations of the same conversion also exist.
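As a minimal sketch of the Pixmap-to-NumPy conversion described above, the helper below reshapes a raw sample buffer into a (height, width, channels) array and reorders RGB to BGR, the channel order OpenCV expects when writing files. The name `samples_to_bgr` is hypothetical, and the code assumes a tightly packed buffer of 3 bytes per pixel, which is what `pix.samples` holds for an RGB Pixmap without alpha.

```python
import numpy as np


def samples_to_bgr(samples: bytes, width: int, height: int) -> np.ndarray:
    """Reshape packed RGB bytes into an (h, w, 3) array in BGR order."""
    expected = width * height * 3
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    rgb = np.frombuffer(samples, dtype=np.uint8).reshape((height, width, 3))
    # Reverse the last axis (RGB -> BGR) and copy so the result is contiguous
    return np.ascontiguousarray(rgb[:, :, ::-1])


# Two pixels in a single row: pure red, then pure blue (RGB order)
demo = samples_to_bgr(bytes([255, 0, 0, 0, 0, 255]), width=2, height=1)
print(demo.tolist())
```

With PyMuPDF this would be called as `samples_to_bgr(pix.samples, pix.w, pix.h)` once the Pixmap is in RGB, and the result can be handed to `cv2.imwrite` without a further color conversion.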

import os
import numpy as np
from PIL import Image
import cv2
import fitz  # PyMuPDF

for page_number in range(start_page, end_page + 1):
    # Create a subfolder for each page
    page_folder = os.path.join(output_directory, f"page_{page_number}/")
    os.makedirs(page_folder, exist_ok=True)

    page = document[page_number]
    images = page.get_images(full=True)

The changes start here.

You also left out the code that creates a Pixmap. You need it because the conversion below operates on that data type specifically:

    # This loop belongs inside the page loop above
    for img_index, img in enumerate(images):
        xref = img[0]
        pix = fitz.Pixmap(document, xref)  # create a Pixmap from the image xref

        # More than 3 color components (e.g. CMYK): convert to RGB
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)

        # Convert Pixmap to a NumPy array of shape (height, width, channels)
        img_np = np.frombuffer(pix.samples, dtype=np.uint8).reshape((pix.h, pix.w, pix.n))
        # Flip the image vertically
        img_np_flipped = cv2.flip(img_np, 0)

        # pix.samples is RGB but OpenCV writes BGR, so swap the channels before saving
        img_path = os.path.join(page_folder, f"image_{img_index}.jpg")
        cv2.imwrite(img_path, cv2.cvtColor(img_np_flipped, cv2.COLOR_RGB2BGR))
        pix = None  # Release the Pixmap
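The code above flips every image and swaps its channels unconditionally. As a possible alternative, here is a sketch that skips OpenCV and lets PyMuPDF write the file itself after normalizing the color space. The helper names `needs_rgb_conversion` and `save_image_as_png` are hypothetical, and whether the saved image comes out in the expected orientation is an assumption that depends on how the PDF places its images, so check the output before dropping the flip.

```python
def needs_rgb_conversion(n: int, alpha: int) -> bool:
    """True when a Pixmap has more than 3 color components (e.g. CMYK)."""
    return n - alpha > 3


def save_image_as_png(document, xref: int, out_path: str) -> None:
    """Save the image at `xref` as a PNG using PyMuPDF only."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependencies

    pix = fitz.Pixmap(document, xref)
    if needs_rgb_conversion(pix.n, pix.alpha):
        pix = fitz.Pixmap(fitz.csRGB, pix)  # PNG cannot store CMYK
    pix.save(out_path)  # output format is inferred from the ".png" extension
```

Inside the image loop this would be called as `save_image_as_png(document, img[0], os.path.join(page_folder, f"image_{img_index}.png"))`. PNG is used here because it is lossless and supports grayscale and alpha, so no channel handling is needed on the NumPy side.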