Extract Image from PDF

- September 09, 2024

Extract Images from PDFs Using PyMuPDF in Python

Input: a PDF file

Output: the images embedded in the PDF, saved as individual files

Why PyMuPDF?

PyMuPDF (imported as fitz) is a Python library for PDF processing. It gives direct access to the elements of a PDF, including text, images, and metadata, which makes it well suited to tasks that read or manipulate PDFs.

Prerequisites

pip install pymupdf

Code Steps:

1. Importing Required Libraries

import os

import fitz  # PyMuPDF

2. Function to Extract Images from a PDF

def extract_images_from_pdf(pdf_path, output_folder):
    # Use the PDF's base name (without its extension) as a prefix for image files
    filename = os.path.splitext(os.path.basename(pdf_path))[0]

    # Create the output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the PDF file; the with-block closes it when extraction finishes
    with fitz.open(pdf_path) as pdf_document:
        # Loop through each page
        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)
            images = page.get_images(full=True)

            for image_index, image in enumerate(images):
                # The first item of each entry is the image's xref (cross-reference number)
                xref = image[0]

                # Extract the image bytes and metadata
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]  # image extension, e.g., png, jpeg

                # Save the image
                image_filename = f"{filename}_page_{page_number + 1}_image_{image_index + 1}.{image_ext}"
                image_filepath = os.path.join(output_folder, image_filename)
                with open(image_filepath, "wb") as image_file:
                    image_file.write(image_bytes)

                print(f"Saved image {image_filename}")

    print(f"Extraction complete. Images saved to {output_folder}")
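One thing to watch for: get_images() lists an image on every page that uses it, so a logo repeated across pages is written once per page. A minimal sketch of one way to avoid this is to track the xrefs already seen. The helper name iter_unique_xrefs is hypothetical, not part of PyMuPDF, and it relies only on len(), load_page() and get_images(full=True), the same calls the function above makes.

```python
def iter_unique_xrefs(pdf_document):
    """Yield (page_number, xref) once per distinct image xref.

    Hypothetical helper: it assumes pdf_document supports len(),
    load_page() and pages with get_images(full=True).
    """
    seen = set()
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        for image in page.get_images(full=True):
            xref = image[0]  # first tuple item is the image's xref
            if xref in seen:
                continue  # already reported on an earlier page
            seen.add(xref)
            # Report the 1-based page where the image first appears
            yield page_number + 1, xref
```

The extraction loop could then iterate over these pairs and call extract_image(xref) once per image instead of once per page occurrence.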

3. Example Usage

pdf_folder = "/data/imageExtraction/PDFs"  # Path to your folder containing PDFs
output_folder = "/data/imageExtraction/pdfImages"  # Folder where images will be saved

# Loop through each PDF in the folder and extract images
for filename in os.listdir(pdf_folder):
    # Skip anything that is not a PDF file
    if not filename.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(pdf_folder, filename)
    extract_images_from_pdf(pdf_path, output_folder)

Conclusion

Extracting images from PDFs using PyMuPDF is not only easy but also highly customizable. This script provides a basic framework that you can adapt for various needs, such as filtering images by size, resolution, or type.
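As a rough illustration of such filtering, the sketch below decides whether to keep an image based on the dictionary returned by extract_image(), which includes width, height, and ext keys alongside the image bytes. The helper name should_keep_image and its default thresholds are assumptions chosen for illustration, not values from this script.

```python
def should_keep_image(base_image, min_width=100, min_height=100,
                      allowed_exts=("png", "jpeg", "jpg")):
    """Return True if the extracted image passes size and type filters.

    base_image is expected to be a dict with "width", "height" and "ext"
    keys, as returned by extract_image(). Thresholds are illustrative.
    """
    if base_image["width"] < min_width or base_image["height"] < min_height:
        return False  # too small: likely an icon, bullet, or decoration
    return base_image["ext"].lower() in allowed_exts
```

Inside the extraction loop, one could call should_keep_image(base_image) right after extract_image(xref) and skip the save step when it returns False.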
