Extract Images from PDFs Using PyMuPDF in Python

Input: a PDF file

Output: the images embedded in that PDF, saved as individual files

Why PyMuPDF?

PyMuPDF (imported as fitz) is a Python library for PDF processing. It gives direct access to the elements of a PDF, including text, images, and metadata, which makes it well suited to tasks that read or manipulate PDFs.
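
As a rough illustration of that access, the sketch below gathers a document's metadata, page count, text length, and image count. The helper name summarize_pdf is hypothetical, and it assumes you pass in a document that is already open, for example the result of fitz.open("some.pdf").

```python
def summarize_pdf(pdf_document):
    """Collect a few basic facts from an already opened PyMuPDF document.

    summarize_pdf is a hypothetical helper used only for illustration.
    """
    page_count = 0
    text_chars = 0
    image_refs = 0
    for page in pdf_document:                 # a document iterates page by page
        page_count += 1
        text_chars += len(page.get_text())    # plain text of the page
        image_refs += len(page.get_images())  # image references on the page
    return {
        "metadata": pdf_document.metadata,    # dict with keys such as "title"
        "pages": page_count,
        "text_chars": text_chars,
        "image_refs": image_refs,
    }
```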

Prerequisites

pip install pymupdf
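
To confirm the installation worked, a small check like the following can be run. It is only a sketch: it assumes the package is registered under the distribution name PyMuPDF (the name used on PyPI), and installed_pymupdf_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is missing."""
    try:
        return metadata.version("PyMuPDF")
    except metadata.PackageNotFoundError:
        return None


version = installed_pymupdf_version()
print(f"PyMuPDF version: {version}" if version else "PyMuPDF is not installed")
```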

Code Steps

1. Importing Required Libraries

import fitz  # PyMuPDF
import os

2. Function to Extract Images from a PDF

def extract_images_from_pdf(pdf_path, output_folder):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    # Base name without the extension; splitext keeps any dots inside the name
    filename = os.path.splitext(os.path.basename(pdf_path))[0]
    
    # Create the output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Loop through each page
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        images = page.get_images(full=True)

        for image_index, image in enumerate(images):
            # Extract the image reference
            xref = image[0]
            
            # Extract the image bytes
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]  # image extension, e.g., png, jpeg
            
            # Save the image
            image_filename = f"{filename}_page_{page_number + 1}_image_{image_index + 1}.{image_ext}"
            image_filepath = os.path.join(output_folder, image_filename)
            
            with open(image_filepath, "wb") as image_file:
                image_file.write(image_bytes)
                
            print(f"Saved image {image_filename}")

    # Release the file handle once all pages are processed
    pdf_document.close()

    print(f"Extraction complete. Images saved to {output_folder}")
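
One case the function does not handle: an image reused on several pages (a logo, for example) is listed on each of those pages, so it gets saved more than once. A minimal sketch of skipping repeats by remembering each xref is shown below. The generator name iter_unique_image_xrefs is hypothetical, and it expects an already opened document.

```python
def iter_unique_image_xrefs(pdf_document):
    """Yield (page_number, xref) once per distinct image in the document.

    page_number is 1-based and refers to the first page that lists the image.
    """
    seen = set()
    for page_index in range(len(pdf_document)):
        page = pdf_document.load_page(page_index)
        for image in page.get_images(full=True):
            xref = image[0]          # first item of each entry is the xref
            if xref in seen:
                continue             # already yielded from an earlier page
            seen.add(xref)
            yield page_index + 1, xref
```

Each yielded xref could then be passed to pdf_document.extract_image(xref) and saved as in the function above.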

3. Example Usage

pdf_folder = "/data/imageExtraction/PDFs"  # Path to your folder containing PDFs
output_folder = "/data/imageExtraction/pdfImages"  # Folder where images will be saved

# Loop through each PDF in the folder and extract images
for filename in os.listdir(pdf_folder):
    # Skip anything that is not a PDF file
    if not filename.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(pdf_folder, filename)
    extract_images_from_pdf(pdf_path, output_folder)
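
When processing a large folder, a single damaged PDF should not stop the whole run. The following is a sketch of a batch helper that skips subfolders and non-PDF files and records failures instead of raising. The name extract_from_folder and the extract_fn parameter are assumptions made for illustration; the broad exception handler is a simplification.

```python
from pathlib import Path


def extract_from_folder(pdf_folder, output_folder, extract_fn):
    """Call extract_fn(pdf_path, output_folder) for every PDF in pdf_folder.

    Returns a list of (file name, error message) pairs for PDFs that failed.
    """
    failures = []
    for pdf_path in sorted(Path(pdf_folder).iterdir()):
        # Skip subfolders and files without a .pdf extension
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        try:
            extract_fn(str(pdf_path), output_folder)
        except Exception as error:  # broad on purpose: this is only a sketch
            failures.append((pdf_path.name, str(error)))
    return failures
```

It could be called as extract_from_folder(pdf_folder, output_folder, extract_images_from_pdf), and the returned list printed at the end of the run.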

Conclusion

Extracting images from PDFs using PyMuPDF is not only easy but also highly customizable. This script provides a basic framework that you can adapt for various needs, such as filtering images by size, resolution, or type.
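
As one example of such filtering, here is a minimal sketch that checks the size and type of an extracted image. It relies on the "width", "height", and "ext" entries of the dictionary returned by extract_image. The function name keep_image and the default thresholds are assumptions, not part of PyMuPDF.

```python
def keep_image(base_image, min_width=100, min_height=100, allowed_exts=None):
    """Decide whether an extracted image passes simple size and type filters.

    base_image is the dict returned by pdf_document.extract_image(xref).
    The thresholds here are illustrative defaults.
    """
    if base_image["width"] < min_width or base_image["height"] < min_height:
        return False  # too small, likely an icon or decoration
    if allowed_exts is not None and base_image["ext"] not in allowed_exts:
        return False  # file type not wanted
    return True
```

Inside the extraction loop, a check such as "if not keep_image(base_image): continue" placed before the save step would skip small icons and unwanted formats.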


