Extracting Tables and Text from Images Using Python
In this blog, we'll explore a complete Python solution that detects and extracts tables and text from images using Hugging Face Transformers, OpenCV, PaddleOCR, and EasyOCR. This step-by-step breakdown includes code to detect tables, extract content from individual table cells, and retrieve any remaining text in the image.
Overview
When working with scanned documents, such as invoices or forms, it is essential to accurately extract both structured information (like tables) and unstructured text. The approach we’ll explore uses Microsoft's pretrained object detection model to locate tables and OCR techniques to extract the text from both table cells and the rest of the image.
Steps:
1. First, detect the table using Microsoft's model and save an image that contains only the detected table.
2. From the detected table, create a separate image for each cell.
3. Read the text from each cell image.
4. Finally, to read the remaining text outside the table, fill the area where the table was located with white and run OCR on the rest of the image.
Original Image:
Detected Table Image:
Sample Cell Images:
Image with Text Only, excluding Table:
Code Breakdown
- Library Imports:
To begin, we import the required libraries:
- PIL: For opening and manipulating images.
- Transformers: Pretrained models from Hugging Face for table detection.
- torch: For working with PyTorch models.
- OpenCV: An image processing library.
- EasyOCR and PaddleOCR: OCR libraries for extracting text from images.
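If these libraries aren't installed yet, they can usually be added with pip (the package names below are the standard PyPI distributions; PaddleOCR also needs the paddlepaddle runtime):
pip install pillow transformers torch opencv-python numpy pandas tabulate easyocr paddleocr paddlepaddle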
from PIL import Image, ImageDraw  # ImageDraw is needed later to draw the detected table box
from transformers import DetrFeatureExtractor, TableTransformerForObjectDetection
import torch
import cv2
import os
import numpy as np
import pandas as pd
from tabulate import tabulate
import easyocr
from paddleocr import PaddleOCR
- OCR Setup: We initialize both EasyOCR and PaddleOCR for reading text from images.
reader = easyocr.Reader(['en'])
ocr = PaddleOCR(use_angle_cls=True, lang='en')
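As a quick sanity check, both readers can be pointed at the same image; note that their output shapes differ. The file path here is just an assumed example:
sample_path = 'E:/AI Codes/Extract text and table from image/Images/sample.jpg'  # assumed example image
easy_result = reader.readtext(sample_path)       # list of (bounding_box, text, confidence) tuples
paddle_result = ocr.ocr(sample_path, cls=True)   # nested list: one list per page of [box, (text, confidence)]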
- Pretrained Table Detection Model: We use Microsoft's TableTransformerForObjectDetection model to detect tables in the image.
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
- Paths Setup: Defining file paths for images, processed images, and text outputs.
imagesFolderPath = 'E:/AI Codes/Extract text and table from image/Images'
txt_Folder = 'E:/AI Codes/Extract text and table from image/txtFiles'
inProcessImages_Folder_path = 'E:/AI Codes/Extract text and table from image/inProcessImages'
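Before processing, it's worth making sure the output folders exist; this is a small convenience step that isn't in the original listing:
# Create the output folders if they don't exist yet
for folder in (txt_Folder, inProcessImages_Folder_path):
    os.makedirs(folder, exist_ok=True)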
Table Detection and Extraction
Step 1: Detecting the Table
The detectTable function detects the table in an image using Microsoft's table detection model. The detected table is saved as a separate image file for further processing, and the table's bounding boxes are returned for later use.
def detectTable(original_image_path, base_filename):
    image = Image.open(original_image_path).convert("RGB")
    feature_extractor = DetrFeatureExtractor()
    encoding = feature_extractor(image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoding)
    # Pass the image size so the returned boxes are in absolute pixel coordinates
    target_sizes = torch.tensor([image.size[::-1]])
    results = feature_extractor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]
    plot_results(image, results['scores'], results['labels'], results['boxes'], base_filename, original_image_path)
    # Return the table boxes so the main loop can mask them out later
    return [box.tolist() for box in results['boxes']]
- How it works: The table detection model processes the image and returns bounding boxes for detected tables. The plot_results function then saves an image that contains only the detected table, and detectTable returns the boxes so the main loop can later mask the table out.
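If you want to see what the model actually found before cropping, you can print the detections; this is a minimal sketch reusing the results variable from detectTable (the model's id2label mapping turns class indices into names such as "table"):
for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    # Print each detection's class name, confidence, and pixel box
    print(f"{model.config.id2label[label.item()]}: score {score.item():.2f}, box {[round(v, 1) for v in box.tolist()]}")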
Step 2: Cropping the Detected Table
Once a table is detected, the plot_results function crops the table region into its own image; the individual cells are handled in the next step.
def plot_results(image, scores, labels, boxes, base_filename, original_image_path):
    draw = ImageDraw.Draw(image)
    for score, label, box in zip(scores, labels, boxes):
        # Crop and save the table image first, so the crop stays clean
        table_image = image.crop(box.tolist())
        table_image.save(f'{inProcessImages_Folder_path}/{base_filename}_table.jpg')
        # Then draw a bounding box around the detected table for visualization
        draw.rectangle(box.tolist(), outline="red", width=3)
- How it works: The bounding box returned by the detection model is used to crop the table region out of the full image; the cropped table image is what the cell-level processing in Step 3 operates on.
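One practical refinement is to pad the detected box by a few pixels before cropping, so the table's outer borders aren't clipped. pad_box below is a hypothetical helper, not part of the original code:
def pad_box(box, pad, img_width, img_height):
    # Expand the box by `pad` pixels on every side, clamped to the image bounds
    x1, y1, x2, y2 = box
    return [max(0, x1 - pad), max(0, y1 - pad), min(img_width, x2 + pad), min(img_height, y2 + pad)]
You could then call image.crop(pad_box(box.tolist(), 10, *image.size)) in place of the direct crop.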
Step 3: Reading Text from Each Cell
The table_detection_display function processes the detected table image: it recovers the table grid with morphological operations and then extracts the text using OCR.
def table_detection_display(img_path):
    img = cv2.imread(img_path)
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Invert so the table lines are white on black for the morphological operations
    img_bin = cv2.bitwise_not(img_gray)
    # Detect vertical lines in the table
    kernel_length = np.array(img_gray).shape[1] // 80
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length))
    vertical_lines_img = cv2.dilate(cv2.erode(img_bin, vertical_kernel), vertical_kernel)
    # Detect horizontal lines in the table
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
    horizontal_lines_img = cv2.dilate(cv2.erode(img_bin, horizontal_kernel), horizontal_kernel)
    # Combine both line images to form the table grid
    table_segment = cv2.addWeighted(vertical_lines_img, 0.5, horizontal_lines_img, 0.5, 0.0)
    # Use OCR to extract text from the segmented table
    result = ocr.ocr(table_segment, cls=True)
    # Return the extracted text
    return result
- How it works: Using OpenCV, we detect vertical and horizontal lines to recover the table grid, then apply OCR to extract the text. See the cell-cropping sketch below for turning the grid into one image per cell.
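The snippet above runs OCR over the segmented table as a whole; to get one image per cell, as described in the overview, you can find the cell contours in the grid image and crop each bounding rectangle. crop_cells below is a sketch of that idea (the helper name and size thresholds are assumptions):
def crop_cells(img, table_segment, out_dir, base_filename):
    # Binarize the grid image so contours are well defined
    _, binary = cv2.threshold(table_segment, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    count = 0
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # Skip regions too small to be cells, or as large as the whole table
        if w < 20 or h < 10 or (w > 0.9 * img.shape[1] and h > 0.9 * img.shape[0]):
            continue
        cv2.imwrite(f'{out_dir}/{base_filename}_cell_{count}.jpg', img[y:y + h, x:x + w])
        count += 1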
Step 4: Removing the Table and Extracting Remaining Text
After extracting table content, we need to remove the table from the image and read the rest of the text. To do this, we fill the table area with a white color and then run OCR on the rest of the image.
def extractRemainingText(image_path, table_boxes):
    img = cv2.imread(image_path)
    for box in table_boxes:
        # Boxes come back as floats, so cast to int for OpenCV
        x1, y1, x2, y2 = map(int, box)
        # Fill the table area with white color
        cv2.rectangle(img, (x1, y1), (x2, y2), (255, 255, 255), -1)
    # Save the modified image
    processed_image_path = f"{inProcessImages_Folder_path}/processed_{os.path.basename(image_path)}"
    cv2.imwrite(processed_image_path, img)
    # Extract remaining text using OCR
    result = ocr.ocr(processed_image_path, cls=True)
    return result
- How it works: The table's location is filled with white color to eliminate it from the image. Then, we run OCR on the remaining parts of the image to extract any unstructured text outside the table.
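PaddleOCR returns nested lists of boxes and (text, confidence) pairs rather than plain strings, so it helps to flatten the result before writing it to a file. paddle_result_to_text is an assumed helper, not part of the original post:
def paddle_result_to_text(result):
    # Collect just the recognized strings, one per line
    lines = []
    for page in result or []:
        for _box, (text, _confidence) in page or []:
            lines.append(text)
    return '\n'.join(lines)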
Final Integration
The main loop integrates all these functions:
- Detect tables and save the cropped images.
- Process each table cell and extract text.
- Remove the table from the original image and extract the remaining text.
for filename in os.listdir(imagesFolderPath):
    img_path = os.path.join(imagesFolderPath, filename)
    base_filename = os.path.splitext(filename)[0]
    # Detect the table, save the cropped table image, and keep the boxes
    table_boxes = detectTable(img_path, base_filename)
    # Path of the cropped table image saved by plot_results
    table_image_path = f'{inProcessImages_Folder_path}/{base_filename}_table.jpg'
    with open(os.path.join(txt_Folder, f"{base_filename}.txt"), 'w') as file:
        # Extract text from the detected table
        file.write(str(table_detection_display(table_image_path)))
        # Extract remaining text from the image
        remaining_text = extractRemainingText(img_path, table_boxes)
        file.write(str(remaining_text))
    print(f"Completed processing: {filename}")
Conclusion
This Python solution leverages table detection models and OCR techniques to handle complex image extraction tasks. Whether you're processing scanned forms or extracting tables and text from invoices, this approach provides a structured, efficient, and scalable method to extract both structured (table) and unstructured text.
Feel free to try this solution on your datasets and streamline your document processing tasks!