Posts

Cross Attention in Decoder Block of Transformer

Notice where the cross attention is marked: two arrows come from the encoder block and one comes from the decoder block. Why do we need to consider the encoder block at all? Let's say we have already predicted 2 words and need to predict the 3rd word. What does it depend on? Of course, on the first 2 words from the decoder block, and on the context of the original sentence from the encoder block. So we need to figure out the relationship between these two. How do we get that relationship?
q : Hindi (from the Decoder Block)
k : Eng (from the Encoder Block)
v : Eng (from the Encoder Block)
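To make that q/k/v wiring concrete, here is a minimal single-head cross-attention sketch in NumPy. The shapes, weight matrices, and toy data are my own assumptions for illustration, not taken from the post.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = decoder_states @ Wq                      # (tgt_len, d_k)
    K = encoder_states @ Wk                      # (src_len, d_k)
    V = encoder_states @ Wv                      # (src_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)           # attention over source tokens
    return weights @ V                           # (tgt_len, d_k)

# toy shapes: 2 Hindi tokens generated so far, 4 English source tokens, d_model = 8
d_model = 8
dec = np.random.randn(2, d_model)   # decoder-side representations (Hindi)
enc = np.random.randn(4, d_model)   # encoder output (English)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (2, 8)
```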

Decoder Block in Transformer : Understanding Masked Self-Attention and Masked Multi-Head Attention

Masked Self-Attention : What is an autoregressive model? A model that predicts the next value based on previous values, like next-word prediction: to predict the next word, you need the previous words. Now, the question is why the model behaves differently during training and inference. To answer this, and to make sense of the statement above, 'masked self-attention' comes into the picture. Now, focus on the example below. During inference: to predict the next word, we use the previously generated word as input. During training: if you look at this diagram, observe that even if the model predicts the wrong word, we pass the correct word from our dataset as the input at the next step (teacher forcing), so that the model learns correctly. So, during training, we are not dependent on the model's previous predictions. Hence, it is non-autoregressive during training. Now, pause for a moment and think: what does s...
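As a rough sketch of how the mask enforces this behaviour, here is single-head masked self-attention in NumPy: positions above the diagonal are blocked, so each token can only attend to itself and earlier tokens. The names, shapes, and random data are illustrative assumptions, not the post's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    """Each position may only attend to itself and earlier positions."""
    T = x.shape[0]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (T, T)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)      # block future positions
    return softmax(scores, axis=-1) @ V

x = np.random.randn(5, 8)                             # 5 tokens, d_model = 8
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)     # (5, 8)
```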

Encoder Architecture - Transformer

Transformer Architecture : The Transformer has 2 parts : 1. Encoder 2. Decoder. The encoder and decoder stacks each repeat 6 times (as mentioned in the "Attention Is All You Need" paper); the authors found this to be the best value in their experiments. Each encoder block is identical to the others, and the same holds for the decoder blocks. Now, let's zoom in to the Encoder Block. 6 Encoder Blocks: The first step looks like this. Input : sentence. Output : positionally encoded vectors. Now, let's focus on the multi-head attention and normalization part. Input : output of the first step (positionally encoded vectors). Output : normalized vectors. Why is there a residual connection, i.e. why do we add the input back to the output of multi-head attention? The paper does not explain this explicitly, but two commonly given explanations are: 1. It keeps training stable. 2. If the current layer messes up, we still keep some of the good original data. A sketch of the whole block is given after this paragraph. Now, let's focus on the Feed-Forwar...
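Putting the pieces together, here is a minimal sketch of one encoder block, assuming the post-layer-norm arrangement from the paper (Add & Norm after each sub-layer), a stand-in function for multi-head attention, and no learnable layer-norm parameters. It is an illustration of the structure, not the post's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector independently."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, attention_fn, W1, b1, W2, b2):
    # sub-layer 1: multi-head attention, then residual connection + layer norm
    x = layer_norm(x + attention_fn(x))
    # sub-layer 2: position-wise feed-forward network, again residual + norm
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2     # ReLU(x W1 + b1) W2 + b2
    return layer_norm(x + ff)

d_model, d_ff, T = 8, 32, 5
x = np.random.randn(T, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
dummy_attention = lambda h: h    # stand-in for real multi-head self-attention
print(encoder_block(x, dummy_attention, W1, b1, W2, b2).shape)   # (5, 8)
```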

Layer Normalization in Transformer

1. Why can't you apply batch normalization in Transformers? Let's say we have 4 sentences and we want to pass them through the self-attention layers in batches. Now take r1 and r2: notice that the number of words differs between the two sentences, so we apply padding. Now focus on d1 and notice those 0s from the padding. You can clearly see the problem: if a sentence is very long and the batch is also large (say 32 sentences in one batch and s1 has 20k tokens), there will be far too many unnecessary 0s. Since batch normalization computes its mean and variance across the whole batch, those padded 0s skew the statistics and distort training. Hence, batch normalization is not a good solution for transf...
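A toy NumPy comparison makes this visible (the batch, shapes, and numbers below are made up for illustration): batch norm pools statistics over all sentences and positions, so the padding zeros pollute them, while layer norm only looks at each token's own feature vector.

```python
import numpy as np

# toy batch: 2 sentences padded to 4 tokens, d_model = 3
# the last three rows of sentence 2 are all-zero padding
batch = np.array([
    [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.3, 0.7, 1.2], [1.1, 0.9, 0.4]],
    [[4.0, 5.0, 6.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])

def batch_norm(x, eps=1e-5):
    """Normalizes each feature across the whole batch (all sentences and
    positions), so the padding zeros drag the mean and variance around."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalizes across the feature dimension of each token separately,
    so other sentences and their padding never enter the statistics."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

print(batch_norm(batch)[0, 0])   # statistics polluted by the padded zeros
print(layer_norm(batch)[0, 0])   # unaffected by other tokens or padding
```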

Positional Encoding in Transformer

1. Why Position Matters in Transformers? Transformers rely on self-attention, which processes tokens in parallel. This means that, unlike RNNs, they don't inherently know the order of words. So sentences like “Ravi killed the lion” vs. “The lion killed Ravi” would look identical to a vanilla Transformer, which is clearly problematic!
🧪 Idea #1: The Naïve Approach. A simple fix would be to append the token's index/position to its embedding vector. Issues: Unbounded values: position IDs can become huge (e.g. 100,000+ in long texts), destabilizing training. Discrete steps: sharp jumps between integers disrupt gradient flow.
🧪 Idea #2: Normalize the Position Numbers. What if we scale the position numbers down, say by dividing by the sentence length, to make them small and smooth? That helps a bit; values don't explode anymore. Issues: if you observe both sentences, the word at the second position gets a different value in each: 1 for sentence 1 and 0.5 for sentence 2. So the neural network will get confused while training: what a...
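A tiny sketch of that second idea makes the mismatch explicit. It assumes the normalization is "position divided by sentence length", and the example sentences are mine, chosen only to reproduce the 1 vs. 0.5 values mentioned above.

```python
def normalized_positions(sentence):
    """Naive idea #2: scale each 1-based position by the sentence length."""
    n = len(sentence)
    return [(i + 1) / n for i in range(n)]

s1 = ["Ravi", "sleeps"]                        # 2 tokens
s2 = ["The", "lion", "killed", "Ravi"]         # 4 tokens
print(normalized_positions(s1))   # [0.5, 1.0]
print(normalized_positions(s2))   # [0.25, 0.5, 0.75, 1.0]
# the word at position 2 gets 1.0 in s1 but 0.5 in s2: the same position
# maps to different values depending on sentence length, which confuses training
```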

How Is the Word Embedding Generated?

How Is the Word Embedding Generated? | A Simple Guide with Examples. What is a Word Embedding? Imagine you have a word like “apple.” Now think of representing “apple” as a point in a space, say a 3D space. Each dimension can represent a feature or meaning category. For example: Dimension 1: Tech (related to Apple Inc.). Dimension 2: Fruit (edible apple). Dimension 3: Vehicle (rare but possible use). Let's say our model generates this vector for “apple”: apple → [0.2, 0.8, 0.00001], where the dimensions represent [tech, fruit, vehicle]. This tells us that “apple” here mostly refers to the fruit (0.8), a little to tech (0.2), and is barely related to vehicles. Similar Words Stay Closer: In this vector space, words with similar meanings are closer together, and words with different meanings are far apart. 🐶 dog and 🐱 cat might be nearby. 🚗 car and 🌳 tree would be far apart. 🍎 “apple” (fruit) and 🍇 “grape” will likely be close. This closeness is captur...
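That closeness is usually measured with cosine similarity. Here is a small sketch using hand-made toy vectors over the same [tech, fruit, vehicle] dimensions; the numbers are invented for illustration and are not from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors over [tech, fruit, vehicle]
apple = np.array([0.2, 0.8, 0.00001])
grape = np.array([0.05, 0.9, 0.0])
car   = np.array([0.1, 0.0, 0.95])

print(cosine_similarity(apple, grape))   # high: both are mostly "fruit"
print(cosine_similarity(apple, car))     # low: little overlap in meaning
```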