I’m trying to extract tabular data from a scanned engineering document.
The table contains:

  • merged header cells

  • irregular row heights

  • irregular column widths

  • faint and broken borders

  • text inside every cell

  • vertical strokes in text that look like borders

  • engineering symbols and logos

My goal is to extract the table in the exact same structure as the image:

  • correct rows

  • correct columns

  • correct merged cells

  • correct OCR text inside each cell


❗ What I need

A solution that can:

  1. Detect the true horizontal and vertical lines

  2. Reconstruct the table grid accurately

  3. Identify and handle merged cells

  4. Extract OCR text cell by cell in the correct reading order

  5. Produce a structured output (e.g., a pandas DataFrame) matching the original layout
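
To make steps 2 to 4 concrete, here is a minimal sketch of one way merged cells could be recovered once a candidate grid exists. It assumes the row and column boundaries are already known, and that two hypothetical boolean tables, `has_right_border` and `has_bottom_border`, record whether a real line separates each grid cell from its right and lower neighbour. These names are illustrative and do not come from any library. Adjacent cells with no border between them are grouped with a small union-find.

```python
def find_merged_cells(n_rows, n_cols, has_right_border, has_bottom_border):
    """Group grid cells that have no border between them.

    has_right_border[r][c]  -> True if a line separates (r, c) from (r, c + 1)
    has_bottom_border[r][c] -> True if a line separates (r, c) from (r + 1, c)
    Both tables are assumed inputs (hypothetical names, e.g. from line coverage).
    Returns a list of dicts with row, col, rowspan, colspan in reading order.
    """
    parent = list(range(n_rows * n_cols))

    def find(i):
        # Follow parent links to the root, compressing the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for r in range(n_rows):
        for c in range(n_cols):
            idx = r * n_cols + c
            if c + 1 < n_cols and not has_right_border[r][c]:
                union(idx, idx + 1)        # no line to the right: same cell
            if r + 1 < n_rows and not has_bottom_border[r][c]:
                union(idx, idx + n_cols)   # no line below: same cell

    # Collect the grid positions belonging to each merged group
    groups = {}
    for idx in range(n_rows * n_cols):
        groups.setdefault(find(idx), []).append(divmod(idx, n_cols))

    cells = []
    for members in groups.values():
        rows = [r for r, _ in members]
        cols = [c for _, c in members]
        cells.append({
            "row": min(rows),
            "col": min(cols),
            "rowspan": max(rows) - min(rows) + 1,
            "colspan": max(cols) - min(cols) + 1,
        })
    # Top-to-bottom, left-to-right reading order
    return sorted(cells, key=lambda cell: (cell["row"], cell["col"]))
```

Sorting by `(row, col)` gives a simple reading order. A group that is not rectangular (for example because one border was missed) is reported by its bounding box in this sketch, so it would need a separate check.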


❗ What I’ve tried (with code) — but still not working

I have tried multiple OpenCV-based approaches, but none can robustly reconstruct the table.

Below is a summary of each method and why it fails.


1. Contour-based cell detection

cnts, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

❌ Problems:

  • text inside the cells produces additional contours

  • grid lines that touch text merge into large irregular polygons

  • hard to get reading order

  • merged cells break the hierarchy

  • header row splits into many sub-contours


2. Hough Line Transform

lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=50)

❌ Problems:

  • partial faint lines detected as many segments

  • cannot distinguish broken borders from noise

  • short text strokes (“I”, “|”, “1”) detected as vertical lines

  • merging line segments accurately becomes very difficult
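
To show what the merging step involves, here is a minimal sketch for vertical segments only. It assumes each Hough segment has already been reduced to a tuple `(x, y1, y2)` with `y1 <= y2`; the function name and the `x_tol` and `max_gap` parameters are illustrative choices, not part of OpenCV. Segments whose x positions are within `x_tol` of each other are clustered, then spans separated by at most `max_gap` pixels are joined.

```python
def merge_vertical_segments(segments, x_tol=3, max_gap=15):
    """Merge near-collinear vertical segments given as (x, y1, y2)."""
    # Cluster segments whose x positions are within x_tol of their neighbour
    clusters = []
    for seg in sorted(segments):
        if clusters and seg[0] - clusters[-1][-1][0] <= x_tol:
            clusters[-1].append(seg)
        else:
            clusters.append([seg])

    merged = []
    for cluster in clusters:
        # Represent the cluster by its mean x position
        x = round(sum(s[0] for s in cluster) / len(cluster))
        spans = sorted((y1, y2) for _, y1, y2 in cluster)
        cur_start, cur_end = spans[0]
        for y1, y2 in spans[1:]:
            if y1 - cur_end <= max_gap:
                # Small gap (or overlap): treat as the same broken line
                cur_end = max(cur_end, y2)
            else:
                merged.append((x, cur_start, cur_end))
                cur_start, cur_end = y1, y2
        merged.append((x, cur_start, cur_end))
    return merged
```

The difficulty described above shows up in the two parameters: a `max_gap` large enough to bridge a broken border can also join two stacked text strokes, and a short stroke that survives merging still has to be rejected by some length or coverage rule afterwards.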


3. Morphological Line Detection

Horizontal lines:

h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
h_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)

Vertical lines:

v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
v_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)

❌ Problems:

  • vertical strokes from letters like “M”, “I”, “H”, “T”, “1” are detected as column lines

  • small horizontal strokes under text get detected as row lines

  • header row gets split into 10–20 false columns

  • faint broken borders produce multiple lines

Even after line merging and connected-components filtering, text strokes are still detected as lines.


4. Connected Component Filtering

I attempted strong filtering:

if h >= 0.7 * image_height and w < 10:
    vertical_lines.append((x, y, w, h))  # keep this component as a vertical line

I used a similar rule for horizontal lines.

❌ Problems:

  • real table borders are sometimes broken → rejected

  • text strokes occasionally exceed height threshold → accepted

  • threshold tuning is image dependent
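
One alternative to a per-component height threshold is to score each image column by how much of the table height it covers in total, so that a broken border still scores high while an isolated text stroke scores low. The sketch below is only an illustration under assumptions: `mask` is a binary image such as the output of the vertical opening, `top` and `bottom` are the (assumed known) vertical extent of the table, and `min_coverage` and `band` are hypothetical tuning parameters.

```python
import numpy as np

def vertical_line_candidates(mask, top, bottom, min_coverage=0.6, band=1):
    """Return (x, coverage) for columns whose ink covers most of the table height."""
    region = mask[top:bottom] > 0

    # Tolerate slight skew: a row counts as covered at x if ink lies within x +/- band
    widened = region.copy()
    for shift in range(1, band + 1):
        widened[:, shift:] |= region[:, :-shift]
        widened[:, :-shift] |= region[:, shift:]

    # Fraction of rows covered in each column, gaps included
    coverage = widened.mean(axis=0)
    cols = np.flatnonzero(coverage >= min_coverage)

    lines = []
    if cols.size:
        # Group runs of adjacent columns into one line and report its centre
        groups = np.split(cols, np.flatnonzero(np.diff(cols) > 1) + 1)
        lines = [(int(g.mean()), float(coverage[g].max())) for g in groups]
    return lines
```

This still depends on a threshold, but it is relative to the table height rather than to a single connected component, so a border split into several pieces is not rejected outright. It does not help with separators that only span part of the table, such as those under merged headers; those would need the same scoring applied per row band.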


❗ Why this is hard

This engineering table includes:

  • many merged cells

  • multiple header bands

  • broken borders due to scanning

  • text overlapping borders

  • vertical elements in the logo

  • dense text with vertical-like strokes

  • faint interior separators

Because of these issues, pure OpenCV geometric reconstruction becomes unreliable.


❗ What I am asking for

What is the recommended method (OpenCV-only, hybrid, or ML-based) for a complex scanned engineering document that can:

  • reliably detect the table grid,

  • preserve the table structure,

  • handle merged cells, and

  • extract OCR text cell by cell in the correct order?

I’m looking for:

  • a robust OpenCV pipeline, OR

  • a deep learning approach (TableNet, CascadeTabNet, YOLO table models), OR

  • hybrid OpenCV + ML guidance, OR

  • built-in table extraction tools that handle structural reconstruction


📌 Minimal Reproducible Example

Here is the code I am currently using:

import cv2
import numpy as np

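img = cv2.imread("table.png")
if img is None:  # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("table.png could not be read")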
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

bw = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,
    cv2.THRESH_BINARY_INV,
    15, 8
)

# Morphological line detection
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
raw_h = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)

v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
raw_v = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)

# Filtering (still fails)
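def filter_vertical(binary, min_height, max_width=10):
    """Keep only connected components that are tall and thin."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    out = np.zeros_like(binary)
    for i in range(1, num):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if h >= min_height and w < max_width:
            # Copy the component's own pixels, not its whole bounding box
            out[labels == i] = 255
    return out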

v_lines = filter_vertical(raw_v, int(0.7 * img.shape[0]))
cv2.imwrite("vlines_debug.png", v_lines)

Even with strong filtering, vertical text strokes still appear as table lines, breaking the grid.


📌 Desired Output

A Pandas DataFrame where:

  • cell positions map to the real table structure

  • merged cells are respected

  • data is extracted in correct reading order

  • the table layout matches the scanned image
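
To illustrate the target layout, here is a minimal sketch of how detected cells could be placed into a rectangular grid. It assumes each cell is a dict with `row`, `col`, `text`, and optional `rowspan` and `colspan` keys; this structure and the `repeat_merged` flag are hypothetical, chosen only for the example.

```python
def cells_to_grid(cells, n_rows, n_cols, repeat_merged=True):
    """Place cells (with optional spans) into an n_rows x n_cols list of lists."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        r0, c0 = cell["row"], cell["col"]
        for r in range(r0, r0 + cell.get("rowspan", 1)):
            for c in range(c0, c0 + cell.get("colspan", 1)):
                # Either repeat the text across the merged area,
                # or keep it only in the top-left (anchor) position
                if repeat_merged or (r, c) == (r0, c0):
                    grid[r][c] = cell["text"]
    return grid


example_cells = [
    {"row": 0, "col": 0, "colspan": 2, "text": "Dimensions"},
    {"row": 0, "col": 2, "rowspan": 2, "text": "Material"},
    {"row": 1, "col": 0, "text": "Width"},
    {"row": 1, "col": 1, "text": "Height"},
]
example_grid = cells_to_grid(example_cells, n_rows=2, n_cols=3)
```

The resulting list of lists can be passed directly to `pandas.DataFrame(example_grid)`. Repeating the text across a merged area keeps the frame rectangular; with `repeat_merged=False` the covered positions stay `None`, which preserves the distinction between a merged cell and repeated values.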


🙏 Any guidance would be greatly appreciated.

I am open to:

  • OpenCV-only solutions

  • ML-based table structure models

  • hybrid approaches

  • or suggestions for more reliable tooling

1 Reply

Could you add the original image?
