
Invoice Entity Bounding Box Mapping (Duplicate Handling)
from autoskill
Patches OCR invoice entity-to-bounding-box mapping to avoid shared boxes for duplicate values, search amounts sections in reverse order, and ensure coordinate uniqueness.
What it does
Adds concrete logic to OCR invoice entity mapping workflows so that duplicate entity values are handled safely. When the same value appears multiple times, the algorithm assigns each occurrence a distinct bounding box (no reuse), uses memoization to track boxes that are already taken, and falls back to the next-best match when a candidate overlaps a claimed box. For amounts_and_tax sections, it reverses the search order (bottom-up), since totals and tax lines usually sit near the bottom of an invoice. Multi-token entities get sequence-aware matching and overlap checks so that tokens don't claim the same coordinates.
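The duplicate handling and bottom-up search described above can be sketched roughly as follows. This is an illustrative sketch, not the skill's actual code: the function name `assign_boxes`, the row layout (a dict with `text` and `box` keys), and the `bottom_up` flag are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def assign_boxes(
    entities: List[Tuple[str, str]],
    ocr_rows: List[dict],
    bottom_up: bool = False,
) -> Dict[str, Optional[Box]]:
    """Map each (name, value) entity to a distinct box, never reusing one.

    ocr_rows is assumed to be in reading order (top of page first).
    With bottom_up=True the rows are scanned from the end of the page,
    as one might do for an amounts/tax section.
    """
    rows = list(reversed(ocr_rows)) if bottom_up else list(ocr_rows)
    used = set()  # boxes already claimed by an earlier entity
    result: Dict[str, Optional[Box]] = {}
    for name, value in entities:
        match = None
        for row in rows:
            # Skip claimed boxes so the next-best occurrence is taken instead.
            if row["text"] == value and row["box"] not in used:
                match = row["box"]
                break
        if match is not None:
            used.add(match)
        result[name] = match  # None when every occurrence is already taken
    return result


# Hypothetical OCR output in which "100.00" appears twice.
rows = [
    {"text": "100.00", "box": (10, 10, 50, 20)},
    {"text": "Widget", "box": (10, 30, 60, 40)},
    {"text": "100.00", "box": (10, 90, 50, 100)},
]
top_down = assign_boxes([("subtotal", "100.00"), ("total", "100.00")], rows)
reversed_order = assign_boxes([("total", "100.00")], rows, bottom_up=True)
```

In this sketch the `used` set plays the role of the "taken boxes" memo: the second `100.00` entity cannot reuse the first one's coordinates, and an entity with no free occurrence left maps to `None` rather than sharing a box.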
When to use it
Use when extracting structured fields from scanned invoices or receipts where the same textual value can appear multiple times (e.g., repeated amounts, line-item names). It's useful during OCR post-processing to increase mapping accuracy and avoid misattributing coordinates.
What's included
- Scripts: none detected (has_scripts=false).
- References: none bundled (has_references=false).
- Instructions: operational rules covering duplicate handling via memoization/dynamic programming, reversing the dataframe for amounts sections, multi-token sequence matching, coordinate uniqueness constraints, and test/validation guidance.
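As a rough illustration of the multi-token sequence matching and coordinate uniqueness rules listed above, the sketch below looks for a run of consecutive OCR rows that spells out an entity and rejects any run touching an already-claimed box. The names `match_sequence` and `union_box` and the row layout are hypothetical, chosen only for this example.

```python
from typing import List, Optional, Set, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def match_sequence(
    tokens: List[str], ocr_rows: List[dict], used: Set[Box]
) -> Optional[List[Box]]:
    """Return boxes for the first run of consecutive rows spelling `tokens`.

    Runs that include a box already in `used` are skipped, so two entities
    never claim the same coordinates. Matched boxes are added to `used`.
    """
    n = len(tokens)
    if n == 0:
        return None
    for start in range(len(ocr_rows) - n + 1):
        window = ocr_rows[start:start + n]
        if [row["text"] for row in window] != list(tokens):
            continue
        boxes = [row["box"] for row in window]
        if any(box in used for box in boxes):
            continue  # overlap with a claimed box: try the next occurrence
        used.update(boxes)
        return boxes
    return None


def union_box(boxes: List[Box]) -> Box:
    """Merge per-token boxes into one box covering the whole entity."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Hypothetical rows in which the vendor name "Acme Corp" appears twice.
rows = [
    {"text": "Acme", "box": (0, 0, 20, 10)},
    {"text": "Corp", "box": (25, 0, 45, 10)},
    {"text": "Ltd", "box": (50, 0, 60, 10)},
    {"text": "Acme", "box": (0, 50, 20, 60)},
    {"text": "Corp", "box": (25, 50, 45, 60)},
]
claimed: Set[Box] = set()
first = match_sequence(["Acme", "Corp"], rows, claimed)
second = match_sequence(["Acme", "Corp"], rows, claimed)
```

Sharing one `claimed` set across calls is what keeps the second lookup from landing on the first occurrence's boxes; a real pipeline would likely also need fuzzy text comparison, which is left out here.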
Compatible agents
Relevant to Python-capable coding assistants (Codex, Copilot, GPT-style code assistants) and OCR pipelines that run post-processing scripts. Recommended for teams working with Tesseract output, OCR dataframes, or CV-assisted extraction pipelines.
Information
- Repository: autoskill
- Stars: 277
- Installs: 0