Business Goals

  • Fast, reliable, large-scale extraction of information from digitized unconventional documents

Challenge

  • Template-based extraction cannot be applied because of the specifics of the geophysics domain and the wide variety of document formats.
  • Millions of digitized documents need to be processed in parallel and the required information extracted from images, tables, and text fragments.
  • Images need to be classified into plots, pictures, and tens of domain-specific classes.
  • Line plots and scatter plots need to be transformed into lists of points and the qualities detected in them.
  • Text fragments and tables with typed, hand-written, or printed characters need to be converted into machine-encoded text.
  • The machine-encoded text needs to be sorted by detected topic into multiple domain-specific categories.

Results

  • The Deep Learning classification model (1) was validated with an accuracy of 99.9%.
  • The Deep Learning model ensemble for semantic segmentation (2) demonstrated an accuracy of 78% and IoU of 0.58.
  • The Deep Learning fragment classification model (3) demonstrated 86% accuracy.
  • The Deep Learning Optical Character Recognition (OCR) model ensemble (4) showed an average accuracy of 88%.
  • The NLP Deep Learning text summarization model (5) showed good readability and overall responsiveness, with accuracy close to that of human reviewers.

Implementation Details

  • The Deep Learning classification model (1) separates original digitized documents into hand-written, typed, and printed documents.
  • The ensemble of Deep Learning models for semantic segmentation (2) extracts images, tables, and text fragments from scanned documents and saves them to object storage. The ensemble consists of three Deep Learning segmentation models, each trained to segment the outputs of model (1). The ensemble (2) is triggered as outputs of model (1) arrive.
  • The Deep Learning fragment classification model (3) is invoked as image fragments from ensemble (2) arrive. Model (3) distributes image fragments into plots, pictures, and other domain-specific buckets.
  • The Deep Learning Optical Character Recognition (OCR) model ensemble (4) is invoked as text and table fragments from ensemble (2) arrive. Model ensemble (4) converts text from image fragments into machine-encoded text.
  • The NLP Deep Learning text summarization model (5) distributes machine-encoded text received from model ensemble (4) into multiple domain-specific categories (a simplified end-to-end sketch of the pipeline follows this list).
  • Simultaneous processing of thousands of digitized documents was implemented via containerization of the workspaces and centralized management of the workloads with Kubernetes.
  • Real-time and batch information extraction from digitized documents was implemented as a pipeline of five Deep Learning algorithms, deployed and managed as a series of independent, horizontally scalable containerized workloads. The workloads are created when users upload documents and stopped when the documents have been processed (see the Kubernetes Job sketch below).
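
For illustration only, the sketch below shows how the five stages could be chained for a single document. This is not the production code: the function names, return shapes, and sample values are hypothetical stand-ins for the independently deployed Deep Learning services (1) through (5).

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the five independently deployed services.
def classify_document(scan):                    # model (1)
    return "typed"                              # hand-written / typed / printed


def segment_document(scan, doc_type):           # ensemble (2)
    return {"images": ["img-0"], "tables": ["tbl-0"], "text": ["txt-0"]}


def classify_fragment(image):                   # model (3)
    return "plot"                               # plot / picture / domain-specific class


def run_ocr(fragment, doc_type):                # ensemble (4)
    return f"machine-encoded text from {fragment}"


def summarize_and_classify(text):               # model (5)
    return "well-logging", text                 # (detected topic, summary)


@dataclass
class ExtractionResult:
    doc_type: str
    image_fragments: dict = field(default_factory=dict)   # class -> fragments
    categorized_text: dict = field(default_factory=dict)  # topic -> texts


def process_document(scan) -> ExtractionResult:
    # (1) Classify the scan as hand-written, typed, or printed.
    doc_type = classify_document(scan)

    # (2) Segment the scan into image, table, and text fragments with the
    #     ensemble trained on outputs of model (1).
    fragments = segment_document(scan, doc_type)
    result = ExtractionResult(doc_type=doc_type)

    # (3) Route image fragments into plots, pictures, and domain buckets.
    for image in fragments["images"]:
        result.image_fragments.setdefault(classify_fragment(image), []).append(image)

    # (4) + (5) OCR text and table fragments, then sort the machine-encoded
    #           text by detected topic into domain-specific categories.
    for fragment in fragments["text"] + fragments["tables"]:
        topic, summary = summarize_and_classify(run_ocr(fragment, doc_type))
        result.categorized_text.setdefault(topic, []).append(summary)

    return result


if __name__ == "__main__":
    print(process_document("scan-001.tiff"))
```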

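A minimal sketch, assuming the official kubernetes Python client, of how a per-document containerized workload could be launched as a Kubernetes Job when a user uploads a document and cleaned up once processing finishes. The container image, namespace, resource sizing, and environment variable name are hypothetical.

```python
from kubernetes import client, config


def launch_processing_job(document_uri: str, job_suffix: str) -> None:
    """Create a one-off Kubernetes Job that processes a single uploaded document."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster

    container = client.V1Container(
        name="doc-pipeline",
        image="registry.example.com/doc-pipeline:latest",  # hypothetical image
        env=[client.V1EnvVar(name="DOCUMENT_URI", value=document_uri)],
        resources=client.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi"},
            limits={"nvidia.com/gpu": "1"},                 # illustrative sizing
        ),
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"doc-pipeline-{job_suffix}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
            ),
            backoff_limit=2,
            ttl_seconds_after_finished=300,  # remove the Job once the document is processed
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="doc-processing", body=job)
```
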
Industry

  • Oil & Gas

Keywords

  • Documents Processing
  • NLP
  • Computer Vision
  • Segmentation
Roadmap

  • Discovery: AI Solutions Architect
  • Data Labeling: AI Solutions Architect, Labeling Team
  • Data Lake Development: Data Architect, Data Engineer
  • ETL Pipelines Development: Data Architect, Data Engineer
  • ML Models Development: Computer Vision Engineer
  • AI Services Development: AI Solutions Architect, MLOps
  • Deployment and Integration: MLOps
  • Infrastructure Provisioning Automation: MLOps
  • Release
