Recognizing Text in Rectangular Regions

Introduction

In this tutorial, we’ll explore how to use GroupDocs.Parser for .NET to recognize text within specific rectangular regions of documents. GroupDocs.Parser is a powerful library that allows developers to extract text, metadata, and more from various file formats, including PDF, Word, Excel, and PowerPoint.

Prerequisites

Before we begin, ensure you have the following set up:

  • GroupDocs.Parser for .NET: Download and install the library from here.
  • Development Environment: Visual Studio or any other .NET IDE.
  • Sample Document: Have a sample file (e.g., PDF, DOCX) that contains text to be recognized.

Import Namespaces

First, you’ll need to import the necessary namespaces into your C# code:

using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using Aspose.OCR;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

Step 1: Initialize Parser Settings

Begin by setting up the ParserSettings with the OCR connector. Here, we’ll use the Aspose OCR on-premise connector:

ParserSettings settings = new ParserSettings(new AsposeOcrOnPremise());

Step 2: Create Parser Instance

Next, instantiate the Parser class with the previously defined settings:

using (Parser parser = new Parser("YourSampleFile.pdf", settings))
{
    // Code continues here
}

Replace "YourSampleFile.pdf" with the path to your document.

Step 3: Define OCR Rectangle

Define a rectangle within the document where text recognition will be performed. For example, a rectangle starting at (0, 0) with width 400 and height 200:

OcrOptions ocrOptions = new OcrOptions(new Data.Rectangle(0, 0, 400, 200));

Step 4: Configure Text Recognition Options

Create TextOptions to specify OCR usage along with the defined rectangle:

TextOptions options = new TextOptions(false, true, ocrOptions);

Step 5: Extract Text using OCR

Use the GetText method of the Parser instance with the configured TextOptions:

using (TextReader reader = parser.GetText(options))
{
    // Read extracted text or handle 'not supported' case
    Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
}

Conclusion

In this tutorial, we’ve demonstrated how to leverage GroupDocs.Parser for .NET to extract text from specific rectangular regions in documents using OCR. This process can be further customized and integrated into various applications for automated text extraction tasks.

FAQ’s

Can GroupDocs.Parser extract text from scanned documents?

Yes, GroupDocs.Parser supports OCR (Optical Character Recognition) for extracting text from scanned documents.

What file formats does GroupDocs.Parser support?

GroupDocs.Parser supports a wide range of file formats, including PDF, DOCX, XLSX, PPTX, and more.

How can I handle documents that are not supported for text extraction?

You can check if text extraction is supported using TextReader instance returned by parser.GetText(options).

Is GroupDocs.Parser suitable for large-scale text extraction tasks?

Yes, GroupDocs.Parser is designed to handle large-scale text extraction tasks efficiently.

For support and discussions, visit the GroupDocs.Parser forum.