Recognizing Text in Specific Areas

Introduction

In this tutorial, we will explore how to use GroupDocs.Parser for .NET to recognize and extract text from specific areas within a document. GroupDocs.Parser is a powerful document parsing library that allows developers to work with various document formats, including PDF, Word, Excel, PowerPoint, and more. Specifically, we will focus on leveraging the OCR (Optical Character Recognition) capabilities of GroupDocs.Parser to extract text from defined areas within a document.

Prerequisites

Before we begin, ensure you have the following prerequisites set up:

Visual Studio IDE: Make sure you have Visual Studio installed on your machine.
GroupDocs.Parser for .NET: Download and install GroupDocs.Parser for .NET from the download link.
Document Samples: Prepare sample files (e.g., PDF, DOCX) from which you want to extract text.

Import Namespaces

To get started, import the necessary namespaces into your project:

using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using Aspose.OCR;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

Let’s break down the process into detailed steps using GroupDocs.Parser for .NET:

Step 1: Create Parser Settings with OCR Connector

First, create an instance of ParserSettings class and initialize it with an OCR connector, such as AsposeOcrOnPremise:

ParserSettings settings = new ParserSettings(new AsposeOcrOnPremise());

Step 2: Instantiate Parser with Settings

Next, create an instance of Parser class by passing the previously created ParserSettings:

using (Parser parser = new Parser("YourSampleFile.pdf", settings))
{
    // Code snippet continues...
}

Replace "YourSampleFile.pdf" with the path to your target document.

Step 3: Configure Text Area Extraction Options

Create an instance of PageTextAreaOptions to enable OCR-based text extraction:

PageTextAreaOptions options = new PageTextAreaOptions(true);

Set true to enable OCR for better text recognition.

Step 4: Extract Text Areas

Invoke parser.GetTextAreas(options) to extract text areas from the document:

IEnumerable<PageTextArea> areas = parser.GetTextAreas(options);

Step 5: Process Extracted Text Areas

Iterate over the extracted text areas and retrieve text, position, and size information:

foreach (PageTextArea area in areas)
{
    Console.WriteLine(area.Text);
    Console.WriteLine($"\tPosition: ({area.Rectangle.Left}; {area.Rectangle.Top})");
    Console.WriteLine($"\tSize: ({area.Rectangle.Size.Width}; {area.Rectangle.Size.Height})");
}

Conclusion

In this tutorial, we’ve covered the process of extracting text from specific areas within a document using GroupDocs.Parser for .NET with OCR capabilities. By following these steps, you can effectively leverage the parsing functionalities of GroupDocs.Parser to handle text extraction tasks programmatically.

FAQ’s

Can GroupDocs.Parser extract text from scanned documents?

Yes, GroupDocs.Parser supports OCR for extracting text from scanned images within documents.

Which document formats are supported by GroupDocs.Parser?

GroupDocs.Parser supports a wide range of formats, including PDF, DOCX, XLSX, PPTX, TXT, and more.

Is GroupDocs.Parser suitable for batch processing of documents?

Yes, GroupDocs.Parser can efficiently handle batch processing tasks for document parsing and extraction.

Can I customize text extraction options with GroupDocs.Parser?

Yes, GroupDocs.Parser offers various options to customize text extraction based on specific requirements.

Does GroupDocs.Parser provide support for extracting metadata from documents?

Yes, GroupDocs.Parser allows extraction of metadata like author, creation date, and more from supported document formats.

Recognizing Text Recognizing Text in Rectangular Regions