Extract Text from Specific Areas with Options

Introduction

In this tutorial, we’ll explore how to use GroupDocs.Parser for .NET to extract text from specific areas within a document using customizable options. GroupDocs.Parser is a powerful library that enables developers to parse and extract text from various document formats effortlessly.

Prerequisites

Before we dive into the coding, make sure you have the following:

Development Environment: Install Visual Studio or any other .NET development IDE.
GroupDocs.Parser Library: Download and install GroupDocs.Parser for .NET from here.
Sample File: Prepare a sample document (e.g., PDF, DOCX, etc.) to extract text from.

Import Namespaces

First, you’ll need to import the necessary namespaces to access the GroupDocs.Parser classes and methods.

using System;
using System.Collections.Generic;
using System.Text;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

Step 1: Create an Instance of Parser Class

Initialize an instance of the Parser class by providing the path to your sample file.

using (Parser parser = new Parser("YourSampleFile.pdf"))
{
    // Code for text area extraction will go here
}

Step 2: Define Text Area Extraction Options

Create PageTextAreaOptions to specify the criteria for text extraction.

PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));

In this example:

"\\s[a-z]{2}\\s" is a regular expression pattern to match text areas containing only lowercase letters.
new Rectangle(new Point(0, 0), new Size(300, 100)) defines the rectangle (position and size) on the page from which to extract text.

Step 3: Extract Text Areas

Use the defined options to extract text areas that meet the specified criteria.

IEnumerable<PageTextArea> areas = parser.GetTextAreas(options);

Step 4: Check and Iterate over Extracted Text Areas

Check if text area extraction is supported and then iterate over the extracted areas.

if (areas == null)
{
    Console.WriteLine("Page text areas extraction isn't supported");
    return;
}
foreach (PageTextArea a in areas)
{
    Console.WriteLine(string.Format("Page: {0}, R: {1}, Text: {2}", a.Page.Index, a.Rectangle, a.Text));
}

Conclusion

In this tutorial, we’ve covered how to extract text from specific areas within a document using GroupDocs.Parser for .NET. This library offers extensive capabilities for parsing various document formats, making it a valuable tool for text extraction tasks.

FAQ’s

Can GroupDocs.Parser extract text from scanned documents?

Yes, GroupDocs.Parser supports OCR-based text extraction for scanned documents.

Is GroupDocs.Parser compatible with multiple document formats?

Yes, it can parse and extract text from PDF, DOCX, XLSX, PPTX, and other popular formats.

Does GroupDocs.Parser provide support for .NET Core?

Yes, GroupDocs.Parser is compatible with .NET Core as well as .NET Framework.

Can I extract metadata along with text using GroupDocs.Parser?

Yes, you can extract both textual content and metadata from documents.

Is there a trial version available for GroupDocs.Parser?

Yes, you can get a free trial from here.

Extract Text from Specific Areas on a Page Extract Text by Table of Contents (TOC) Item