Extract Tables from Document Page

Introduction

In this tutorial, we will explore how to extract tables from a document page using GroupDocs.Parser for .NET. GroupDocs.Parser is a powerful library that allows developers to work with various document formats such as PDF, DOCX, XLSX, and more. By leveraging its features, we can efficiently extract structured data like tables from these documents, enabling us to manipulate and analyze the information programmatically.

Prerequisites

Before getting started, ensure you have the following:

  • Visual Studio installed on your machine.
  • Basic understanding of C# and .NET development.
  • GroupDocs.Parser for .NET library, available from the GroupDocs website or as a NuGet package.
  • Access to a sample document (PDF, DOCX, etc.) containing tables for extraction.

Import Namespaces

First, begin by importing the necessary namespaces in your C# project:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;
using GroupDocs.Parser.Templates;

Step 1: Create an Instance of Parser Class

Instantiate the Parser class by providing the path to your sample document:

using (Parser parser = new Parser("YourSampleFile.pdf"))
{
    // Your code continues here...
}

Step 2: Check Document Table Extraction Support

Before proceeding, verify if the document supports table extraction:

if (!parser.Features.Tables)
{
    Console.WriteLine("Document does not support table extraction.");
    return;
}

Step 3: Define Table Layout

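Define the layout of the tables to extract. The two arrays hold the coordinates of the vertical separators (column boundaries) and the horizontal separators (row boundaries), not individual widths and heights, so adjust them to match your document's structure:

TemplateTableLayout layout = new TemplateTableLayout(
    new double[] { 50, 95, 275, 415, 485, 545 },  // Vertical separators (column boundaries)
    new double[] { 325, 340, 365, 395 });         // Horizontal separators (row boundaries)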
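
The two arrays passed to TemplateTableLayout are, as far as the GroupDocs.Parser documentation describes them, separator coordinates rather than individual sizes: six column values bound five columns, and four row values bound three rows. If you have measured column widths and row heights instead, a small conversion can help. The sketch below is plain C# with a hypothetical helper, ToSeparators, that is not part of the library. The widths and heights are illustrative values chosen so that the result matches the arrays above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a GroupDocs.Parser API): turns a starting
// coordinate plus a list of sizes into cumulative separator positions.
static double[] ToSeparators(double start, IReadOnlyList<double> sizes)
{
    var separators = new double[sizes.Count + 1];
    separators[0] = start;
    for (int i = 0; i < sizes.Count; i++)
    {
        // Each separator sits one column (or row) size past the previous one
        separators[i + 1] = separators[i] + sizes[i];
    }
    return separators;
}

// Assumed measurements: five columns starting at x = 50
double[] vertical = ToSeparators(50, new double[] { 45, 180, 140, 70, 60 });
// Assumed measurements: three rows starting at y = 325
double[] horizontal = ToSeparators(325, new double[] { 15, 25, 30 });

Console.WriteLine(string.Join(", ", vertical));
Console.WriteLine(string.Join(", ", horizontal));
```

The resulting arrays can then be passed to the TemplateTableLayout constructor in place of the hard-coded values.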

Step 4: Configure Table Extraction Options

Create options for table extraction using the specified layout:

PageTableAreaOptions options = new PageTableAreaOptions(layout);

Step 5: Retrieve Document Information

Fetch information about the document, including the number of pages:

IDocumentInfo documentInfo = parser.GetDocumentInfo();
if (documentInfo.PageCount == 0)
{
    Console.WriteLine("Document has no pages.");
    return;
}

Step 6: Iterate Over Document Pages

Iterate through each page of the document to extract tables:

for (int pageIndex = 0; pageIndex < documentInfo.PageCount; pageIndex++)
{
    Console.WriteLine($"Page {pageIndex + 1}/{documentInfo.PageCount}");
    // Extract tables from the current page
    IEnumerable<PageTableArea> tables = parser.GetTables(pageIndex, options);
    // Iterate over extracted tables
    foreach (PageTableArea table in tables)
    {
        // Iterate over rows of the table
        for (int row = 0; row < table.RowCount; row++)
        {
            // Iterate over columns of the table
            for (int column = 0; column < table.ColumnCount; column++)
            {
                // Get the table cell
                PageTableAreaCell cell = table[row, column];
                if (cell != null)
                {
                    // Print the text of the table cell
                    Console.Write(cell.Text);
                    Console.Write(" | ");
                }
            }
            Console.WriteLine();
        }
        Console.WriteLine();
    }
}
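
Note that the loop above prints nothing for a null cell, so the remaining cells in that row shift one column to the left. If column alignment matters, one option is to collect the texts of a row first and keep an empty slot for missing cells. The following is a minimal sketch in plain C#. FormatRow is a hypothetical helper, not a library method, and the sample row is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: joins the cell texts of one row and keeps an
// empty slot for missing (null) cells so that columns stay aligned.
static string FormatRow(IEnumerable<string?> cellTexts)
{
    return string.Join(" | ", cellTexts.Select(t => (t ?? string.Empty).Trim()));
}

// Made-up row with a missing middle cell and padded text
var sampleRow = new string?[] { "Item", null, " 42 " };
Console.WriteLine(FormatRow(sampleRow));
```

Inside the row loop of Step 6, the cell texts could be gathered with Enumerable.Range(0, table.ColumnCount).Select(c => table[row, c]?.Text) and passed to FormatRow.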

Conclusion

In this tutorial, we extracted tables from document pages using GroupDocs.Parser for .NET: we created a Parser instance, checked that the document supports table extraction, defined a table layout, and iterated over the tables and cells of each page. The same steps can be added to your own .NET applications to work with structured data stored in documents.

FAQs

Can GroupDocs.Parser extract tables from all types of documents?

Not necessarily. GroupDocs.Parser supports formats such as PDF, DOCX, and XLSX, but table extraction is not available for every document. Check parser.Features.Tables, as in Step 2, before extracting.

Is GroupDocs.Parser for .NET suitable for large-scale document processing?

Yes, GroupDocs.Parser is designed to handle large documents efficiently, making it suitable for processing extensive datasets.

Does GroupDocs.Parser preserve formatting during table extraction?

The code in this tutorial reads only the plain text of each cell (cell.Text) together with its row and column position. Visual formatting such as cell borders, text styles, and alignment is not part of that output.

Can I extract specific tables based on content criteria?

GroupDocs.Parser offers flexible options to target specific tables based on layout templates or content conditions for extraction.
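
One simple way to apply a content condition is to filter after extraction: extract the tables of a page, then keep only those whose cells contain a given keyword. The sketch below is plain C# and assumes the cell texts of a table have already been copied into a two-dimensional array. GridContains is a hypothetical helper, not a GroupDocs.Parser API, and the sample data is made up.

```csharp
using System;

// Hypothetical helper: returns true if any cell text in the grid
// contains the keyword, ignoring case. Null cells are skipped.
static bool GridContains(string?[,] grid, string keyword)
{
    foreach (string? text in grid)
    {
        if (text != null && text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            return true;
        }
    }
    return false;
}

// Made-up cell texts standing in for one extracted table
var invoiceGrid = new string?[,]
{
    { "Invoice No", "Date" },
    { "1001", null }
};

Console.WriteLine(GridContains(invoiceGrid, "invoice"));
Console.WriteLine(GridContains(invoiceGrid, "total"));
```

A table for which the check returns false can simply be skipped in the foreach loop over the extracted tables.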

Is GroupDocs.Parser compatible with .NET Core?

Yes, GroupDocs.Parser is compatible with both .NET Framework and .NET Core environments.