Parser

Constructors

Constructor	Description
Parser(URL url)	Initializes a new instance of the Parser class to extract data from a URL.

Parser(URL url, LoadOptions loadOptions)	Initializes a new instance of the Parser class to extract data from a URL with loadOptions.

Parser(URL url, ParserSettings parserSettings)	Initializes a new instance of the Parser class to extract data from a URL with parserSettings.

Parser(URL url, LoadOptions loadOptions, ParserSettings parserSettings)	Initializes a new instance of the Parser class to extract data from a URL with loadOptions and parserSettings.

Parser(Connection connection)	Initializes a new instance of the Parser class to extract data from a database.

Parser(Connection connection, ParserSettings parserSettings)	Initializes a new instance of the Parser class to extract data from a database.

Parser(EmailConnection connection)	Initializes a new instance of the Parser class.

Parser(EmailConnection connection, ParserSettings parserSettings)	Initializes a new instance of the Parser class.

Parser(String filePath)	Initializes a new instance of the Parser class.

Parser(String filePath, LoadOptions loadOptions)	Initializes a new instance of the Parser class with LoadOptions.

Parser(String filePath, ParserSettings parserSettings)	Initializes a new instance of the Parser class with ParserSettings.

Parser(String filePath, LoadOptions loadOptions, ParserSettings parserSettings)	Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

Parser(InputStream document)	Initializes a new instance of the Parser class.

Parser(InputStream document, LoadOptions loadOptions)	Initializes a new instance of the Parser class with LoadOptions.

Parser(InputStream document, ParserSettings parserSettings)	Initializes a new instance of the Parser class with ParserSettings.

Parser(InputStream document, LoadOptions loadOptions, ParserSettings parserSettings)	Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

Methods

Method	Description
getFileInfo(String filePath)	Returns the general information about a file.

getFileInfo(InputStream document)	Returns the general information about a file.

getFeatures()	Gets the supported features.

getPagePreview(int pageIndex)	Generates a document page preview.

getPagePreview(int pageIndex, PagePreviewOptions options)	Generates a document page preview using customization options.

generatePreview(PreviewOptions previewOptions)	Generates previews of the document pages.

getDocumentInfo()	Returns the general information about the document.

getText()	Extracts a text from the document.

getText(TextOptions options)	Extracts a text from the document using text options (to enable raw fast text extraction mode).

getText(int pageIndex)	Extracts a text from the document page.

getText(int pageIndex, TextOptions options)	Extracts a text from the document page using text options (to enable raw fast text extraction mode).

getFormattedText(FormattedTextOptions options)	Extracts a formatted text from the document.

getFormattedText(int pageIndex, FormattedTextOptions options)	Extracts a formatted text from the document page.

search(String keyword)	Searches a keyword in the document.

search(String keyword, SearchOptions options)	Searches a keyword in the document using search options (regular expression, match case, etc.).

getHighlight(int position, boolean isDirect, HighlightOptions options)	Extracts a highlight from the document.

getToc()	Extracts a table of contents from the document.

getMetadata()	Extracts metadata from the document.

getContainer()	Extracts a container object from the document to work with formats that contain attachments, ZIP archives etc.

getTextAreas()	Extracts text areas from the document.

getTextAreas(PageTextAreaOptions options)	Extracts text areas from the document using customization options (regular expression, match case, etc.).

getTextAreas(int pageIndex)	Extracts text areas from the document page.

getTextAreas(int pageIndex, PageTextAreaOptions options)	Extracts text areas from the document page using customization options (regular expression, match case, etc.).

getImages()	Extracts images from the document.

getImages(PageAreaOptions options)	Extracts images from the document using customization options (to set the rectangular area that contains images).

getImages(int pageIndex)	Extracts images from the document page.

getImages(int pageIndex, PageAreaOptions options)	Extracts images from the document page using customization options (to set the rectangular area that contains images).

getHyperlinks()	Extracts hyperlinks from the document.

getHyperlinks(int pageIndex)	Extracts hyperlinks from the document page.

getHyperlinks(PageAreaOptions options)	Extracts hyperlinks from the document using customization options (to set the rectangular area that contains hyperlinks).

getHyperlinks(int pageIndex, PageAreaOptions options)	Extracts hyperlinks from the document page using customization options (to set the rectangular area that contains hyperlinks).

getBarcodes()	Extracts barcodes from the document.

getBarcodes(int pageIndex)	Extracts barcodes from the document page.

getBarcodes(BarcodeOptions options)	Extracts barcodes from the document using customization options (to set the rectangular area that contains barcodes).

getBarcodes(int pageIndex, BarcodeOptions options)	Extracts barcodes from the document page using customization options (to set the rectangular area that contains barcodes).

getTables(PageTableAreaOptions options)	Extracts tables from the document.

getTables()	Extracts tables from the document, detecting them automatically.

getTables(int pageIndex, PageTableAreaOptions options)	Extracts tables from the document page.

getTables(int pageIndex)	Extracts tables from the document page, detecting them automatically.

generateAdjustmentFields(GenerateTemplateOptions options)	Generates a collection of adjustment TemplateItems for the document.

getWorksheetInfo()	Extracts the info about all worksheets in the spreadsheet.

getWorksheetInfo(int worksheetIndex)	Extracts the info about the worksheet.

getWorksheetCells(int worksheetIndex)	Extracts worksheet cells.

getWorksheetCells(int worksheetIndex, WorksheetOptions options)	Extracts worksheet cells using customization options.

parseByTemplate(Template template)	Parses the document by the user-generated template.

parseByTemplate(Template template, ParseByTemplateOptions options)	Parses the document by the user-generated template with the supplied options.

parseByTemplate(TemplateCollection templates, ParseByTemplateOptions options)	Parses the document by automatically selecting the best-matching template from a collection.

parseForm()	Parses the document form.

getStructure()	Extracts a structured text from the document.

close()	Closes this resource, relinquishing any underlying resources.

Parser(URL url)

public Parser(URL url)

Initializes a new instance of the Parser class to extract data from a URL.

Parameters:

Parameter	Type	Description
url	java.net.URL	The URL the request is sent to.
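
The constructor expects a ready-made java.net.URL, so the address has to be parsed before the parser is created. The following is a minimal sketch of that step using only the JDK. The class name UrlSource, the helper toUrl and the example.com address are assumptions made for illustration, and the Parser call is left as a comment because it needs the library on the classpath:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlSource {
    // Parses the text as a URI first, then converts it to a URL.
    // Invalid syntax, a relative address or an unknown protocol
    // all end up as an IllegalArgumentException.
    static URL toUrl(String address) {
        try {
            return URI.create(address).toURL();
        } catch (MalformedURLException ex) {
            throw new IllegalArgumentException("Unsupported URL: " + address, ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document address
        URL url = toUrl("https://example.com/docs/sample.pdf");
        System.out.println(url.getProtocol() + " " + url.getHost() + " " + url.getPath());
        // With the library available, the URL would then be passed on:
        // try (Parser parser = new Parser(url)) { ... }
    }
}
```

Validating the address up front keeps malformed input from reaching the parser at all.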

Parser(URL url, LoadOptions loadOptions)

public Parser(URL url, LoadOptions loadOptions)

Initializes a new instance of the Parser class to extract data from a URL with loadOptions.

Parameters:

Parameter	Type	Description
url	java.net.URL	The URL the request is sent to.

loadOptions	LoadOptions	The options to open the file.

Parser(URL url, ParserSettings parserSettings)

public Parser(URL url, ParserSettings parserSettings)

Initializes a new instance of the Parser class to extract data from a URL with parserSettings.

Parameters:

Parameter	Type	Description
url	java.net.URL	The URL the request is sent to.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(URL url, LoadOptions loadOptions, ParserSettings parserSettings)

public Parser(URL url, LoadOptions loadOptions, ParserSettings parserSettings)

Initializes a new instance of the Parser class to extract data from a URL with loadOptions and parserSettings.

Parameters:

Parameter	Type	Description
url	java.net.URL	The URL the request is sent to.

loadOptions	LoadOptions	The options to open the file.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(Connection connection)

public Parser(Connection connection)

Initializes a new instance of the Parser class to extract data from a database.

Learn more:

Extract data from databases

The following example shows how to extract data from an SQLite database:

// Create DbConnection object
 java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
 // Create an instance of Parser class to extract tables from the database
 try (Parser parser = new Parser(connection)) {
     // Check if text extraction is supported
     if (!parser.getFeatures().isText()) {
         System.out.println("Text extraction isn't supported.");
         return;
     }
     // Check if toc extraction is supported
     if (!parser.getFeatures().isToc()) {
         System.out.println("Toc extraction isn't supported.");
         return;
     }
     // Get a list of tables
     Iterable<TocItem> toc = parser.getToc();
     // Iterate over tables
     for(TocItem i : toc)
     {
         // Print the table name
         System.out.println(i.extractText());
         // Extract a table content as a text
         try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
         {
             System.out.println(reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
connection	java.sql.Connection	The database connection.

Parser(Connection connection, ParserSettings parserSettings)

public Parser(Connection connection, ParserSettings parserSettings)

Initializes a new instance of the Parser class to extract data from a database.

The following example shows how to extract data from an SQLite database:

// Create DbConnection object
 java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
 // Create an instance of Parser class to extract tables from the database
 try (Parser parser = new Parser(connection)) {
     // Check if text extraction is supported
     if (!parser.getFeatures().isText()) {
         System.out.println("Text extraction isn't supported.");
         return;
     }
     // Check if toc extraction is supported
     if (!parser.getFeatures().isToc()) {
         System.out.println("Toc extraction isn't supported.");
         return;
     }
     // Get a list of tables
     Iterable<TocItem> toc = parser.getToc();
     // Iterate over tables
     for(TocItem i : toc)
     {
         // Print the table name
         System.out.println(i.extractText());
         // Extract a table content as a text
         try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
         {
             System.out.println(reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
connection	java.sql.Connection	The database connection.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(EmailConnection connection)

public Parser(EmailConnection connection)

Initializes a new instance of the Parser class.

Learn more:

Extract emails from remote server via POP, IMAP or Exchange Web Services protocols

The following example shows how to extract emails from Exchange Server:

// Create the connection object for Exchange Web Services protocol
 EmailConnection connection = new EmailEwsConnection(
         "https://outlook.office365.com/ews/exchange.asmx",
         "email@server",
         "password");
 // Create an instance of Parser class to extract emails from the remote server
 try (Parser parser = new Parser(connection)) {
     // Check if container extraction is supported
     if (!parser.getFeatures().isContainer()) {
         System.out.println("Container extraction isn't supported.");
         return;
     }
     // Extract email messages from the server
     Iterable<ContainerItem> emails = parser.getContainer();
     // Iterate over email messages
     for (ContainerItem item : emails) {
         // Create an instance of Parser class for email message
         try (Parser emailParser = item.openParser()) {
             // Extract the email text
             try (TextReader reader = emailParser.getText()) {
                 // Print the email text
                 System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
             }
         }
     }
 }

Parameters:

Parameter	Type	Description
connection	EmailConnection	The email connection.

Parser(EmailConnection connection, ParserSettings parserSettings)

public Parser(EmailConnection connection, ParserSettings parserSettings)

Initializes a new instance of the Parser class.

The following example shows how to extract emails from Exchange Server:

// Create the connection object for Exchange Web Services protocol
 EmailConnection connection = new EmailEwsConnection(
         "https://outlook.office365.com/ews/exchange.asmx",
         "email@server",
         "password");
 // Create an instance of Parser class to extract emails from the remote server
 try (Parser parser = new Parser(connection)) {
     // Check if container extraction is supported
     if (!parser.getFeatures().isContainer()) {
         System.out.println("Container extraction isn't supported.");
         return;
     }
     // Extract email messages from the server
     Iterable<ContainerItem> emails = parser.getContainer();
     // Iterate over email messages
     for (ContainerItem item : emails) {
         // Create an instance of Parser class for email message
         try (Parser emailParser = item.openParser()) {
             // Extract the email text
             try (TextReader reader = emailParser.getText()) {
                 // Print the email text
                 System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
             }
         }
     }
 }

Parameters:

Parameter	Type	Description
connection	EmailConnection	The email connection.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(String filePath)

public Parser(String filePath)

Initializes a new instance of the Parser class.

Learn more:

Load document from local disk

The following example shows how to load the document from the local disk:

// Set the filePath
 String filePath = Constants.SamplePdf;
 // Create an instance of Parser class with the filePath
 try (Parser parser = new Parser(filePath)) {
     // Extract a text into the reader
     try (TextReader reader = parser.getText()) {
         // Print a text from the document
         // If text extraction isn't supported, a reader is null
         System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
     }
 }

Parameters:

Parameter	Type	Description
filePath	java.lang.String	The path to the file.

Parser(String filePath, LoadOptions loadOptions)

public Parser(String filePath, LoadOptions loadOptions)

Initializes a new instance of the Parser class with LoadOptions.

The document password is passed through the LoadOptions class:

try {
     String password = "123456";
     // Create an instance of Parser class with the password:
     try (Parser parser = new Parser(Constants.SamplePassword, new LoadOptions(password))) {
         // Check if text extraction is supported
         if (!parser.getFeatures().isText()) {
             System.out.println("Text extraction isn't supported.");
             return;
         }
         // Print the document text
         try (TextReader reader = parser.getText()) {
             System.out.println(reader.readToEnd());
         }
     }
 } catch (InvalidPasswordException ex) {
     // Print the message if the password is incorrect or empty
     System.out.println("Invalid password");
 }

Parameters:

Parameter	Type	Description
filePath	java.lang.String	The path to the file.

loadOptions	LoadOptions	The options to open the file.

Parser(String filePath, ParserSettings parserSettings)

public Parser(String filePath, ParserSettings parserSettings)

Initializes a new instance of the Parser class with ParserSettings.

Parameters:

Parameter	Type	Description
filePath	java.lang.String	The path to the file.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(String filePath, LoadOptions loadOptions, ParserSettings parserSettings)

public Parser(String filePath, LoadOptions loadOptions, ParserSettings parserSettings)

Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

The following example shows how to receive information through the ILogger interface:

try {
     // Create an instance of Logger class
     Logger logger = new Logger();
     // Create an instance of Parser class with the parser settings
     try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
         // Check if text extraction is supported
         if (!parser.getFeatures().isText()) {
             System.out.println("Text extraction isn't supported.");
             return;
         }
         // Print the document text
         try (TextReader reader = parser.getText()) {
             System.out.println(reader.readToEnd());
         }
     }
 } catch (InvalidPasswordException | IOException ex) {
     ; // Ignore the exception
 }

 class Logger implements ILogger {
     public void error(String message, Exception exception) {
         // Print error message
         System.out.println("Error: " + message);
     }

     public void trace(String message) {
         // Print event message
         System.out.println("Event: " + message);
     }

     public void warning(String message) {
         // Print warning message
         System.out.println("Warning: " + message);
     }
 }

Parameters:

Parameter	Type	Description
filePath	java.lang.String	The path to the file.

loadOptions	LoadOptions	The options to open the file.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(InputStream document)

public Parser(InputStream document)

Initializes a new instance of the Parser class.

Learn more:

Load document from stream

The following example shows how to load the document from the stream:

// Create the stream
 try (InputStream stream = new FileInputStream(Constants.SamplePdf)) {
     // Create an instance of Parser class with the stream
     try (Parser parser = new Parser(stream)) {
         // Extract a text into the reader
         try (TextReader reader = parser.getText()) {
             // Print a text from the document
             // If text extraction isn't supported, a reader is null
             System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
document	java.io.InputStream	The source input stream.
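
The stream does not have to come from a file. When the document bytes are already in memory (for example, after a download), they can be wrapped in a stream first. This is a small JDK-only sketch; the class InMemorySource and the helper open are hypothetical names, the sample bytes are placeholders, and the Parser call is shown only as a comment because it needs the library on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class InMemorySource {
    // Wraps bytes that are already in memory in an InputStream
    static InputStream open(byte[] content) {
        return new ByteArrayInputStream(content);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder content standing in for real document bytes
        byte[] content = "Sample text".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = open(content)) {
            System.out.println(stream.available() + " bytes ready");
            // With the library available, the stream would then be passed on:
            // try (Parser parser = new Parser(stream)) { ... }
        }
    }
}
```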

Parser(InputStream document, LoadOptions loadOptions)

public Parser(InputStream document, LoadOptions loadOptions)

Initializes a new instance of the Parser class with LoadOptions.

In some cases it’s necessary to define FileFormat explicitly, both for special cases (databases, email servers) and for detecting the file type by its content:

// Create an instance of Parser class for markdown document
 try (Parser parser = new Parser(stream, new LoadOptions(FileFormat.Markup))) {
     // Check if text extraction is supported
     if (!parser.getFeatures().isText()) {
         System.out.println("Text extraction isn't supported.");
         return;
     }
     try (TextReader reader = parser.getText()) {
         // Print the document text
         // Markdown is detected; text without special symbols is printed
         System.out.println(reader.readToEnd());
     }
 }

Parameters:

Parameter	Type	Description
document	java.io.InputStream	The source input stream.

loadOptions	LoadOptions	The options to open the file.

Parser(InputStream document, ParserSettings parserSettings)

public Parser(InputStream document, ParserSettings parserSettings)

Initializes a new instance of the Parser class with ParserSettings.

Parameters:

Parameter	Type	Description
document	java.io.InputStream	The source input stream.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

Parser(InputStream document, LoadOptions loadOptions, ParserSettings parserSettings)

public Parser(InputStream document, LoadOptions loadOptions, ParserSettings parserSettings)

Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

The following example shows how to receive information through the ILogger interface:

try {
     // Create an instance of Logger class
     Logger logger = new Logger();
     // Create an instance of Parser class with the parser settings
     try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
         // Check if text extraction is supported
         if (!parser.getFeatures().isText()) {
             System.out.println("Text extraction isn't supported.");
             return;
         }
         // Print the document text
         try (TextReader reader = parser.getText()) {
             System.out.println(reader.readToEnd());
         }
     }
 } catch (InvalidPasswordException | IOException ex) {
     ; // Ignore the exception
 }

 class Logger implements ILogger {
     public void error(String message, Exception exception) {
         // Print error message
         System.out.println("Error: " + message);
     }

     public void trace(String message) {
         // Print event message
         System.out.println("Event: " + message);
     }

     public void warning(String message) {
         // Print warning message
         System.out.println("Warning: " + message);
     }
 }

Parameters:

Parameter	Type	Description
document	java.io.InputStream	The source input stream.

loadOptions	LoadOptions	The options to open the file.

parserSettings	ParserSettings	The parser settings which are used to customize data extraction.

getFileInfo(String filePath)

public static FileInfo getFileInfo(String filePath)

Returns the general information about a file.

The following code shows how to check whether a file is password-protected:

// Get a file info
 FileInfo info = Parser.getFileInfo(filePath);
 // Check IsEncrypted property
 System.out.println(info.isEncrypted() ? "Password is required" : "");

Parameters:

Parameter	Type	Description
filePath	java.lang.String	The path to the file.

Returns: FileInfo - An instance of FileInfo class.

getFileInfo(InputStream document)

public static FileInfo getFileInfo(InputStream document)

Returns the general information about a file.

The following code shows how to check whether a file is password-protected:

// Get a file info from the stream
 FileInfo info = Parser.getFileInfo(document);
 // Check IsEncrypted property
 System.out.println(info.isEncrypted() ? "Password is required" : "");

Parameters:

Parameter	Type	Description
document	java.io.InputStream	The source input stream.

Returns: FileInfo - An instance of FileInfo class.

getFeatures()

public Features getFeatures()

Gets the supported features.

Learn more:

Get supported features

If a feature isn’t supported, the corresponding extraction method returns null instead of a value. Some extraction operations can take significant time, so calling a method only to find out whether the feature is supported is inefficient. Use the getFeatures() method for that check instead:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleZip)) {
     // Check if text extraction is supported for the document
     if (!parser.getFeatures().isText()) {
         System.out.println("Text extraction isn't supported");
         return;
     }
     // Extract a text from the document
     try (TextReader reader = parser.getText()) {
         System.out.println(reader.readToEnd());
     }
 }

Returns: Features - An instance of Features class that represents the supported features.

getPagePreview(int pageIndex)

public OutputStream getPagePreview(int pageIndex)

Generates a document page preview.

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.io.OutputStream - An instance of java.io.OutputStream containing an image of the document page; null if the page preview generation isn’t supported.

getPagePreview(int pageIndex, PagePreviewOptions options)

public OutputStream getPagePreview(int pageIndex, PagePreviewOptions options)

Generates a document page preview using customization options.

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	PagePreviewOptions	The options to customize the preview generation.

Returns: java.io.OutputStream - An instance of java.io.OutputStream containing an image of the document page; null if the page preview generation isn’t supported.

generatePreview(PreviewOptions previewOptions)

public void generatePreview(PreviewOptions previewOptions)

Generates previews of the document pages.

Parameters:

Parameter	Type	Description
previewOptions	PreviewOptions	The options that set the requirements and stream delegates for preview generation.

getDocumentInfo()

public IDocumentInfo getDocumentInfo()

Returns the general information about the document.

The following example shows how to get document info:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleDocx)) {
     // Get the document info
     IDocumentInfo info = parser.getDocumentInfo();
     // Print document information
     System.out.println(String.format("FileType: %s", info.getFileType()));
     System.out.println(String.format("PageCount: %d", info.getPageCount()));
     System.out.println(String.format("Size: %d", info.getSize()));
 }

Returns: IDocumentInfo - An instance of a class that implements the IDocumentInfo interface.

getText()

public TextReader getText()

Extracts a text from the document.

The following example shows how to extract a text from a document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Extract a text into the reader
     try (TextReader reader = parser.getText()) {
         // Print a text from the document
         // If text extraction isn't supported, a reader is null
         System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
     }
 }

Returns: TextReader - An instance of TextReader class with the extracted text; null if text extraction isn’t supported.

getText(TextOptions options)

public TextReader getText(TextOptions options)

Extracts a text from the document using text options (to enable raw fast text extraction mode).

The following example shows how to extract a raw text from a document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Extract a raw text into the reader
     try (TextReader reader = parser.getText(new TextOptions(true))) {
         // Print a text from the document
         // If text extraction isn't supported, a reader is null
         System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
     }
 }

Parameters:

Parameter	Type	Description
options	TextOptions	The text extraction options.

Returns: TextReader - An instance of TextReader class with the extracted text; null if text extraction isn’t supported.

getText(int pageIndex)

public TextReader getText(int pageIndex)

Extracts a text from the document page.

Learn more:

Extract text in Accurate Mode

The following example shows how to extract a text from the document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Check if the document supports text extraction
     if (!parser.getFeatures().isText()) {
         System.out.println("Document doesn't support text extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int p = 0; p < documentInfo.getPageCount(); p++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
         // Extract a text into the reader
         try (TextReader reader = parser.getText(p)) {
             // Print a text from the document
             // No null check is needed: text extraction support was verified above
             System.out.println(reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: TextReader - An instance of TextReader class with the extracted text; null if text page extraction isn’t supported.
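
Because pageIndex is zero-based, a page number taken from user input or from a printed table of contents has to be shifted by one before it is passed to the method. A minimal sketch of that conversion follows; the class PageIndex and the helper toIndex are hypothetical, and in real code the page count would come from getDocumentInfo().getPageCount():

```java
class PageIndex {
    // Converts a one-based page number to the zero-based index,
    // rejecting numbers outside the range 1..pageCount
    static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IndexOutOfBoundsException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // Assumed page count, used here only for illustration
        int pageCount = 5;
        System.out.println(toIndex(1, pageCount)); // first page
        System.out.println(toIndex(5, pageCount)); // last page
    }
}
```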

getText(int pageIndex, TextOptions options)

public TextReader getText(int pageIndex, TextOptions options)

Extracts a text from the document page using text options (to enable raw fast text extraction mode).

The following example shows how to extract a raw text from the document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Check if the document supports text extraction
     if (!parser.getFeatures().isText()) {
         System.out.println("Document doesn't support text extraction.");
         return;
     }
     // Get the document info
     DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
             ? (DocumentInfo) parser.getDocumentInfo()
             : null;
     // Check if the document has pages
     if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
         // Extract a text into the reader
         try (TextReader reader = parser.getText(p, new TextOptions(true))) {
             // Print a text from the document
             // No null check is needed: text extraction support was verified above
             System.out.println(reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	TextOptions	The text extraction options.

Returns: TextReader - An instance of TextReader class with the extracted text; null if text page extraction isn’t supported.

getFormattedText(FormattedTextOptions options)

public TextReader getFormattedText(FormattedTextOptions options)

Extracts a formatted text from the document.

Learn more:

Extract formatted text from document
Extract a document text as HTML
Extract a document text as Markdown
Extract a document text as Plain text

The following example shows how to extract a document text as HTML text:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleDocx)) {
     // Extract a formatted text into the reader
     try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
         // Print a formatted text from the document
         // If formatted text extraction isn't supported, a reader is null
         System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
     }
 }

Parameters:

Parameter	Type	Description
options	FormattedTextOptions	The formatted text extraction options.

Returns: TextReader - An instance of TextReader class with the extracted text; null if formatted text extraction isn’t supported.

getFormattedText(int pageIndex, FormattedTextOptions options)

public TextReader getFormattedText(int pageIndex, FormattedTextOptions options)

Extracts a formatted text from the document page.

Learn more:

Extract formatted text from document page
Extract a document text as HTML
Extract a document text as Markdown
Extract a document text as Plain text

The following example shows how to extract a document page text as Markdown text:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleDocx)) {
     // Check if the document supports formatted text extraction
     if (!parser.getFeatures().isFormattedText()) {
         System.out.println("Document doesn't support formatted text extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int p = 0; p < documentInfo.getPageCount(); p++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
         // Extract a formatted text into the reader
         try (TextReader reader = parser.getFormattedText(p, new FormattedTextOptions(FormattedTextMode.Markdown))) {
             // Print a formatted text from the document
             // We ignore null-checking as we have checked formatted text extraction feature support earlier
             System.out.println(reader.readToEnd());
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	FormattedTextOptions	The formatted text extraction options.

Returns: TextReader - An instance of TextReader class with the extracted text; null if formatted text page extraction isn’t supported.

search(String keyword)

public Iterable<SearchResult> search(String keyword)

Searches a keyword in the document.

The following example shows how to find a keyword in a document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Search a keyword:
     Iterable<SearchResult> sr = parser.search("lorem");
     // Check if search is supported
     if (sr == null) {
         System.out.println("Search isn't supported");
         return;
     }
     // Iterate over search results
     for (SearchResult s : sr) {
         // Print an index and found text:
         System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
     }
 }

Parameters:

Parameter	Type	Description
keyword	java.lang.String	The keyword to search.

Returns: java.lang.Iterable<com.groupdocs.parser.data.SearchResult> - A collection of SearchResult objects; null if search isn’t supported.

search(String keyword, SearchOptions options)

public Iterable<SearchResult> search(String keyword, SearchOptions options)

Searches a keyword in the document using search options (regular expression, match case, etc.).

The following example shows how to search with a regular expression in a document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Search with a regular expression with case matching
     Iterable<SearchResult> sr = parser.search("[0-9]+", new SearchOptions(true, false, true));
     // Check if search is supported
     if (sr == null) {
         System.out.println("Search isn't supported");
         return;
     }
     // Iterate over search results
     for (SearchResult s : sr) {
         // Print an index and found text:
         System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
     }
 }

The following example shows how to search a text on pages:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Search a keyword with page numbers
     Iterable<SearchResult> sr = parser.search("lorem", new SearchOptions(false, false, false, true));
     // Check if search is supported
     if (sr == null) {
         System.out.println("Search isn't supported");
         return;
     }
     // Iterate over search results
     for (SearchResult s : sr) {
         // Print an index, page number and found text:
         System.out.println(String.format("At %d (%d): %s", s.getPosition(), s.getPageIndex(), s.getText()));
     }
 }

Parameters:

Parameter	Type	Description
keyword	java.lang.String	The keyword to search.

options	SearchOptions	The search options.

Returns: java.lang.Iterable<com.groupdocs.parser.data.SearchResult> - A collection of SearchResult objects; null if search isn’t supported.
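
To make the position/text pairs printed by the examples above more concrete, here is a minimal sketch that runs a regular-expression search over an in-memory string using only java.util.regex. It is not the library's implementation: the RegexSearchSketch class, its Hit type and the sample text are hypothetical, and the way positions are computed for a real document may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSearchSketch {
    /** Hypothetical stand-in for a search result: zero-based position plus matched text. */
    static final class Hit {
        final int position;
        final String text;

        Hit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    /** Finds every match of the regular expression in the text, scanning left to right. */
    static List<Hit> search(String text, String regex, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        List<Hit> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(new Hit(matcher.start(), matcher.group()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Print an index and found text for every run of digits
        for (Hit hit : search("Invoice 42 issued on day 7", "[0-9]+", true)) {
            System.out.println(String.format("At %d: %s", hit.position, hit.text));
        }
    }
}
```

In this sketch the matchCase flag only toggles Pattern.CASE_INSENSITIVE; it is meant to mirror the idea of a case-sensitive search, not the exact behavior of SearchOptions.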

getHighlight(int position, boolean isDirect, HighlightOptions options)

public HighlightItem getHighlight(int position, boolean isDirect, HighlightOptions options)

Extracts a highlight from the document.

Learn more:

Extract highlights

The following example shows how to extract a highlight that contains 3 words:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SamplePdf)) {
     // Extract a highlight:
     HighlightItem hl = parser.getHighlight(2, true, new HighlightOptions(10, 3));
     // Check if highlight extraction is supported
     if (hl == null) {
         System.out.println("Highlight extraction isn't supported");
         return;
     }
     // Print an extracted highlight
     System.out.println(String.format("At %d: %s", hl.getPosition(), hl.getText()));
 }

Parameters:

Parameter	Type	Description
position	int	The start position of the highlight.

isDirect	boolean	The value that indicates whether highlight extraction is direct. true if the highlight is extracted to the right of the position; otherwise, false.

options	HighlightOptions	The highlight extraction options.

Returns: HighlightItem - An instance of HighlightItem class that represents the extracted highlight; null if highlight extraction isn’t supported.

getToc()

public Iterable<TocItem> getToc()

Extracts a table of contents from the document.

The following example shows how to extract a table of contents from an EPUB file:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleEpub)) {
     // Check if text extraction is supported
     if (!parser.getFeatures().isText()) {
         System.out.println("Text extraction isn't supported.");
         return;
     }
     // Check if toc extraction is supported
     if (!parser.getFeatures().isToc()) {
         System.out.println("Toc extraction isn't supported.");
         return;
     }
     // Get table of contents
     Iterable<TocItem> toc = parser.getToc();
     // Iterate over items
     for (TocItem i : toc) {
         // Print the Toc text
         System.out.println(i.getText());
         // Check if page index has a value
         if (i.getPageIndex() == null) {
             continue;
         }
         // Extract a page text
         try (TextReader reader = parser.getText(i.getPageIndex())) {
             System.out.println(reader.readToEnd());
         }
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.TocItem> - A collection of table of contents items; null if table of contents extraction isn’t supported.

getMetadata()

public Iterable<MetadataItem> getMetadata()

Extracts metadata from the document.

The following example shows how to extract metadata from a document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleDocx)) {
     // Extract metadata from the document
     Iterable<MetadataItem> metadata = parser.getMetadata();
     // Check if metadata extraction is supported
     if (metadata == null) {
         System.out.println("Metadata extraction isn't supported");
         return;
     }
     // Iterate over metadata items
     for (MetadataItem item : metadata) {
         // Print an item name and value
         System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.MetadataItem> - A collection of metadata items; null if metadata extraction isn’t supported.
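
Because the extraction methods on this page return null when a feature isn't supported, a caller may prefer to normalize that result to an empty sequence before looping. The following is a minimal sketch of such a helper; NullSafe and orEmpty are hypothetical names and are not part of GroupDocs.Parser.

```java
import java.util.Collections;

final class NullSafe {
    private NullSafe() {
    }

    /** Returns the given iterable, or an empty one when it is null. */
    static <T> Iterable<T> orEmpty(Iterable<T> items) {
        return items != null ? items : Collections.<T>emptyList();
    }

    public static void main(String[] args) {
        // Stands in for a null result from an unsupported extraction feature
        Iterable<String> unsupported = null;
        for (String item : orEmpty(unsupported)) {
            System.out.println(item); // never reached: the sequence is empty
        }
        System.out.println("done");
    }
}
```

With a helper like this, a loop such as `for (MetadataItem item : NullSafe.orEmpty(parser.getMetadata()))` would simply be skipped instead of throwing a NullPointerException. An explicit null check remains the better choice when the caller needs to report that the feature is unsupported.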

getContainer()

public Iterable<ContainerItem> getContainer()

Extracts a container object from the document to work with formats that contain attachments, ZIP archives etc.

The following example shows how to extract a text from zip entities:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleZip)) {
     // Extract attachments from the container
     Iterable<ContainerItem> attachments = parser.getContainer();
     // Check if container extraction is supported
     if (attachments == null) {
         System.out.println("Container extraction isn't supported");
         return;
     }
     // Iterate over zip entities
     for (ContainerItem item : attachments) {
         // Print the file path
         System.out.println(item.getFilePath());
         try {
             // Create Parser object for the zip entity content
             try (Parser attachmentParser = item.openParser()) {
                 // Extract the text of the zip entity
                 try (TextReader reader = attachmentParser.getText()) {
                     System.out.println(reader == null ? "No text" : reader.readToEnd());
                 }
             }
         } catch (UnsupportedDocumentFormatException ex) {
             System.out.println("Isn't supported.");
         }
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.ContainerItem> - A collection of container items; null if container extraction isn’t supported.

getTextAreas()

public Iterable<PageTextArea> getTextAreas()

Extracts text areas from the document.

Learn more:

Extract text areas

The following example shows how to extract all text areas from the whole document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Extract text areas
     Iterable<PageTextArea> areas = parser.getTextAreas();
     // Check if text areas extraction is supported
     if (areas == null) {
         System.out.println("Page text areas extraction isn't supported");
         return;
     }
     // Iterate over page text areas
     for (PageTextArea a : areas) {
         // Print a page index, rectangle and text area value:
         System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.

getTextAreas(PageTextAreaOptions options)

public Iterable<PageTextArea> getTextAreas(PageTextAreaOptions options)

Extracts text areas from the document using customization options (regular expression, match case, etc.).

Learn more:

Extract text areas

The following example shows how to extract only the text areas in the upper-left corner that match a regular expression (here, a two-letter lowercase word surrounded by whitespace):

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Create the options which are used for text area extraction
     PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));
     // Extract text areas from the upper-left corner of a page that match the expression:
     Iterable<PageTextArea> areas = parser.getTextAreas(options);
     // Check if text areas extraction is supported
     if (areas == null) {
         System.out.println("Page text areas extraction isn't supported");
         return;
     }
     // Iterate over page text areas
     for (PageTextArea a : areas) {
         // Print a page index, rectangle and text area value:
         System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
     }
 }

Parameters:

Parameter	Type	Description
options	PageTextAreaOptions	The options for text area extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.
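
The expression passed to PageTextAreaOptions in the example above is an ordinary Java regular expression. The sketch below uses only java.util.regex to show which strings it matches. TextAreaPatternSketch and the sample strings are hypothetical, and the sketch assumes "contains a match" semantics (Matcher.find), which the source does not confirm for the library.

```java
import java.util.regex.Pattern;

final class TextAreaPatternSketch {
    // Same expression as in the example above: whitespace, two lowercase letters, whitespace
    static final Pattern TWO_LETTERS = Pattern.compile("\\s[a-z]{2}\\s");

    /** True when the text contains a two-letter lowercase word surrounded by whitespace. */
    static boolean containsMatch(String text) {
        return TWO_LETTERS.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("Total of items"));   // " of " matches
        System.out.println(containsMatch("Invoice 2024"));     // no two-letter word
        System.out.println(containsMatch("Invoice 12 items")); // digits are not in [a-z]
    }
}
```

A pattern such as "[0-9]+" would be the usual choice if the goal were to keep only text areas that contain digits.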

getTextAreas(int pageIndex)

public Iterable<PageTextArea> getTextAreas(int pageIndex)

Extracts text areas from the document page.

Learn more:

Extract text areas

The following example shows how to extract text areas from a document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Check if the document supports text areas extraction
     if (!parser.getFeatures().isTextAreas()) {
         System.out.println("Document doesn't support text areas extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
         // Iterate over page text areas
         // We ignore null-checking as we have checked text areas extraction feature support earlier
         for (PageTextArea a : parser.getTextAreas(pageIndex)) {
             // Print a rectangle and text area value:
             System.out.println(String.format("R: %s, Text: %s", a.getRectangle(), a.getText()));
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.

getTextAreas(int pageIndex, PageTextAreaOptions options)

public Iterable<PageTextArea> getTextAreas(int pageIndex, PageTextAreaOptions options)

Extracts text areas from the document page using customization options (regular expression, match case, etc.).

Learn more:

Extract text areas

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	PageTextAreaOptions	The options for text area extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.

getImages()

public Iterable<PageImageArea> getImages()

Extracts images from the document.

The following example shows how to extract all images from the whole document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Extract images
     Iterable<PageImageArea> images = parser.getImages();
     // Check if images extraction is supported
     if (images == null) {
         System.out.println("Images extraction isn't supported");
         return;
     }
     // Iterate over images
     for (PageImageArea image : images) {
         // Print a page index, rectangle and image type:
         System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.

getImages(PageAreaOptions options)

public Iterable<PageImageArea> getImages(PageAreaOptions options)

Extracts images from the document using customization options (to set the rectangular area that contains images).

The following example shows how to extract only images from the upper-left corner:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Create the options which are used for images extraction
     PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(340, 150), new Size(300, 100)));
     // Extract images from the upper-left corner of a page:
     Iterable<PageImageArea> images = parser.getImages(options);
     // Check if images extraction is supported
     if (images == null) {
         System.out.println("Page images extraction isn't supported");
         return;
     }
     // Iterate over images
     for (PageImageArea image : images) {
         // Print a page index, rectangle and image type:
         System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
     }
 }

Parameters:

Parameter	Type	Description
options	PageAreaOptions	The options for images extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.
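
To see which part of the page a rectangle such as the one in the example above covers, it can help to compute its edges. The sketch below is plain arithmetic on a hypothetical Area type, not the library's Rectangle class. It assumes the point is the upper-left corner of the area and that the page origin is at the top-left. Whether the library selects items by intersection or by full containment is not stated in the source; intersection is shown only as an illustration.

```java
final class AreaBoundsSketch {
    /** Hypothetical axis-aligned area; not a GroupDocs.Parser type. */
    static final class Area {
        final double left;
        final double top;
        final double width;
        final double height;

        Area(double left, double top, double width, double height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }

        double right() {
            return left + width;
        }

        double bottom() {
            return top + height;
        }

        /** True when the two areas overlap in more than a shared edge. */
        boolean intersects(Area other) {
            return left < other.right() && other.left < right()
                    && top < other.bottom() && other.top < bottom();
        }
    }

    public static void main(String[] args) {
        // Same numbers as Rectangle(new Point(340, 150), new Size(300, 100))
        Area area = new Area(340, 150, 300, 100);
        System.out.println("x: " + area.left + " to " + area.right());
        System.out.println("y: " + area.top + " to " + area.bottom());
    }
}
```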

getImages(int pageIndex)

public Iterable<PageImageArea> getImages(int pageIndex)

Extracts images from the document page.

The following example shows how to extract images from a document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
     // Check if the document supports images extraction
     if (!parser.getFeatures().isImages()) {
         System.out.println("Document doesn't support images extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
         // Iterate over images
         // We ignore null-checking as we have checked images extraction feature support earlier
         for (PageImageArea image : parser.getImages(pageIndex)) {
             // Print a rectangle and image type
             System.out.println(String.format("R: %s, Type: %s", image.getRectangle(), image.getFileType()));
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.

getImages(int pageIndex, PageAreaOptions options)

public Iterable<PageImageArea> getImages(int pageIndex, PageAreaOptions options)

Extracts images from the document page using customization options (to set the rectangular area that contains images).

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	PageAreaOptions	The options for images extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.

getHyperlinks()

public Iterable<PageHyperlinkArea> getHyperlinks()

Extracts hyperlinks from the document.

The following example shows how to extract all hyperlinks from the whole document:

// Create an instance of Parser class
 try (Parser parser = new Parser(filePath)) {
     // Check if the document supports hyperlink extraction
     if (!parser.getFeatures().isHyperlinks()) {
         System.out.println("Document doesn't support hyperlink extraction.");
         return;
     }
     // Extract hyperlinks from the document
     Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks();
     // Iterate over hyperlinks
     for (PageHyperlinkArea h : hyperlinks) {
         // Print the hyperlink text
         System.out.println(h.getText());
         // Print the hyperlink URL
         System.out.println(h.getUrl());
         System.out.println();
     }
 }

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.

getHyperlinks(int pageIndex)

public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex)

Extracts hyperlinks from the document page.

The following example shows how to extract hyperlinks from the document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(filePath)) {
     // Check if the document supports hyperlink extraction
     if (!parser.getFeatures().isHyperlinks()) {
         System.out.println("Document doesn't support hyperlink extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
         // Extract hyperlinks from the document page
         Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex);
         // Iterate over hyperlinks
         for (PageHyperlinkArea h : hyperlinks) {
             // Print the hyperlink text
             System.out.println(h.getText());
             // Print the hyperlink URL
             System.out.println(h.getUrl());
             System.out.println();
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.

getHyperlinks(PageAreaOptions options)

public Iterable<PageHyperlinkArea> getHyperlinks(PageAreaOptions options)

Extracts hyperlinks from the document using customization options (to set the rectangular area that contains hyperlinks).

The following example shows how to extract hyperlinks from the document page area:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.HyperlinksPdf)) {
     // Check if the document supports hyperlink extraction
     if (!parser.getFeatures().isHyperlinks()) {
         System.out.println("Document doesn't support hyperlink extraction.");
         return;
     }
     // Create the options which are used for hyperlink extraction
     PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
     // Extract hyperlinks from the document page area
     Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(options);
     // Iterate over hyperlinks
     for (PageHyperlinkArea h : hyperlinks) {
         // Print the hyperlink text
         System.out.println(h.getText());
         // Print the hyperlink URL
         System.out.println(h.getUrl());
         System.out.println();
     }
 }

Parameters:

Parameter	Type	Description
options	PageAreaOptions	The options for hyperlinks extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.

getHyperlinks(int pageIndex, PageAreaOptions options)

public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex, PageAreaOptions options)

Extracts hyperlinks from the document page using customization options (to set the rectangular area that contains hyperlinks).

The following example shows how to extract hyperlinks from the document page area using customization options:

// Create an instance of Parser class
 try (Parser parser = new Parser(filePath)) {
     // Check if the document supports hyperlink extraction
     if (!parser.getFeatures().isHyperlinks()) {
         System.out.println("Document doesn't support hyperlink extraction.");
         return;
     }
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Create the options which are used for hyperlink extraction
     PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
     // Iterate over pages
     for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
         // Extract hyperlinks from the document page
         Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex, options);
         // Iterate over hyperlinks
         for (PageHyperlinkArea h : hyperlinks) {
             // Print the hyperlink text
             System.out.println(h.getText());
             // Print the hyperlink URL
             System.out.println(h.getUrl());
             System.out.println();
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	PageAreaOptions	The options for hyperlinks extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.

getBarcodes()

public Iterable<PageBarcodeArea> getBarcodes()

Extracts barcodes from the document.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.

getBarcodes(int pageIndex)

public Iterable<PageBarcodeArea> getBarcodes(int pageIndex)

Extracts barcodes from the document page.

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.

getBarcodes(BarcodeOptions options)

public Iterable<PageBarcodeArea> getBarcodes(BarcodeOptions options)

Extracts barcodes from the document using customization options (to set the rectangular area that contains barcodes).

Parameters:

Parameter	Type	Description
options	BarcodeOptions	The options for barcodes extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.

getBarcodes(int pageIndex, BarcodeOptions options)

public Iterable<PageBarcodeArea> getBarcodes(int pageIndex, BarcodeOptions options)

Extracts barcodes from the document page using customization options (to set the rectangular area that contains barcodes).

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	BarcodeOptions	The options for barcodes extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.

getTables(PageTableAreaOptions options)

public Iterable<PageTableArea> getTables(PageTableAreaOptions options)

Extracts tables from the document.

The following example shows how to extract tables from the whole document:

// Create an instance of Parser class
 try (Parser parser = new Parser(filePath)) {
     // Check if the document supports table extraction
     if (!parser.getFeatures().isTables()) {
         System.out.println("Document doesn't support table extraction.");
         return;
     }
     // Create the layout of tables
     TemplateTableLayout layout = new TemplateTableLayout(
             java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
             java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
     // Create the options for table extraction
     PageTableAreaOptions options = new PageTableAreaOptions(layout);
     // Extract tables from the document
     Iterable<PageTableArea> tables = parser.getTables(options);
     // Iterate over tables
     for (PageTableArea t : tables) {
         // Iterate over rows
         for (int row = 0; row < t.getRowCount(); row++) {
             // Iterate over columns
             for (int column = 0; column < t.getColumnCount(); column++) {
                 // Get the table cell
                 PageTableAreaCell cell = t.getCell(row, column);
                 if (cell != null) {
                     // Print the table cell text
                     System.out.print(cell.getText());
                     System.out.print(" | ");
                 }
             }
             System.out.println();
         }
         System.out.println();
     }
 }

Parameters:

Parameter	Type	Description
options	PageTableAreaOptions	The options for tables extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.
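
The two separator lists passed to TemplateTableLayout in the example above can be read as a grid: N separators along one axis bound N - 1 columns or rows. The sketch below derives the cell rectangles from such lists. It is an illustration of the geometry only, not the library's code: TableLayoutSketch and Cell are hypothetical, and the sketch assumes the first list holds vertical separators (x coordinates) and the second holds horizontal separators (y coordinates).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TableLayoutSketch {
    /** Hypothetical cell rectangle; not a GroupDocs.Parser type. */
    static final class Cell {
        final int row;
        final int column;
        final double x;
        final double y;
        final double width;
        final double height;

        Cell(int row, int column, double x, double y, double width, double height) {
            this.row = row;
            this.column = column;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Builds one cell for every pair of adjacent separators, row by row. */
    static List<Cell> cells(List<Double> vertical, List<Double> horizontal) {
        List<Cell> result = new ArrayList<>();
        for (int row = 0; row + 1 < horizontal.size(); row++) {
            for (int column = 0; column + 1 < vertical.size(); column++) {
                double x = vertical.get(column);
                double y = horizontal.get(row);
                double width = vertical.get(column + 1) - x;
                double height = horizontal.get(row + 1) - y;
                result.add(new Cell(row, column, x, y, width, height));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same separator values as in the example above
        List<Cell> cells = cells(
                Arrays.asList(50.0, 95.0, 275.0, 415.0, 485.0, 545.0),
                Arrays.asList(325.0, 340.0, 365.0, 395.0));
        Cell first = cells.get(0);
        System.out.println(cells.size() + " cells; first is " + first.width + " x " + first.height);
    }
}
```

Under these assumptions, six vertical and four horizontal separators describe a table with five columns and three rows.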

getTables()

public Iterable<PageTableArea> getTables()

Extracts tables from the document, detecting them automatically.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.

getTables(int pageIndex, PageTableAreaOptions options)

public Iterable<PageTableArea> getTables(int pageIndex, PageTableAreaOptions options)

Extracts tables from the document page.

The following example shows how to extract tables from the document page:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleInvoicePagesPdf)) {
     // Check if the document supports table extraction
     if (!parser.getFeatures().isTables()) {
         System.out.println("Document doesn't support table extraction.");
         return;
     }
     // Create the layout of tables
     TemplateTableLayout layout = new TemplateTableLayout(
             java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
             java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
     // Create the options for table extraction
     PageTableAreaOptions options = new PageTableAreaOptions(layout);
     // Get the document info
     IDocumentInfo documentInfo = parser.getDocumentInfo();
     // Check if the document has pages
     if (documentInfo.getPageCount() == 0) {
         System.out.println("Document has no pages.");
         return;
     }
     // Iterate over pages
     for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
         // Print a page number
         System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
         // Extract tables from the document page
         Iterable<PageTableArea> tables = parser.getTables(pageIndex, options);
         // Iterate over tables
         for (PageTableArea t : tables) {
             // Iterate over rows
             for (int row = 0; row < t.getRowCount(); row++) {
                 // Iterate over columns
                 for (int column = 0; column < t.getColumnCount(); column++) {
                     // Get the table cell
                     PageTableAreaCell cell = t.getCell(row, column);
                     if (cell != null) {
                         // Print the table cell text
                         System.out.print(cell.getText());
                         System.out.print(" | ");
                     }
                 }
                 System.out.println();
             }
             System.out.println();
         }
     }
 }

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

options	PageTableAreaOptions	The options for tables extraction.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.

getTables(int pageIndex)

public Iterable<PageTableArea> getTables(int pageIndex)

Extracts tables from the document page, detecting them automatically.

Parameters:

Parameter	Type	Description
pageIndex	int	The zero-based page index.

Returns: java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if table extraction isn’t supported.
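
Because getTables returns null rather than an empty collection when table extraction isn’t supported, a for-each loop over the result needs a null check first. The following is a minimal sketch of one way to handle that. The class NullSafe and the helper orEmpty are hypothetical names introduced for illustration and are not part of the library.

```java
import java.util.Collections;

public class NullSafe {
    // Hypothetical helper (not part of GroupDocs.Parser): substitutes an empty
    // Iterable for null so that a for-each loop can run without an explicit check.
    public static <T> Iterable<T> orEmpty(Iterable<T> maybeNull) {
        return maybeNull != null ? maybeNull : Collections.<T>emptyList();
    }

    // Counts the elements of an Iterable; present only to demonstrate the helper.
    public static <T> int count(Iterable<T> items) {
        int n = 0;
        for (T ignored : orEmpty(items)) {
            n++;
        }
        return n;
    }
}
```

With such a helper, a loop like `for (PageTableArea t : NullSafe.orEmpty(parser.getTables(pageIndex)))` is simply skipped for unsupported formats. It also hides the difference between "not supported" and "no tables found", so keep the explicit null check where that distinction matters.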

generateAdjustmentFields(GenerateTemplateOptions options)

public Iterable<TemplateItem> generateAdjustmentFields(GenerateTemplateOptions options)

Generates a collection of adjustment TemplateItems for the document.

This is the Java counterpart of the C# Parser.GenerateAdjustmentFields(GenerateTemplateOptions) API (commits d4e691b / ae9e98c). The result is intended as a starting-point template that the user can then trim or rename.

Algorithm

Pick the page from options.getPageIndex() (default 0).
If the document supports text-area extraction, call getTextAreas(int, PageTextAreaOptions) and emit one TemplateField per non-empty text area, named {prefix}_{n} (default prefix “Field”). Each field gets a TemplateFixedPosition at the text area’s rectangle, with the page’s width recorded as pageWidth so subsequent template scaling works.
If the document does not support text areas, return null.
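
The steps above can be sketched in plain Java as follows. This is an illustration only, not the library's implementation: Area and Field are hypothetical stand-ins for the text area and TemplateField types. Starting the numbering at 1 and treating whitespace-only text as empty are assumptions that the description does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjustmentFieldSketch {
    // Hypothetical stand-in for a text area: its text plus its rectangle on the page.
    public static final class Area {
        public final String text;
        public final double left, top, width, height;

        public Area(String text, double left, double top, double width, double height) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    // Hypothetical stand-in for a generated field: name, rectangle, and the page
    // width recorded alongside it so that the position can be rescaled later.
    public static final class Field {
        public final String name;
        public final Area area;
        public final double pageWidth;

        public Field(String name, Area area, double pageWidth) {
            this.name = name;
            this.area = area;
            this.pageWidth = pageWidth;
        }
    }

    // Emits one field per non-empty text area, named {prefix}_{n}.
    // A null list models "text areas are not supported" and yields null.
    public static List<Field> generate(List<Area> areas, String prefix, double pageWidth) {
        if (areas == null) {
            return null;
        }
        String effectivePrefix = (prefix == null || prefix.isEmpty()) ? "Field" : prefix;
        List<Field> fields = new ArrayList<>();
        int n = 1; // assumption: numbering starts at 1 and skips empty areas
        for (Area area : areas) {
            if (area.text == null || area.text.trim().isEmpty()) {
                continue; // assumption: whitespace-only text counts as empty
            }
            fields.add(new Field(effectivePrefix + "_" + n++, area, pageWidth));
        }
        return fields;
    }
}
```

In the real API the emitted items would be TemplateField instances carrying a TemplateFixedPosition; the stand-ins only approximate that shape.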

OCR caveat

The C# implementation runs OCR over the page preview to derive fields when the document is image-only. In Java, the OcrConnectorBase implementation is user-supplied, so OCR-driven generation is only meaningful if a connector is configured and its recognizeTextAreas(…) method returns proper rectangles. When options.getOcrOptions() is set and a connector is available, the OCR pipeline is preferred.
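
The preference order described here can be summarized as a small decision function. This is a sketch under stated assumptions: PipelineChoice, Pipeline, and choose are hypothetical names, and falling back to text areas when OCR is not usable is inferred from the Algorithm section rather than confirmed by it.

```java
public class PipelineChoice {
    // Hypothetical labels for the possible ways of generating fields.
    public enum Pipeline { OCR, TEXT_AREAS, UNSUPPORTED }

    // Prefers OCR only when OCR options are set AND a connector is available;
    // otherwise falls back to text areas if the format supports them.
    public static Pipeline choose(boolean ocrOptionsSet,
                                  boolean connectorAvailable,
                                  boolean textAreasSupported) {
        if (ocrOptionsSet && connectorAvailable) {
            return Pipeline.OCR;
        }
        if (textAreasSupported) {
            return Pipeline.TEXT_AREAS;
        }
        // Corresponds to the case where the method returns null.
        return Pipeline.UNSUPPORTED;
    }
}
```

Note that setting OCR options without configuring a connector does not select the OCR path in this sketch.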

Parameters:

Parameter	Type	Description
options	GenerateTemplateOptions	The generation options. May be null, in which case defaults are applied.

Returns: java.lang.Iterable<com.groupdocs.parser.templates.TemplateItem> - An iterable of generated TemplateItems, or null if not supported by this document format.

getWorksheetInfo()

public Iterable<WorksheetInfo> getWorksheetInfo()

Extracts the info about all worksheets in the spreadsheet.

Returns: java.lang.Iterable<com.groupdocs.parser.data.WorksheetInfo> - A list of WorksheetInfo instances that contain info about all worksheets in the spreadsheet.

getWorksheetInfo(int worksheetIndex)

public WorksheetInfo getWorksheetInfo(int worksheetIndex)

Extracts the info about the worksheet.

Parameters:

Parameter	Type	Description
worksheetIndex	int	The zero-based index of the worksheet.

Returns: WorksheetInfo - An instance of WorksheetInfo that contains the info about the worksheet.

getWorksheetCells(int worksheetIndex)

public Iterable<WorksheetCell> getWorksheetCells(int worksheetIndex)

Extracts worksheet cells.

Parameters:

Parameter	Type	Description
worksheetIndex	int	The zero-based index of the worksheet.

Returns: java.lang.Iterable<com.groupdocs.parser.data.WorksheetCell> - A collection of WorksheetCell instances that contains the cell data.

getWorksheetCells(int worksheetIndex, WorksheetOptions options)

public Iterable<WorksheetCell> getWorksheetCells(int worksheetIndex, WorksheetOptions options)

Extracts worksheet cells using customization options.

Parameters:

Parameter	Type	Description
worksheetIndex	int	The zero-based index of the worksheet.

options	WorksheetOptions	The worksheet extraction options.

Returns: java.lang.Iterable<com.groupdocs.parser.data.WorksheetCell> - A collection of WorksheetCell instances that contains the cell data.

parseByTemplate(Template template)

public DocumentData parseByTemplate(Template template)

Parses the document by the user-generated template.

Parameters:

Parameter	Type	Description
template	Template	The user-generated template.

Returns: DocumentData - An instance of DocumentData class that contains the extracted data; null if parsing by template isn’t supported.

parseByTemplate(Template template, ParseByTemplateOptions options)

public DocumentData parseByTemplate(Template template, ParseByTemplateOptions options)

Parses the document by the user-generated template with the supplied options.

If options.getPageIndex() is set, only that page is parsed; otherwise all pages are parsed.

OCR support depends on the OcrConnectorBase configured in ParserSettings.

Parameters:

Parameter	Type	Description
template	Template	The user-generated template.

options	ParseByTemplateOptions	The parse-by-template options.

Returns: DocumentData - An instance of DocumentData class that contains the extracted data; null if parsing by template isn’t supported.

parseByTemplate(TemplateCollection templates, ParseByTemplateOptions options)

public DocumentData parseByTemplate(TemplateCollection templates, ParseByTemplateOptions options)

Parses the document by automatically selecting the best-matching template from a collection.

Mirror of the C# Parser.ParseByTemplate(TemplateCollection, ParseByTemplateOptions) introduced in commit 71f3f21.

Selection heuristic (Java port): each candidate template is applied via parseByTemplate(Template, ParseByTemplateOptions), and the template that yields the highest number of populated fields (non-null page areas with non-empty text) wins. The C# implementation uses hidden-marker matching against an OCR pass; that path is not yet available in Java because hidden-field markers and the OCR-driven TemplatePageOcrParser aren’t ported, so this populated-field count is used as a transparent fallback. Callers that need C#-identical behavior should plug in a custom selector once the OCR engine is wired.
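
The populated-field heuristic can be illustrated with a self-contained sketch. It is not the library's code: TemplateSelectionSketch and its methods are hypothetical, extracted data is modeled as a map from field name to text, and the tie-breaking rule (the first candidate wins) and the rule that a candidate with no populated fields never wins are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TemplateSelectionSketch {
    // Counts fields whose extracted text is non-null and non-empty.
    public static int populatedCount(Map<String, String> extracted) {
        int count = 0;
        for (String text : extracted.values()) {
            if (text != null && !text.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Applies every candidate via `parse` and returns the candidate that yields
    // the most populated fields; returns null if no candidate populates any field.
    public static <T> T selectBest(List<T> candidates,
                                   Function<T, Map<String, String>> parse) {
        T best = null;
        int bestCount = 0;
        for (T candidate : candidates) {
            Map<String, String> result = parse.apply(candidate);
            if (result == null) {
                continue; // this candidate produced no result
            }
            int count = populatedCount(result);
            if (count > bestCount) { // strict comparison: the first of equal candidates wins
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Because every candidate is applied in full before one is chosen, the cost of this approach grows linearly with the size of the collection.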

Parameters:

Parameter	Type	Description
templates	TemplateCollection	A collection of candidate templates. The parser will pick the best-matching one.

options	ParseByTemplateOptions	Parse options. May be null.

Returns: DocumentData - The extracted data with DocumentData.getTemplate() set to the selected template, or null if the document does not support parsing-by-template or the collection is empty / produced no result.

parseForm()

public DocumentData parseForm()

Parses the document form.

The following example shows how to parse a form of the document:

// Create an instance of Parser class
 try (Parser parser = new Parser(Constants.SampleFormsPdf)) {
     // Extract data from PDF document
     DocumentData data = parser.parseForm();
     // Check if form extraction is supported
     if (data == null) {
         System.out.println("Form extraction isn't supported.");
         return;
     }
     // Iterate over extracted data
     for (int i = 0; i < data.getCount(); i++) {
         System.out.print(data.get(i).getName() + ": ");
         PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
                 ? (PageTextArea) data.get(i).getPageArea()
                 : null;
         System.out.println(area == null ? "Not a template field" : area.getText());
     }
 }

Returns: DocumentData - An instance of DocumentData class that contains the extracted data; null if form parsing isn’t supported.

getStructure()

public Document getStructure()

Extracts structured text from the document.

Learn more:

Extract text structure

Returns: org.w3c.dom.Document - An instance of org.w3c.dom.Document class with XML text structure; null if text structure extraction isn’t supported.
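
Since the result is a standard org.w3c.dom.Document, it can be explored with the JDK's DOM API alone. The sketch below prints an indented outline of element names, which is a convenient way to discover the tags a given document produces. StructureWalker is a hypothetical helper, and the tag names used with it (document, section, p) are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StructureWalker {
    // Returns one line per element, indented by two spaces per nesting level.
    // A null document (structure extraction not supported) yields an empty list.
    public static List<String> outline(Document structure) {
        List<String> lines = new ArrayList<>();
        if (structure != null) {
            walk(structure.getDocumentElement(), 0, lines);
        }
        return lines;
    }

    private static void walk(Node node, int depth, List<String> lines) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text nodes, comments, and so on
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        lines.add(line.append(node.getNodeName()).toString());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, lines);
        }
    }

    // Builds a Document from an XML string so the sketch can be tried without a parser instance.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }
}
```

In real use the argument to outline would be the value returned by parser.getStructure(); the parse helper exists only to feed the sketch sample input.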

close()

public void close()

Closes this resource, relinquishing any underlying resources.
