The following example shows how to extract data from an SQLite database:
// Create DbConnection object
java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
// Create an instance of Parser class to extract tables from the database
try (Parser parser = new Parser(connection)) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Check if toc extraction is supported
if (!parser.getFeatures().isToc()) {
System.out.println("Toc extraction isn't supported.");
return;
}
// Get a list of tables
Iterable<TocItem> toc = parser.getToc();
// Iterate over tables
for(TocItem i : toc)
{
// Print the table name
System.out.println(i.getText());
// Extract a table content as a text
try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
{
System.out.println(reader.readToEnd());
}
}
}
The following example shows how to extract emails from Exchange Server:
// Create the connection object for Exchange Web Services protocol
EmailConnection connection = new EmailEwsConnection(
"https://outlook.office365.com/ews/exchange.asmx",
"email@server",
"password");
// Create an instance of Parser class to extract emails from the remote server
try (Parser parser = new Parser(connection)) {
// Check if container extraction is supported
if (!parser.getFeatures().isContainer()) {
System.out.println("Container extraction isn't supported.");
return;
}
// Extract email messages from the server
Iterable<ContainerItem> emails = parser.getContainer();
// Iterate over email messages
for (ContainerItem item : emails) {
// Create an instance of Parser class for email message
try (Parser emailParser = item.openParser()) {
// Extract the email text
try (TextReader reader = emailParser.getText()) {
// Print the email text
System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
}
}
}
}
The following example shows how to load the document from the local disk:
// Set the filePath
String filePath = Constants.SamplePdf;
// Create an instance of Parser class with the filePath
try (Parser parser = new Parser(filePath)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| filePath | java.lang.String | The path to the file. |
Parser(String filePath, LoadOptions loadOptions)
public Parser(String filePath, LoadOptions loadOptions)
The document password is passed via the LoadOptions class:
try {
String password = "123456";
// Create an instance of Parser class with the password:
try (Parser parser = new Parser(Constants.SamplePassword, new LoadOptions(password))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Print the document text
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
} catch (InvalidPasswordException ex) {
// Print the message if the password is incorrect or empty
System.out.println("Invalid password");
}
The following example shows how to load the document from the stream:
// Create the stream
try (InputStream stream = new FileInputStream(Constants.SamplePdf)) {
// Create an instance of Parser class with the stream
try (Parser parser = new Parser(stream)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
}
In some cases it’s necessary to specify the FileFormat explicitly, both for special cases (databases, email servers)
and when the file type has to be detected from the content:
// Create an instance of Parser class for markdown document
try (Parser parser = new Parser(stream, new LoadOptions(FileFormat.Markup))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
try (TextReader reader = parser.getText()) {
// Print the document text
// Markdown is detected; text without special symbols is printed
System.out.println(reader.readToEnd());
}
}
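Detection by content usually relies on signature bytes at the start of a file. The sketch below is a JDK-only illustration of that idea, not the library's detection logic; `ContentSniffer` and `sniff` are hypothetical names. Plain-text formats such as Markdown carry no such signature, which may be why passing an explicit format is useful for them.

```java
import java.nio.charset.StandardCharsets;

class ContentSniffer {
    // Returns a coarse type name based on well-known leading signature bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF")) {
            return "pdf";
        }
        if (startsWith(head, "PK")) {
            return "zip";
        }
        // No recognizable signature: plain-text formats end up here.
        return "unknown";
    }

    // Checks whether the data begins with the ASCII bytes of the given signature.
    private static boolean startsWith(byte[] data, String signature) {
        byte[] sig = signature.getBytes(StandardCharsets.US_ASCII);
        if (data.length < sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (data[i] != sig[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(sniff("# Heading".getBytes(StandardCharsets.US_ASCII)));
    }
}
```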
The parser settings which are used to customize data extraction.
getFileInfo(String filePath)
public static FileInfo getFileInfo(String filePath)
Returns the general information about a file.
The following code shows how to check whether a file is password-protected:
// Get a file info
FileInfo info = Parser.getFileInfo(filePath);
// Check IsEncrypted property
System.out.println(info.isEncrypted() ? "Password is required" : "");
getFileInfo(InputStream document)
public static FileInfo getFileInfo(InputStream document)
Returns the general information about a file.
The following code shows how to check whether a file is password-protected:
// Get a file info
FileInfo info = Parser.getFileInfo(filePath);
// Check IsEncrypted property
System.out.println(info.isEncrypted() ? "Password is required" : "");
If the feature isn’t supported, the method returns null instead of the value. Some operations may take
significant time, so calling the method just to check whether the feature is supported isn’t optimal.
Use the Features property for this purpose:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleZip)) {
// Check if text extraction is supported for the document
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported");
return;
}
// Extract a text from the document
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
Returns: Features - An instance of Features class that represents the supported features.
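The reasoning above (check the capability flag first, because the extraction call itself can be slow) can be illustrated without the library. The sketch below uses a hypothetical `FakeDocument` stand-in, not the Parser API, and counts how often the expensive call actually runs.

```java
class FeatureCheckDemo {
    /** Hypothetical stand-in for a parsed document; not part of GroupDocs.Parser. */
    static class FakeDocument {
        private final boolean textSupported;
        int expensiveCalls = 0; // how many times extraction actually ran

        FakeDocument(boolean textSupported) {
            this.textSupported = textSupported;
        }

        // Cheap capability flag, analogous in spirit to a feature check.
        boolean isTextSupported() {
            return textSupported;
        }

        // Simulated costly extraction; returns null when unsupported.
        String extractText() {
            expensiveCalls++;
            return textSupported ? "document text" : null;
        }
    }

    // Consults the flag first so unsupported documents never pay for extraction.
    static String describe(FakeDocument doc) {
        if (!doc.isTextSupported()) {
            return "Text extraction isn't supported";
        }
        return doc.extractText();
    }

    public static void main(String[] args) {
        FakeDocument unsupported = new FakeDocument(false);
        System.out.println(describe(unsupported));
        System.out.println("Expensive calls: " + unsupported.expensiveCalls);
    }
}
```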
getPagePreview(int pageIndex)
public OutputStream getPagePreview(int pageIndex)
Generates a document page preview.
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.io.OutputStream - An instance of java.io.OutputStream containing an image of the document page; null if the page preview generation isn’t supported.
generatePreview(PreviewOptions previewOptions)
public void generatePreview(PreviewOptions previewOptions)
The following example shows how to get document info:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Get the document info
IDocumentInfo info = parser.getDocumentInfo();
// Print document information
System.out.println(String.format("FileType: %s", info.getFileType()));
System.out.println(String.format("PageCount: %d", info.getPageCount()));
System.out.println(String.format("Size: %d", info.getSize()));
}
The following example shows how to extract a text from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
Returns: TextReader - An instance of TextReader class with the extracted text; null if text extraction isn’t supported.
getText(TextOptions options)
public TextReader getText(TextOptions options)
Extracts a text from the document using text options (to enable raw fast text extraction mode).
The following example shows how to extract a raw text from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Extract a raw text into the reader
try (TextReader reader = parser.getText(new TextOptions(true))) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
The following example shows how to extract a text from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Check if the document supports text extraction
if (!parser.getFeatures().isText()) {
System.out.println("Document doesn't support text extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
// Extract a text into the reader
try (TextReader reader = parser.getText(p)) {
// Print a text from the document
// We ignore null-checking as we have checked text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns: TextReader - An instance of TextReader class with the extracted text; null if text page extraction isn’t supported.
getText(int pageIndex, TextOptions options)
public TextReader getText(int pageIndex, TextOptions options)
Extracts a text from the document page using text options (to enable raw fast text extraction mode).
The following example shows how to extract a raw text from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Check if the document supports text extraction
if (!parser.getFeatures().isText()) {
System.out.println("Document doesn't support text extraction.");
return;
}
// Get the document info
DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
? (DocumentInfo) parser.getDocumentInfo()
: null;
// Check if the document has pages
if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
// Extract a text into the reader
try (TextReader reader = parser.getText(p, new TextOptions(true))) {
// Print a text from the document
// We ignore null-checking as we have checked text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
The following example shows how to extract a document text as HTML text:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Extract a formatted text into the reader
try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
// Print a formatted text from the document
// If formatted text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
}
}
The following example shows how to extract a document page text as Markdown text:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Check if the document supports formatted text extraction
if (!parser.getFeatures().isFormattedText()) {
System.out.println("Document doesn't support formatted text extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
// Extract a formatted text into the reader
try (TextReader reader = parser.getFormattedText(p, new FormattedTextOptions(FormattedTextMode.Markdown))) {
// Print a formatted text from the document
// We ignore null-checking as we have checked formatted text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
The following example shows how to find a keyword in a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search a keyword:
Iterable<SearchResult> sr = parser.search("lorem");
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index and found text:
System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| keyword | java.lang.String | The keyword to search. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.SearchResult> - A collection of SearchResult objects; null if search isn’t supported.
search(String keyword, SearchOptions options)
public Iterable<SearchResult> search(String keyword, SearchOptions options)
Searches a keyword in the document using search options (regular expression, match case, etc.).
The following example shows how to search with a regular expression in a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search with a regular expression with case matching
Iterable<SearchResult> sr = parser.search("[0-9]+", new SearchOptions(true, false, true));
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index and found text:
System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
}
}
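To see what the expression `[0-9]+` from the example above finds, it can be run against plain text with java.util.regex. This is a JDK-only sketch, not the library's search implementation; `RegexSearchSketch`, `find`, and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSearchSketch {
    // Returns one "At <index>: <text>" entry per match, in order of appearance.
    static List<String> find(String text, String regex, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            // start() is the zero-based character index of the match
            hits.add(String.format("At %d: %s", matcher.start(), matcher.group()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints "At 6: 42" and "At 24: 7"
        for (String hit : find("Order 42 shipped on day 7", "[0-9]+", true)) {
            System.out.println(hit);
        }
    }
}
```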
The following example shows how to search a text on pages:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search a keyword with page numbers
Iterable<SearchResult> sr = parser.search("lorem", new SearchOptions(false, false, false, true));
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index, page number and found text:
System.out.println(String.format("At %d (%d): %s", s.getPosition(), s.getPageIndex(), s.getText()));
}
}
The following example shows how to extract table of contents from EPUB file:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleEpub)) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Check if toc extraction is supported
if (!parser.getFeatures().isToc()) {
System.out.println("Toc extraction isn't supported.");
return;
}
// Get table of contents
Iterable<TocItem> toc = parser.getToc();
// Iterate over items
for (TocItem i : toc) {
// Print the Toc text
System.out.println(i.getText());
// Check if page index has a value
if (i.getPageIndex() == null) {
continue;
}
// Extract a page text
try (TextReader reader = parser.getText(i.getPageIndex())) {
System.out.println(reader.readToEnd());
}
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.TocItem> - A collection of table of contents items; null if table of contents extraction isn’t supported.
The following example shows how to extract metadata from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Extract metadata from the document
Iterable<MetadataItem> metadata = parser.getMetadata();
// Check if metadata extraction is supported
if (metadata == null) {
System.out.println("Metadata extraction isn't supported");
return;
}
// Iterate over metadata items
for (MetadataItem item : metadata) {
// Print an item name and value
System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.MetadataItem> - A collection of metadata items; null if metadata extraction isn’t supported.
getContainer()
public Iterable<ContainerItem> getContainer()
Extracts a container object from the document to work with formats that contain attachments, ZIP archives etc.
The following example shows how to extract all text areas from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Extract text areas
Iterable<PageTextArea> areas = parser.getTextAreas();
// Check if text areas extraction is supported
if (areas == null) {
System.out.println("Page text areas extraction isn't supported");
return;
}
// Iterate over page text areas
for (PageTextArea a : areas) {
// Print a page index, rectangle and text area value:
System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.
getTextAreas(PageTextAreaOptions options)
public Iterable<PageTextArea> getTextAreas(PageTextAreaOptions options)
Extracts text areas from the document using customization options (regular expression, match case, etc.).
The following example shows how to extract only text areas with digits from the upper-left corner:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Create the options which are used for text area extraction
PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));
// Extract text areas which contain only digits from the upper-left corner of a page:
Iterable<PageTextArea> areas = parser.getTextAreas(options);
// Check if text areas extraction is supported
if (areas == null) {
System.out.println("Page text areas extraction isn't supported");
return;
}
// Iterate over page text areas
for (PageTextArea a : areas) {
// Print a page index, rectangle and text area value:
System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.
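The expression passed to PageTextAreaOptions in the example above can be tried in isolation with java.util.regex. This is only a sketch: how the library applies the expression to a text area is not stated here, and `TextAreaPatternSketch` and the `DIGITS` pattern are hypothetical additions for comparison. As written, the expression matches two lowercase letters surrounded by whitespace rather than digits.

```java
import java.util.regex.Pattern;

class TextAreaPatternSketch {
    // Same expression as in the options above: two lowercase letters between whitespace.
    static final Pattern TWO_LETTERS = Pattern.compile("\\s[a-z]{2}\\s");
    // Hypothetical alternative that matches runs of digits.
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // True if the pattern occurs anywhere in the text.
    static boolean hasMatch(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMatch(TWO_LETTERS, "see pp 12")); // true
        System.out.println(hasMatch(TWO_LETTERS, "12345"));     // false
        System.out.println(hasMatch(DIGITS, "12345"));          // true
    }
}
```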
getTextAreas(int pageIndex)
public Iterable<PageTextArea> getTextAreas(int pageIndex)
To extract text areas from a document page the following method is used:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Check if the document supports text areas extraction
if (!parser.getFeatures().isTextAreas()) {
System.out.println("Document doesn't support text areas extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Iterate over page text areas
// We ignore null-checking as we have checked text areas extraction feature support earlier
for (PageTextArea a : parser.getTextAreas(pageIndex)) {
// Print a rectangle and text area value:
System.out.println(String.format("R: %s, Text: %s", a.getRectangle(), a.getText()));
}
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTextArea> - A collection of PageTextArea objects; null if text areas extraction isn’t supported.
The following example shows how to extract all images from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Extract images
Iterable<PageImageArea> images = parser.getImages();
// Check if images extraction is supported
if (images == null) {
System.out.println("Images extraction isn't supported");
return;
}
// Iterate over images
for (PageImageArea image : images) {
// Print a page index, rectangle and image type:
System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.
getImages(PageAreaOptions options)
public Iterable<PageImageArea> getImages(PageAreaOptions options)
Extracts images from the document using customization options (to set the rectangular area that contains images).
The following example shows how to extract only images from the upper-left corner:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Create the options which are used for images extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(340, 150), new Size(300, 100)));
// Extract images from the upper-left corner of a page:
Iterable<PageImageArea> images = parser.getImages(options);
// Check if images extraction is supported
if (images == null) {
System.out.println("Page images extraction isn't supported");
return;
}
// Iterate over images
for (PageImageArea image : images) {
// Print a page index, rectangle and image type:
System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.
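A rectangle given as an origin point plus a size covers the range from the origin to origin + size on each axis. The sketch below works through that arithmetic for the values used in the example above. It is plain Java with a hypothetical `AreaSketch` class, not the library's Rectangle type, and whether the library selects items by containment or by intersection is an assumption left open here, so both checks are shown.

```java
class AreaSketch {
    final double left, top, width, height;

    AreaSketch(double left, double top, double width, double height) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
    }

    double right() {
        return left + width;
    }

    double bottom() {
        return top + height;
    }

    // True if the other area lies entirely inside this one.
    boolean contains(AreaSketch other) {
        return other.left >= left && other.top >= top
                && other.right() <= right() && other.bottom() <= bottom();
    }

    // True if the two areas overlap at all.
    boolean intersects(AreaSketch other) {
        return other.left < right() && other.right() > left
                && other.top < bottom() && other.bottom() > top;
    }

    public static void main(String[] args) {
        // Origin (340, 150) with size 300 x 100, as in the example above
        AreaSketch area = new AreaSketch(340, 150, 300, 100);
        System.out.println("x: " + area.left + " to " + area.right());
        System.out.println("y: " + area.top + " to " + area.bottom());
    }
}
```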
getImages(int pageIndex)
public Iterable<PageImageArea> getImages(int pageIndex)
To extract images from a document page the following method is used:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Check if the document supports images extraction
if (!parser.getFeatures().isImages()) {
System.out.println("Document doesn't support images extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Iterate over images
// We ignore null-checking as we have checked images extraction feature support earlier
for (PageImageArea image : parser.getImages(pageIndex)) {
// Print a rectangle and image type
System.out.println(String.format("R: %s, Type: %s", image.getRectangle(), image.getFileType()));
}
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.
getImages(int pageIndex, PageAreaOptions options)
public Iterable<PageImageArea> getImages(int pageIndex, PageAreaOptions options)
Extracts images from the document page using customization options (to set the rectangular area that contains images).
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageImageArea> - A collection of PageImageArea objects; null if images extraction isn’t supported.
getHyperlinks()
public Iterable<PageHyperlinkArea> getHyperlinks()
Extracts hyperlinks from the document.
The following example shows how to extract all hyperlinks from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Extract hyperlinks from the document
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks();
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.
getHyperlinks(int pageIndex)
public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex)
Extracts hyperlinks from the document page.
The following example shows how to extract hyperlinks from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Extract hyperlinks from the document page
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
}
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.
getHyperlinks(PageAreaOptions options)
public Iterable<PageHyperlinkArea> getHyperlinks(PageAreaOptions options)
Extracts hyperlinks from the document using customization options (to set the rectangular area that contains hyperlinks).
The following example shows how to extract hyperlinks from the document page area:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.HyperlinksPdf)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Create the options which are used for hyperlink extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
// Extract hyperlinks from the document page area
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(options);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.
getHyperlinks(int pageIndex, PageAreaOptions options)
public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex, PageAreaOptions options)
Extracts hyperlinks from the document page using customization options (to set the rectangular area that contains hyperlinks).
The following example shows how to extract hyperlinks from the document page area using customization options:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Create the options which are used for hyperlink extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Extract hyperlinks from the document page
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex, options);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageHyperlinkArea> - A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn’t supported.
getBarcodes()
public Iterable<PageBarcodeArea> getBarcodes()
Extracts barcodes from the document.
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.
getBarcodes(int pageIndex)
public Iterable<PageBarcodeArea> getBarcodes(int pageIndex)
Extracts barcodes from the document page.
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.
getBarcodes(BarcodeOptions options)
public Iterable<PageBarcodeArea> getBarcodes(BarcodeOptions options)
Extracts barcodes from the document using customization options (to set the rectangular area that contains barcodes).
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageBarcodeArea> - A collection of PageBarcodeArea objects; null if barcodes extraction isn’t supported.
getTables(PageTableAreaOptions options)
public Iterable<PageTableArea> getTables(PageTableAreaOptions options)
Extracts tables from the document.
The following example shows how to extract tables from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports table extraction
if (!parser.getFeatures().isTables()) {
System.out.println("Document doesn't support table extraction.");
return;
}
// Create the layout of tables
TemplateTableLayout layout = new TemplateTableLayout(
java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
// Create the options for table extraction
PageTableAreaOptions options = new PageTableAreaOptions(layout);
// Extract tables from the document
Iterable<PageTableArea> tables = parser.getTables(options);
// Iterate over tables
for (PageTableArea t : tables) {
// Iterate over rows
for (int row = 0; row < t.getRowCount(); row++) {
// Iterate over columns
for (int column = 0; column < t.getColumnCount(); column++) {
// Get the table cell
PageTableAreaCell cell = t.getCell(row, column);
if (cell != null) {
// Print the table cell text
System.out.print(cell.getText());
System.out.print(" | ");
}
}
System.out.println();
}
System.out.println();
}
}
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.
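The layout in the example above is built from two lists of coordinates. The sketch below illustrates how such separator lists can be read as a grid, assuming the first list holds the x coordinates of the column boundaries and the second holds the y coordinates of the row boundaries (the example doesn't name the constructor arguments, so this reading is an assumption). SeparatorGrid and spans are hypothetical names and don't use the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; it does not use the GroupDocs.Parser classes.
final class SeparatorGrid {
    private SeparatorGrid() {
    }

    // Converts N ascending separator coordinates into N-1 [start, end] spans.
    static List<double[]> spans(List<Double> separators) {
        List<double[]> result = new ArrayList<>();
        for (int i = 0; i + 1 < separators.size(); i++) {
            result.add(new double[] {separators.get(i), separators.get(i + 1)});
        }
        return result;
    }

    public static void main(String[] args) {
        // The same coordinates as in the table extraction example.
        List<Double> vertical = Arrays.asList(50.0, 95.0, 275.0, 415.0, 485.0, 545.0);
        List<Double> horizontal = Arrays.asList(325.0, 340.0, 365.0, 395.0);

        List<double[]> columns = spans(vertical);
        List<double[]> rows = spans(horizontal);

        // Six vertical and four horizontal separators give 5 columns and 3 rows.
        System.out.println(columns.size() + " columns x " + rows.size() + " rows");

        // Print the bounds of every cell, row by row.
        for (double[] row : rows) {
            for (double[] column : columns) {
                System.out.printf("[x %.0f-%.0f, y %.0f-%.0f] ", column[0], column[1], row[0], row[1]);
            }
            System.out.println();
        }
    }
}
```

Under this reading, the row and column indexes passed to t.getCell(row, column) correspond to positions in this grid.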
getTables()
public Iterable<PageTableArea> getTables()
Extracts tables from the document, detecting them automatically.
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.
getTables(int pageIndex)
public Iterable<PageTableArea> getTables(int pageIndex)
Extracts tables from the document page, detecting them automatically.
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| pageIndex | int | The zero-based page index. |
Returns:
java.lang.Iterable<com.groupdocs.parser.data.PageTableArea> - A collection of PageTableArea objects; null if tables extraction isn’t supported.
generateAdjustmentFields(GenerateTemplateOptions options)
public Iterable<TemplateItem> generateAdjustmentFields(GenerateTemplateOptions options)
Generates a collection of adjustment TemplateItems for the document.
Java mirror of the C# Parser.GenerateAdjustmentFields(GenerateTemplateOptions) API (commits d4e691b / ae9e98c). Useful as a starting-point template that the user can then trim or rename.
Algorithm
1. Pick the page from options.getPageIndex() (default 0).
2. If the document supports text-area extraction, call getTextAreas(int, com.groupdocs.parser.options.PageTextAreaOptions) and emit one TemplateField per non-empty text area, named {prefix}_{n} (default prefix “Field”). Each field gets a TemplateFixedPosition at the text area’s rectangle, with the page’s width recorded as pageWidth so subsequent template scaling works.
3. If the document does not support text areas, return null.
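The steps above can be sketched with simplified stand-in types. This is an illustration under stated assumptions, not the library's implementation: TextArea, Field, and generate are hypothetical names, the counter n is assumed to start at 1 and to count only emitted fields, and whitespace-only text is assumed to count as empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types for illustration; these are NOT the GroupDocs.Parser classes.
final class AdjustmentFieldSketch {
    private AdjustmentFieldSketch() {
    }

    // Simplified text area: the extracted text plus its rectangle on the page.
    static final class TextArea {
        final String text;
        final double x, y, width, height;

        TextArea(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Simplified generated field: a name, a fixed rectangle, and the page width kept for scaling.
    static final class Field {
        final String name;
        final double x, y, width, height, pageWidth;

        Field(String name, TextArea area, double pageWidth) {
            this.name = name;
            this.x = area.x;
            this.y = area.y;
            this.width = area.width;
            this.height = area.height;
            this.pageWidth = pageWidth;
        }
    }

    // Emits one field per non-empty text area; returns null when text areas are unavailable.
    static List<Field> generate(List<TextArea> areas, String prefix, double pageWidth) {
        if (areas == null) {
            return null; // text-area extraction is not supported
        }
        String effectivePrefix = (prefix == null || prefix.isEmpty()) ? "Field" : prefix;
        List<Field> fields = new ArrayList<>();
        int n = 0; // assumption: numbering starts at 1 and skips empty areas
        for (TextArea area : areas) {
            if (area.text == null || area.text.trim().isEmpty()) {
                continue; // assumption: whitespace-only text counts as empty
            }
            n++;
            fields.add(new Field(effectivePrefix + "_" + n, area, pageWidth));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<TextArea> areas = Arrays.asList(
                new TextArea("Invoice", 10, 10, 80, 12),
                new TextArea("   ", 10, 30, 80, 12),
                new TextArea("Total", 10, 50, 80, 12));
        for (Field field : generate(areas, null, 595.0)) {
            System.out.println(field.name + " at x=" + field.x + ", y=" + field.y);
        }
    }
}
```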
OCR caveat
The C# implementation runs OCR over the page preview to derive fields when the document is image-only. Java’s OcrConnectorBase is user-supplied, so OCR-driven generation is only meaningful if a connector is configured AND the connector’s recognizeTextAreas(…) returns proper rectangles. When options.getOcrOptions() is set and a connector is available, the OCR pipeline is preferred.
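Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| options | GenerateTemplateOptions | The generation options. May be null (defaults applied). |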
Returns:
java.lang.Iterable<com.groupdocs.parser.templates.TemplateItem> - An iterable of generated TemplateItems, or null if not supported by this document format.
getWorksheetInfo()
public Iterable<WorksheetInfo> getWorksheetInfo()
Extracts the info about all worksheets in the spreadsheet.
Returns:
java.lang.Iterable<com.groupdocs.parser.data.WorksheetInfo> - A collection of WorksheetInfo instances that contain information about all worksheets in the spreadsheet.
getWorksheetInfo(int worksheetIndex)
public WorksheetInfo getWorksheetInfo(int worksheetIndex)
parseByTemplate(TemplateCollection templates, ParseByTemplateOptions options)
public DocumentData parseByTemplate(TemplateCollection templates, ParseByTemplateOptions options)
Parses the document by automatically selecting the best-matching template from a collection.
Mirror of the C# Parser.ParseByTemplate(TemplateCollection, ParseByTemplateOptions) introduced in commit 71f3f21.
Selection heuristic (Java port): each candidate template is applied via parseByTemplate(Template, com.groupdocs.parser.options.ParseByTemplateOptions), and the template that yields the highest number of populated fields (non-null page areas with non-empty text) wins. The C# implementation uses hidden-marker matching against an OCR pass; that path is not yet available in Java because hidden-field markers and the OCR-driven TemplatePageOcrParser aren’t ported, so this populated-field count is used as a transparent fallback. Callers that need C#-identical behavior should plug in a custom selector once the OCR engine is wired.
Returns:
DocumentData - The extracted data with DocumentData.getTemplate() set to the selected template, or null if the document does not support parsing-by-template or the collection is empty / produced no result.
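The populated-field heuristic described for this method can be sketched as follows. This is a simplified illustration, not the library's code: each candidate template's result is modeled as a list of extracted field texts (null where no page area was found), a null list models a template that produced no result, and ties are assumed to go to the earliest candidate. TemplateSelectionSketch, populatedCount, and selectBest are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; it does not use or reproduce the GroupDocs.Parser classes.
final class TemplateSelectionSketch {
    private TemplateSelectionSketch() {
    }

    // Counts fields whose extracted text is neither null nor empty.
    static int populatedCount(List<String> fieldTexts) {
        int count = 0;
        for (String text : fieldTexts) {
            if (text != null && !text.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Returns the index of the result with the most populated fields, or -1 when
    // there are no candidates or every candidate produced no result (null).
    // Assumption: on a tie, the earliest candidate wins.
    static int selectBest(List<List<String>> resultsPerTemplate) {
        int bestIndex = -1;
        int bestCount = -1;
        for (int i = 0; i < resultsPerTemplate.size(); i++) {
            List<String> result = resultsPerTemplate.get(i);
            if (result == null) {
                continue; // this template produced no result
            }
            int count = populatedCount(result);
            if (count > bestCount) {
                bestCount = count;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        List<List<String>> results = Arrays.asList(
                Arrays.asList("ACME Ltd", null, ""),               // 1 populated field
                Arrays.asList("ACME Ltd", "2024-01-31", "100.00"), // 3 populated fields
                null);                                             // no result
        System.out.println("Selected template index: " + selectBest(results));
    }
}
```

A return value of -1 in this sketch corresponds to the null result described above for an empty collection or a collection that produced no result.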
parseForm()
public DocumentData parseForm()
The following example shows how to parse a form of the document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleFormsPdf)) {
// Extract data from PDF document
DocumentData data = parser.parseForm();
// Check if form extraction is supported
if (data == null) {
System.out.println("Form extraction isn't supported.");
return;
}
// Iterate over extracted data
for (int i = 0; i < data.getCount(); i++) {
System.out.print(data.get(i).getName() + ": ");
PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
? (PageTextArea) data.get(i).getPageArea()
: null;
System.out.println(area == null ? "Not a template field" : area.getText());
}
}
Returns:
DocumentData - An instance of DocumentData class that contains the extracted data; null if parsing by template isn’t supported.