PDF Extractor

Extract Text

Extract text from PDF documents accurately with Documentize's .NET tools: retrieve, process, and analyze content effortlessly.

Extract Images

Extract images from PDF documents within .NET applications.

Extract Properties / Metadata

Extract Metadata from PDFs accurately with Documentize using C#/.NET

Export Form Data

Extract and export data from PDF forms (AcroForms) into other formats like CSV using C#/.NET

Extract Text

The Documentize PDF Extractor for .NET simplifies extracting text from PDF documents. Whether you need pure, raw, or flattened text, this plugin extracts it efficiently and lets you preserve or omit the original formatting.

How to Extract Text from PDF file

To extract text from a PDF file, follow these steps:

Create an instance of ExtractTextOptions to set the input file path.
Run the Extract method of PdfExtractor to extract the text.

// Create ExtractTextOptions object to set input file path
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf");
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);

How to Extract Text from PDF stream

To extract text from a PDF stream, follow these steps:

Create an instance of ExtractTextOptions to set the input stream.
Run the Extract method of PdfExtractor to extract the text.

// Open the PDF as a stream; the using block disposes it after extraction
using (var stream = File.OpenRead("path_to_your_pdf_file.pdf"))
{
    // Create ExtractTextOptions object to set input stream
    var options = new ExtractTextOptions(stream);
    // Perform the process and get the extracted text
    var textExtracted = PdfExtractor.Extract(options);
}

Text Extraction Modes

The ExtractTextOptions offers three extraction modes, providing flexibility based on your needs.

Pure Mode: Preserves the original formatting, including spaces and alignment.
Raw Mode: Extracts the text without formatting, useful for raw data processing.
Flatten Mode: Represents PDF content as text fragments positioned by their coordinates.

// Create ExtractTextOptions object to set input file path and TextFormattingMode
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure);
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);
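To see how the three modes differ on a particular document, it can help to run the same file through each of them and compare the results. The following is a minimal sketch, not a definitive implementation. It assumes that the types live in a namespace called Documentize, that TextFormattingMode has members named Raw and Flatten alongside the Pure member shown above, and that Extract returns the text as a string.

```csharp
using System;
using Documentize; // assumed namespace for PdfExtractor and its option types

// Assumption: Raw and Flatten are enum members next to Pure
var modes = new[]
{
    TextFormattingMode.Pure,
    TextFormattingMode.Raw,
    TextFormattingMode.Flatten
};

foreach (var mode in modes)
{
    // Same input file, different formatting mode on each pass
    var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", mode);

    // Assumption: Extract returns the extracted text as a string
    string text = PdfExtractor.Extract(options);

    Console.WriteLine($"{mode}: {text.Length} characters");
}
```

Comparing lengths is only a rough signal; printing the first few lines of each result usually shows the layout differences more clearly.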

How to Extract Text from PDF file in a single statement

// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure));

Key Features:

Pure Mode: Extract text while preserving its original formatting.
Raw Mode: Extract text without any formatting.
Flatten Mode: Extract text as fragments positioned by their coordinates.

Extract Images

The Documentize PDF Extractor for .NET plugin enables you to effortlessly extract images from PDF documents. It scans your PDF files, identifies embedded images, and extracts them while maintaining their original quality and format. This tool enhances the accessibility of visual content and streamlines the process of retrieving images from PDFs.

How to Extract Images from a PDF

To extract images from a PDF file, follow these steps:

Create an instance of the ExtractImagesOptions class.
Add the input file path to the options using the AddInput method.
Set the output directory path for images using the AddOutput method.
Run the Extract method of PdfExtractor to perform the image extraction.
Retrieve the extracted images from the result container.

// Create ExtractImagesOptions to set instructions
var options = new ExtractImagesOptions();
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output directory path
options.AddOutput(new DirectoryData("path_to_results_directory"));
// Perform the process
var results = PdfExtractor.Extract(options);
// Get path to image result
var imageExtracted = results.ResultCollection[0].ToFile();

Extracting Images from PDF File to Streams without folder

The PdfExtractor plugin supports saving to streams, which allows you to extract images from PDF files into streams without using temporary folders.

// Create ExtractImagesOptions to set instructions
var options = new ExtractImagesOptions();
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// No output is set, so the results are written to streams
// Perform the process
var results = PdfExtractor.Extract(options);
// Get the first extracted image as a stream
var ms = results.ResultCollection[0].ToStream();
// Copy data to a file for demonstration
ms.Seek(0, SeekOrigin.Begin);
using (var fs = File.Create("test_file.png"))
{
    ms.CopyTo(fs);
}
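The example above writes the stream to a file named test_file.png, but an image embedded in a PDF is not necessarily a PNG. One way to choose a matching file extension is to inspect the first bytes of the stream. The helper below is a hypothetical sketch that uses only the .NET base library; ImageFormatSniffer and GuessExtension are illustrative names, not part of Documentize, and the helper assumes the stream is seekable.

```csharp
using System;
using System.IO;

public static class ImageFormatSniffer
{
    // Guesses a file extension from the stream's leading bytes.
    // Returns ".bin" when the signature is not recognised.
    public static string GuessExtension(Stream stream)
    {
        var header = new byte[8];
        stream.Seek(0, SeekOrigin.Begin);
        int read = stream.Read(header, 0, header.Length);
        // Rewind so the caller can still copy the whole image
        stream.Seek(0, SeekOrigin.Begin);

        // PNG signature starts with 89 50 4E 47
        if (read >= 8 && header[0] == 0x89 && header[1] == 0x50 &&
            header[2] == 0x4E && header[3] == 0x47)
            return ".png";

        // JPEG signature starts with FF D8 FF
        if (read >= 3 && header[0] == 0xFF && header[1] == 0xD8 && header[2] == 0xFF)
            return ".jpg";

        // GIF signature starts with the ASCII text "GIF8"
        if (read >= 4 && header[0] == 0x47 && header[1] == 0x49 &&
            header[2] == 0x46 && header[3] == 0x38)
            return ".gif";

        // BMP signature starts with the ASCII text "BM"
        if (read >= 2 && header[0] == 0x42 && header[1] == 0x4D)
            return ".bmp";

        return ".bin";
    }
}
```

With the stream obtained from ToStream() in the example above, the output name could be built as "test_file" + ImageFormatSniffer.GuessExtension(ms) before calling File.Create.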

Key Features:

Extract Embedded Images: Identify and extract images from PDF documents.
Preserve Image Quality: Ensures extracted images retain their original quality.
Flexible Output: Save extracted images in your preferred format or location.

Extract Properties / Metadata

The Documentize PDF Extractor for .NET simplifies extracting metadata from PDF documents. The available properties are FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, and Number of Pages.

How to Extract Metadata from PDF file

This example shows how to extract properties such as Title, Author, Subject, Keywords, and Number of Pages from a PDF file. To extract metadata from a PDF document, follow these steps:

Create an instance of ExtractPropertiesOptions to configure the extraction options and input PDF file.
Run the Extract method of PdfExtractor to extract the metadata.
Access the extracted properties through the returned PdfProperties object.

// Create ExtractPropertiesOptions object to set input file
var options = new ExtractPropertiesOptions("path_to_your_pdf_file.pdf");
// Perform the process and get Properties
var pdfProperties = PdfExtractor.Extract(options);
var filename = pdfProperties.FileName;
var title = pdfProperties.Title;
var author = pdfProperties.Author;
var subject = pdfProperties.Subject;
var keywords = pdfProperties.Keywords;
var created = pdfProperties.Created;
var modified = pdfProperties.Modified;
var application = pdfProperties.Application;
var pdfProducer = pdfProperties.PdfProducer;
var numberOfPages = pdfProperties.NumberOfPages;
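Many PDF files leave fields such as Title or Author empty, so code that displays metadata usually needs a fallback. The helper below is a hypothetical sketch of one way to build a one-line summary from the values read above. PdfSummary and Describe are illustrative names, not part of Documentize, and the sketch assumes that FileName, Title, and Author are strings and that NumberOfPages is an int.

```csharp
using System;

public static class PdfSummary
{
    // Builds a one-line description, substituting placeholders for empty metadata.
    // Assumption: the title and author arrive as strings, the page count as an int.
    public static string Describe(string fileName, string title, string author, int numberOfPages)
    {
        string shownTitle = string.IsNullOrWhiteSpace(title) ? "(untitled)" : title.Trim();
        string shownAuthor = string.IsNullOrWhiteSpace(author) ? "(unknown author)" : author.Trim();
        return $"{fileName}: \"{shownTitle}\" by {shownAuthor}, {numberOfPages} page(s)";
    }
}
```

With the variables from the example above, the call would be PdfSummary.Describe(filename, title, author, numberOfPages).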

How to Extract Metadata from PDF stream

You can open the stream in whatever way suits your application.

// Open the PDF as a stream; the using block disposes it after extraction
using (var stream = File.OpenRead("path_to_your_pdf_file.pdf"))
{
    // Create ExtractPropertiesOptions object to set input stream
    var options = new ExtractPropertiesOptions(stream);
    // Perform the process and get Properties
    var pdfProperties = PdfExtractor.Extract(options);
    var title = pdfProperties.Title;
    var author = pdfProperties.Author;
    var subject = pdfProperties.Subject;
    var keywords = pdfProperties.Keywords;
    var created = pdfProperties.Created;
    var modified = pdfProperties.Modified;
    var application = pdfProperties.Application;
    var pdfProducer = pdfProperties.PdfProducer;
    var numberOfPages = pdfProperties.NumberOfPages;
}

How to Extract Metadata from PDF file in a single statement

// Perform the process and get Properties
var pdfProperties = PdfExtractor.Extract(new ExtractPropertiesOptions("path_to_your_pdf_file.pdf"));

Key Features:

Available metadata: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.

Export Form Data

The Documentize PDF Extractor for .NET plugin provides a seamless way to extract and export data from PDF forms (AcroForms) into other formats like CSV. This dynamic tool simplifies the process of retrieving form field values, allowing for easy data management, transfer, and analysis.

How to Export Form Data from PDF to CSV

To export form data from a PDF to CSV, follow these steps:

Create an instance of the ExtractFormDataToDsvOptions class, passing the delimiter and whether to include field names.
Add the input PDF file using the AddInput method.
Specify the output CSV file using the AddOutput method.
Run the Extract method of PdfExtractor to perform the export.

// Create ExtractFormDataToDsvOptions object to set instructions
var options = new ExtractFormDataToDsvOptions(',', true);
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output file path
options.AddOutput(new FileData("path_to_result_csv_file.csv"));
// Perform the process
PdfExtractor.Extract(options);

How to Export Form Data from PDF to TSV

Use a tab character as the delimiter.

// Create ExtractFormDataToDsvOptions object to set instructions
var options = new ExtractFormDataToDsvOptions();
// Set the delimiter to a tab character
options.Delimiter = '\t';
// Add field names to the result
options.AddFieldName = true;
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output file path
options.AddOutput(new FileData("path_to_result_tsv_file.tsv"));
// Perform the process
PdfExtractor.Extract(options);
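Once the CSV or TSV file has been written, it can be read back with the .NET base library alone. The helper below is a hypothetical sketch; DsvReader and ReadRows are illustrative names, not part of Documentize. It assumes that no field contains the delimiter or a line break, so it does no quote handling, and it makes no assumption about how the exporter arranges field names and values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DsvReader
{
    // Splits each non-empty line into cells using the given delimiter.
    // Naive by design: quoted fields containing the delimiter are not supported.
    public static List<string[]> ReadRows(TextReader reader, char delimiter)
    {
        var rows = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
            {
                continue; // skip blank lines
            }
            rows.Add(line.Split(delimiter));
        }
        return rows;
    }
}
```

For the CSV example, pass File.OpenText("path_to_result_csv_file.csv") and ',' to ReadRows; for the TSV example, pass the TSV output path and '\t'. If the exported data may contain quoted fields, a dedicated CSV parser is the safer choice.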

Key Features:

Export Form Data: Extract data from PDF forms (AcroForms) into CSV or other formats.
Data Filtering: Use predicates to filter specific form fields for export based on criteria like field type or page number.
Flexible Output: Save exported data for analysis or transfer to spreadsheets, databases, or other document formats.
