PDF Text Extractor

The Documentize PDF Text Extractor for .NET simplifies extracting text from PDF documents. Whether you need pure, raw, or flattened text, the plugin lets you extract it efficiently, preserving the original formatting or omitting it as needed.

How to Extract Text from PDF

To extract text from a PDF document, follow these steps:

Create an instance of TextExtractorOptions to configure the extraction options.
Add the input PDF file using the AddInput method.
Run the Process method to extract the text.
Access the extracted text through ResultContainer.ResultCollection.

// Create TextExtractorOptions object to set instructions
var options = new TextExtractorOptions();
// Add input file path
options.AddInput(new FileDataSource("path_to_your_pdf_file.pdf"));
// Perform the process
var results = TextExtractor.Process(options);
// Get the extracted text from the ResultContainer object
var textExtracted = results.ResultCollection[0].ToString();
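In practice the extracted string is usually saved or passed on to other code. The following is a minimal sketch of one way to wrap the steps above in a reusable helper that writes the result to a .txt file next to the input. The `Documentize` namespace, the `PdfTextExport` class, and the `ExtractToTextFile` method name are assumptions made for illustration; only the TextExtractorOptions, FileDataSource, and TextExtractor calls come from the example above.

```csharp
using System;
using System.IO;
using Documentize; // assumed namespace for the plugin types

static class PdfTextExport
{
    // Hypothetical helper: extracts text from a PDF and saves it beside the input file.
    public static string ExtractToTextFile(string pdfPath)
    {
        // Fail early with a clear message if the input is missing.
        if (!File.Exists(pdfPath))
            throw new FileNotFoundException("Input PDF not found.", pdfPath);

        // Same steps as in the example above: options, input, process.
        var options = new TextExtractorOptions();
        options.AddInput(new FileDataSource(pdfPath));
        var results = TextExtractor.Process(options);

        // Take the first result as text; fall back to an empty string if it is null.
        var text = results.ResultCollection[0].ToString() ?? string.Empty;

        // "report.pdf" becomes "report.txt" in the same folder.
        var outputPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(outputPath, text);
        return outputPath;
    }
}
```

Calling `PdfTextExport.ExtractToTextFile("path_to_your_pdf_file.pdf")` would then return the path of the written text file. Keeping the file check and the write step outside the plugin calls makes it easier to swap in a different output target later.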

Text Extraction Modes

The TextExtractor plugin offers three extraction modes. You select one by passing a TextFormattingMode value when you create TextExtractorOptions.

Pure Mode: Preserves the original formatting, including spaces and alignment.
Raw Mode: Extracts the text without formatting, useful for raw data processing.
Flatten Mode: Represents the PDF content as text fragments positioned by their coordinates.

// Create TextExtractorOptions object to set TextFormattingMode
var options = new TextExtractorOptions(TextFormattingMode.Pure);
// Add input file path
options.AddInput(new FileDataSource("path_to_your_pdf_file.pdf"));
// Perform the process
var results = TextExtractor.Process(options);
// Get the extracted text from the ResultContainer object
var textExtracted = results.ResultCollection[0].ToString();
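To decide which mode suits a given document, it can help to run all three on the same file and compare the output. The sketch below does that in a loop. It assumes the `Documentize` namespace and that the enum exposes `TextFormattingMode.Raw` and `TextFormattingMode.Flatten` members matching the mode names listed above; only `TextFormattingMode.Pure` is confirmed by the preceding example.

```csharp
using System;
using Documentize; // assumed namespace for the plugin types

var inputPath = "path_to_your_pdf_file.pdf";

// Raw and Flatten are assumed enum members, named after the documented modes.
var modes = new[]
{
    TextFormattingMode.Pure,
    TextFormattingMode.Raw,
    TextFormattingMode.Flatten
};

foreach (var mode in modes)
{
    // Build fresh options for each run so the modes do not share state.
    var options = new TextExtractorOptions(mode);
    options.AddInput(new FileDataSource(inputPath));

    var results = TextExtractor.Process(options);
    var text = results.ResultCollection[0].ToString() ?? string.Empty;

    // Print a header with the mode name and length, then the extracted text.
    Console.WriteLine($"--- {mode} ({text.Length} characters) ---");
    Console.WriteLine(text);
}
```

Comparing the three outputs side by side shows how much spacing and alignment each mode keeps for your particular document.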

Key Features:

Pure Mode: Extract text while preserving its original formatting.
Raw Mode: Extract text without any formatting.
Flatten Mode: Extract text as fragments positioned by their coordinates.