PDF Text Extractor

The Documentize PDF Text Extractor for .NET simplifies extracting text from PDF documents. It offers three extraction modes (Pure, Raw, and Plain), so you can keep the original formatting or drop it, depending on your needs.

Key Features:

  • Pure Mode: Extract text while preserving its original formatting.
  • Raw Mode: Extract text without any formatting.
  • Plain Mode: Extract text without special characters or formatting.
  • Batch Processing: Extract text from multiple PDFs at once.

How to Extract Text from PDF Documents

To extract text from a PDF document, follow these steps:

  1. Create an instance of the TextExtractor class.
  2. Create an instance of TextExtractorOptions to configure the extraction options.
  3. Add the input PDF file using the AddInput method.
  4. Run the Process method to extract the text.
  5. Access the extracted text through ResultContainer.ResultCollection.

using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add the input PDF
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input.pdf"));

// Process the text extraction
var resultContainer = extractor.Process(textExtractorOptions);

// Print the extracted text
var extractedText = resultContainer.ResultCollection[0];
Console.WriteLine(extractedText);
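
In practice the extracted text is often saved rather than printed. The following is a minimal sketch using only standard .NET file APIs; ExtractedTextWriter and its Save method are hypothetical names introduced here for illustration and are not part of the Documentize API.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtractedTextWriter
{
    // Hypothetical helper (not part of Documentize): writes text to outputPath
    // as UTF-8 without a byte order mark, creating the target directory first
    // if it does not exist yet.
    public static void Save(string text, string outputPath)
    {
        var directory = Path.GetDirectoryName(Path.GetFullPath(outputPath));
        if (!string.IsNullOrEmpty(directory))
        {
            Directory.CreateDirectory(directory);
        }

        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }
}
```

With the sample above, a call might look like ExtractedTextWriter.Save(extractedText.ToString(), @"C:\Samples\output.txt"), assuming the result's ToString() returns the extracted text, as the Console.WriteLine call suggests.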

Extracting Text from Multiple PDFs

The plugin can extract text from several PDFs in a single Process call: add each file with AddInput, then read one result per input from ResultCollection.

using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add multiple input PDFs
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input1.pdf"));
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input2.pdf"));

// Process the extraction
var resultContainer = extractor.Process(textExtractorOptions);

// Output the extracted text
foreach (var result in resultContainer.ResultCollection)
{
    Console.WriteLine(result);
}
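
When several inputs are processed together, it can help to write each result to a text file named after its source PDF. The sketch below covers only the naming step with standard .NET path APIs. OutputNaming and TextPathFor are hypothetical names, and the commented usage assumes that results come back in the same order the inputs were added, which the sample above does not confirm.

```csharp
using System;
using System.IO;

public static class OutputNaming
{
    // Hypothetical helper (not part of Documentize): maps a PDF path such as
    // "report.pdf" to "<outputDir>/report.txt".
    public static string TextPathFor(string pdfPath, string outputDir)
    {
        var baseName = Path.GetFileNameWithoutExtension(pdfPath);
        return Path.Combine(outputDir, baseName + ".txt");
    }
}

// Usage sketch, ASSUMING results are returned in input order:
// var inputs = new[] { @"C:\Samples\input1.pdf", @"C:\Samples\input2.pdf" };
// var i = 0;
// foreach (var result in resultContainer.ResultCollection)
// {
//     var target = OutputNaming.TextPathFor(inputs[i++], @"C:\Samples\out");
//     File.WriteAllText(target, result.ToString());
// }
```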

Text Extraction Modes

The TextExtractor plugin offers three extraction modes, providing flexibility based on your needs.

  1. Pure Mode: Preserves the original formatting, including spaces and alignment.
  2. Raw Mode: Extracts the text without formatting, useful for raw data processing.
  3. Plain Mode: Extracts text without special characters or additional formatting.

using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Set to Pure mode
textExtractorOptions.Mode = ExtractionMode.Pure;
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input.pdf"));

// Process and output
var resultContainer = extractor.Process(textExtractorOptions);
Console.WriteLine(resultContainer.ResultCollection[0]);

How to Handle Batch Processing

For large document sets, use batch processing: add every input PDF to a single TextExtractorOptions instance and call Process once. The results for all files are returned together in ResultCollection.

using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add multiple input PDFs
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\batch1.pdf"));
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\batch2.pdf"));

// Process all inputs in one call
var resultContainer = extractor.Process(textExtractorOptions);

// Handle extracted text
foreach (var result in resultContainer.ResultCollection)
{
    Console.WriteLine(result);
}
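
For larger batches, listing every file by hand quickly becomes impractical. One possible approach is to collect the PDF paths from a folder first and then pass each one to AddInput. The sketch below uses only standard .NET APIs for the collection step; PdfFolder and ListPdfs are hypothetical names, not part of the Documentize API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PdfFolder
{
    // Hypothetical helper (not part of Documentize): returns the paths of all
    // files with a .pdf extension (case-insensitive) directly inside folder,
    // ignoring subfolders, sorted by path for a stable processing order.
    public static List<string> ListPdfs(string folder)
    {
        return Directory.EnumerateFiles(folder)
            .Where(p => string.Equals(Path.GetExtension(p), ".pdf",
                                      StringComparison.OrdinalIgnoreCase))
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}

// Usage sketch with the options object from the sample above:
// foreach (var path in PdfFolder.ListPdfs(@"C:\Samples"))
// {
//     textExtractorOptions.AddInput(new FileDataSource(path));
// }
```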