PDF Text Extractor
The Documentize PDF Text Extractor for .NET simplifies extracting text from PDF documents. It offers pure, raw, and plain extraction modes, so you can preserve the original formatting or omit it depending on your needs.
Key Features:
- Pure Mode: Extract text while preserving its original formatting.
- Raw Mode: Extract text without any formatting.
- Plain Mode: Extract text without special characters or formatting.
- Batch Processing: Extract text from multiple PDFs at once.
How to Extract Text from PDF Documents
To extract text from a PDF document, follow these steps:
- Create an instance of the TextExtractor class.
- Create an instance of TextExtractorOptions to configure the extraction options.
- Add the input PDF file using the AddInput method.
- Run the Process method to extract the text.
- Access the extracted text through ResultContainer.ResultCollection.
using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add the input PDF
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input.pdf"));

// Process the text extraction
var resultContainer = extractor.Process(textExtractorOptions);

// Print the extracted text
var extractedText = resultContainer.ResultCollection[0];
Console.WriteLine(extractedText);
Extracting Text from Multiple PDFs
The plugin can extract text from multiple PDFs in a single run: add each input file to the same TextExtractorOptions instance, call Process once, and read the results from ResultCollection.
using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add multiple input PDFs
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input1.pdf"));
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input2.pdf"));

// Process the extraction
var resultContainer = extractor.Process(textExtractorOptions);

// Output the extracted text
foreach (var result in resultContainer.ResultCollection)
{
    Console.WriteLine(result);
}
Text Extraction Modes
The TextExtractor plugin offers three extraction modes. Set the one you need through the Mode property of TextExtractorOptions.
- Pure Mode: Preserves the original formatting, including spaces and alignment.
- Raw Mode: Extracts the text without formatting, useful for raw data processing.
- Plain Mode: Extracts text without special characters or additional formatting.
// Create the extractor used to process the options below
using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Set to Pure mode
textExtractorOptions.Mode = ExtractionMode.Pure;
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\input.pdf"));

// Process and output
var resultContainer = extractor.Process(textExtractorOptions);
Console.WriteLine(resultContainer.ResultCollection[0]);
How to Handle Batch Processing
For large document sets, use batch processing: add every input PDF to the same options object and extract the text from all of them in one Process call.
using var extractor = new TextExtractor();
var textExtractorOptions = new TextExtractorOptions();

// Add multiple input PDFs
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\batch1.pdf"));
textExtractorOptions.AddInput(new FileDataSource(@"C:\Samples\batch2.pdf"));

// Process all inputs in a single call
var resultContainer = extractor.Process(textExtractorOptions);

// Handle extracted text
foreach (var result in resultContainer.ResultCollection)
{
    Console.WriteLine(result);
}