Extract Text

The Documentize PDF Extractor for .NET simplifies extracting text from PDF documents. Whether you need pure, raw, or flattened text, the plugin extracts it efficiently and either preserves or omits the original formatting, depending on the mode you choose.

How to Extract Text from a PDF file

To extract text from a PDF file, follow these steps:

  1. Create an instance of ExtractTextOptions to configure the input file path.
  2. Run the Extract method to extract the text.

```csharp
// Create ExtractTextOptions object to set input file path
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf");
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);
```
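
If you want to keep the result, the extracted text can be written to a file. The following is a minimal sketch rather than a documented pattern: it assumes that Extract returns the text as a string (the variable name above suggests this, but the page does not state it) and that the types live in the Documentize namespace. The output file name extracted.txt is a placeholder.

```csharp
using System.IO;
using System.Text;
using Documentize; // assumed namespace for PdfExtractor and ExtractTextOptions

// Configure the input file path
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf");

// Assumption: Extract returns the extracted text as a string
string text = PdfExtractor.Extract(options);

// Write the text to a UTF-8 file (placeholder name)
File.WriteAllText("extracted.txt", text, Encoding.UTF8);
```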

How to Extract Text from a PDF stream

To extract text from a PDF stream, follow these steps:

  1. Create an instance of ExtractTextOptions to configure the input stream.
  2. Run the Extract method to extract the text.

```csharp
// Open the PDF as a stream; "using" closes it when it goes out of scope
using var stream = File.OpenRead("path_to_your_pdf_file.pdf");
// Create ExtractTextOptions object to set input stream
var options = new ExtractTextOptions(stream);
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);
```

Text Extraction Modes

ExtractTextOptions offers three extraction modes, selected with a TextFormattingMode value, so you can choose how much of the original layout to keep.

  1. Pure Mode: Preserves the original formatting, including spaces and alignment.
  2. Raw Mode: Extracts the text without formatting, useful for raw data processing.
  3. Flatten Mode: Represents the PDF content as text fragments positioned by their coordinates.

```csharp
// Create ExtractTextOptions object to set input file path and TextFormattingMode
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure);
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);
```
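
To see how the modes differ on your own document, you can run the same file through each of them. This is a sketch under stated assumptions: the enum members TextFormattingMode.Raw and TextFormattingMode.Flatten are inferred from the mode names in the list above (only Pure appears in the original example), the Documentize namespace is assumed, and Extract is assumed to return a string.

```csharp
using System;
using Documentize; // assumed namespace for PdfExtractor and ExtractTextOptions

// Assumption: Raw and Flatten are enum members named after the modes listed above
var modes = new[]
{
    TextFormattingMode.Pure,
    TextFormattingMode.Raw,
    TextFormattingMode.Flatten
};

foreach (var mode in modes)
{
    // Same input file, different formatting mode on each pass
    var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", mode);

    // Assumption: Extract returns the extracted text as a string
    string text = PdfExtractor.Extract(options);

    Console.WriteLine($"--- {mode} ({text.Length} characters) ---");
    Console.WriteLine(text);
}
```

Comparing the three outputs side by side is usually the quickest way to decide which mode suits a given document.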

How to Extract Text from a PDF file in the shortest possible way

```csharp
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure));
```

Key Features:

  • Pure Mode: Extract text while preserving its original formatting.
  • Raw Mode: Extract text without any formatting.
  • Flatten Mode: Extract text as fragments positioned by their coordinates.