PDF Extractor

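Extract Text

Extract text from PDF documents accurately with Documentize's .NET tools, so you can retrieve, process, and analyze content with ease.

Extract Images

Extract images from PDF documents in your .NET applications with ease.

Extract Properties / Metadata

Extract PDF metadata accurately in C#/.NET with Documentize.

Export Form Data

Extract data from PDF forms (AcroForms) with C#/.NET and export it to CSV and other formats.

Subsections of PDF Extractor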

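Extract Text

The Documentize PDF Extractor for .NET simplifies extracting text from PDF documents. Whether you need pure, raw, or flattened text, this plugin extracts it efficiently and keeps or drops formatting as required.

How to Extract Text from a PDF File

To extract text from a PDF file, follow these steps:

  1. Create an instance of ExtractTextOptions to configure the input file path.
  2. Run the Extract method to extract the text.

// Create ExtractTextOptions object to set input file path
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf");
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);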

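How to Extract Text from a PDF Stream

To extract text from a PDF stream, follow these steps:

  1. Create an instance of ExtractTextOptions to configure the input stream.
  2. Run the Extract method to extract the text.

// Open the input stream; the using declaration closes it when it goes out of scope
using var stream = File.OpenRead("path_to_your_pdf_file.pdf");
// Create ExtractTextOptions object to set input stream
var options = new ExtractTextOptions(stream);
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);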

Text Extraction Modes

ExtractTextOptions offers three extraction modes to suit different needs.

  1. Pure Mode: preserves the original formatting, including spacing and alignment.
  2. Raw Mode: does not preserve formatting; suited to raw data processing.
  3. Flatten Mode: represents the PDF content as text fragments positioned by their coordinates.

// Create ExtractTextOptions object to set input file path and TextFormattingMode
var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure);
// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(options);

The Shortest Way to Extract Text from a PDF File

// Perform the process and get the extracted text
var textExtracted = PdfExtractor.Extract(new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure));

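Key Features:

  • Pure Mode: extracts text while preserving the original formatting.
  • Raw Mode: extracts text without any formatting.
  • Flatten Mode: extracts text as fragments positioned by their coordinates.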

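Extract Images

The Documentize PDF Extractor for .NET plugin lets you extract images from PDF documents with ease. It scans your PDF files, identifies embedded images, and extracts them while preserving their original quality and format. This makes visual content more accessible and simplifies retrieving images from PDFs.

How to Extract Images from a PDF

To extract images from a PDF file, follow these steps:

  1. Create an instance of the ExtractImagesOptions class.
  2. Add the input file path to the options with the AddInput method.
  3. Set the output directory path for the images with the AddOutput method.
  4. Run the image extraction with the plugin.
  5. Retrieve the extracted images from the result container.

// Create ExtractImagesOptions to set instructions
var options = new ExtractImagesOptions();
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output directory path
options.AddOutput(new DirectoryData("path_to_results_directory"));
// Perform the process
var results = PdfExtractor.Extract(options);
// Get path to the first image result
var imageExtracted = results.ResultCollection[0].ToFile();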

Extract Images from a PDF File to a Stream Without a Folder

The PdfExtractor plugin supports saving to streams, so you can extract images from a PDF file without a temporary folder.

// Create ExtractImagesOptions to set instructions
var options = new ExtractImagesOptions();
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// No output is set, so results are written to streams
// Perform the process
var results = PdfExtractor.Extract(options);
// Get the stream of the first image result
var ms = results.ResultCollection[0].ToStream();
// Copy the data to a file for demonstration
ms.Seek(0, SeekOrigin.Begin);
using (var fs = File.Create("test_file.png"))
{
    ms.CopyTo(fs);
}
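
The demo above writes the stream to a file with a hardcoded .png extension. If extracted images keep their original format, as the introduction states, the stream may hold JPEG or other data instead. The following is a minimal sketch, using only the .NET base library, of guessing a file extension from the stream's leading bytes. ImageFormatSniffer and GuessExtension are hypothetical helper names, not part of Documentize, and the list of signatures is deliberately short.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Documentize): guesses an image file
// extension from well-known file signatures at the start of a stream.
public static class ImageFormatSniffer
{
    // Returns ".png", ".jpg", ".gif", ".bmp", ".tif", or ".bin" when unknown.
    // The stream position is restored before returning.
    public static string GuessExtension(Stream stream)
    {
        if (!stream.CanSeek)
            throw new ArgumentException("Stream must be seekable.", nameof(stream));

        long original = stream.Position;
        stream.Position = 0;

        var header = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }
        stream.Position = original;

        if (StartsWith(header, read, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return ".png";
        if (StartsWith(header, read, 0xFF, 0xD8, 0xFF)) return ".jpg";
        if (StartsWith(header, read, 0x47, 0x49, 0x46, 0x38)) return ".gif";
        if (StartsWith(header, read, 0x42, 0x4D)) return ".bmp";
        if (StartsWith(header, read, 0x49, 0x49, 0x2A, 0x00)) return ".tif"; // little-endian TIFF
        if (StartsWith(header, read, 0x4D, 0x4D, 0x00, 0x2A)) return ".tif"; // big-endian TIFF
        return ".bin";
    }

    private static bool StartsWith(byte[] buffer, int length, params byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (buffer[i] != signature[i]) return false;
        }
        return true;
    }
}
```

With the ms stream from the sample above, the output file could then be created as File.Create("test_file" + ImageFormatSniffer.GuessExtension(ms)).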

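Key Features:

  • Extract embedded images: identifies and extracts images from PDF documents.
  • Preserve image quality: extracted images keep their original quality.
  • Flexible output: save extracted images in the format or location you prefer.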

Extract Properties / Metadata

The Documentize PDF Extractor for .NET simplifies extracting metadata from PDF documents. The available properties are FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, and Number of Pages.

How to Extract Metadata from a PDF File

This example shows how to extract properties such as Title, Author, Subject, Keywords, and Number of Pages from a PDF file. To extract metadata from a PDF document, follow these steps:

  1. Create an instance of ExtractPropertiesOptions to configure the extraction options and the input PDF file.
  2. Run the Extract method of PdfExtractor to extract the metadata.
  3. Access the extracted properties through the returned PdfProperties.

// Create ExtractPropertiesOptions object to set input file
var options = new ExtractPropertiesOptions("path_to_your_pdf_file.pdf");
// Perform the process and get Properties
var pdfProperties = PdfExtractor.Extract(options);
var filename = pdfProperties.FileName;
var title = pdfProperties.Title;
var author = pdfProperties.Author;
var subject = pdfProperties.Subject;
var keywords = pdfProperties.Keywords;
var created = pdfProperties.Created;
var modified = pdfProperties.Modified;
var application = pdfProperties.Application;
var pdfProducer = pdfProperties.PdfProducer;
var numberOfPages = pdfProperties.NumberOfPages;

How to Extract Metadata from a PDF Stream

You can open the stream in whatever way suits your application.

// Open the input stream; the using declaration closes it when it goes out of scope
using var stream = File.OpenRead("path_to_your_pdf_file.pdf");
// Create ExtractPropertiesOptions object to set input stream
var options = new ExtractPropertiesOptions(stream);
// Perform the process and get Properties
var pdfProperties = PdfExtractor.Extract(options);
var title = pdfProperties.Title;
var author = pdfProperties.Author;
var subject = pdfProperties.Subject;
var keywords = pdfProperties.Keywords;
var created = pdfProperties.Created;
var modified = pdfProperties.Modified;
var application = pdfProperties.Application;
var pdfProducer = pdfProperties.PdfProducer;
var numberOfPages = pdfProperties.NumberOfPages;

The Shortest Way to Extract Metadata from a PDF File

// Perform the process and get Properties
var pdfProperties = PdfExtractor.Extract(new ExtractPropertiesOptions("path_to_your_pdf_file.pdf"));

Key Features:

  • Available metadata: FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, Number of Pages.

Export Form Data

The Documentize PDF Extractor for .NET plugin extracts data from PDF forms (AcroForms) and exports it to other formats such as CSV. It retrieves form field values so that they can be managed, transferred, and analyzed more easily.

How to Export Form Data from PDF to CSV

To export form data from a PDF to CSV, follow these steps:

  1. Create an instance of the ExtractFormDataToDsvOptions class, passing the delimiter and whether to include field names.
  2. Add the input PDF file with the AddInput method.
  3. Specify the output CSV file with the AddOutput method.
  4. Run the Extract method to perform the export.

// Create ExtractFormDataToDsvOptions object to set instructions
var options = new ExtractFormDataToDsvOptions(',', true);
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output file path
options.AddOutput(new FileData("path_to_result_csv_file.csv"));
// Perform the process
PdfExtractor.Extract(options);

How to Export Form Data from PDF to TSV

To produce TSV, set the delimiter to a tab character.

// Create ExtractFormDataToDsvOptions object to set instructions
var options = new ExtractFormDataToDsvOptions();
// Set delimiter
options.Delimiter = '\t';
// Add field names to result
options.AddFieldName = true;
// Add input file path
options.AddInput(new FileData("path_to_your_pdf_file.pdf"));
// Set output file path
options.AddOutput(new FileData("path_to_result_csv_file.tsv"));
// Perform the process
PdfExtractor.Extract(options);
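
Once the file is exported, it can be read back with plain .NET code. The sketch below assumes, without confirmation from the Documentize documentation, that the output contains one header row of field names (when AddFieldName is true) followed by one row of values per record. DsvReader and Parse are hypothetical names, and the naive split does not handle quoted values that contain the delimiter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Documentize): reads delimiter-separated
// lines into one dictionary per data row, keyed by the header row's names.
public static class DsvReader
{
    // ASSUMPTION: the first non-empty line is the header row of field names.
    // Quoted values containing the delimiter are NOT handled.
    public static List<Dictionary<string, string>> Parse(IEnumerable<string> lines, char delimiter)
    {
        var rows = new List<Dictionary<string, string>>();
        string[] header = Array.Empty<string>();
        bool haveHeader = false;

        foreach (var line in lines)
        {
            if (string.IsNullOrEmpty(line)) continue;

            var cells = line.Split(delimiter);
            if (!haveHeader)
            {
                header = cells;
                haveHeader = true;
                continue;
            }

            var row = new Dictionary<string, string>();
            for (int i = 0; i < header.Length; i++)
            {
                // Missing trailing cells become empty strings.
                row[header[i]] = i < cells.Length ? cells[i] : string.Empty;
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

For the TSV file written above, a call could look like DsvReader.Parse(File.ReadLines("path_to_result_csv_file.tsv"), '\t'); for the CSV example, pass ',' instead.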

Key Features:

  • Export Form Data: Extract data from PDF forms (AcroForms) into CSV or other formats.
  • Data Filtering: Use predicates to filter specific form fields for export based on criteria like field type or page number.
  • Flexible Output: Save exported data for analysis or transfer to spreadsheets, databases, or other document formats.