PDF Extractor
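Extract Text
Accurately extract text from PDF documents with Documentize's .NET tools, so you can retrieve, process, and analyze content.
Extract Images
Extract images from PDF documents in .NET applications.
Extract Properties / Metadata
Accurately extract PDF metadata in C#/.NET with Documentize.
Export Form Data
Extract data from PDF forms (AcroForms) and export it to CSV and other formats using C#/.NET.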
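The Documentize PDF Extractor for .NET simplifies extracting text from PDF documents. Whether you need pure, raw, or plain text, the plugin extracts it and keeps or drops formatting as required.
To extract text from a PDF file, follow these steps:
1. Create an ExtractTextOptions instance to configure the input file path.
2. Call the Extract method to extract the text.
    // Create an ExtractTextOptions object to set the input file path
    var options = new ExtractTextOptions("path_to_your_pdf_file.pdf");
    // Run the extraction and get the extracted text
    var textExtracted = PdfExtractor.Extract(options);
To extract text from a PDF stream, follow these steps: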
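1. Create an ExtractTextOptions instance to configure the input stream.
2. Call the Extract method to extract the text.
    // Open the PDF as a stream; the using declaration closes it when it goes out of scope
    using var stream = File.OpenRead("path_to_your_pdf_file.pdf");
    // Create an ExtractTextOptions object to set the input stream
    var options = new ExtractTextOptions(stream);
    // Run the extraction and get the extracted text
    var textExtracted = PdfExtractor.Extract(options);
ExtractTextOptions offers three extraction modes to suit different needs.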
    // Create an ExtractTextOptions object to set the input file path and TextFormattingMode
    var options = new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure);
    // Run the extraction and get the extracted text
    var textExtracted = PdfExtractor.Extract(options);
The same call can be written as a single statement:
    // Run the extraction and get the extracted text
    var textExtracted = PdfExtractor.Extract(new ExtractTextOptions("path_to_your_pdf_file.pdf", TextFormattingMode.Pure));
The Documentize PDF Extractor for .NET plugin extracts images from PDF documents. It scans the PDF, identifies embedded images, and extracts them in their original quality and format, which makes the visual content of a PDF easier to retrieve and reuse.
To extract images from a PDF file, follow these steps:
1. Create an instance of the ExtractImagesOptions class.
2. Use the AddInput method to add the input file path to the options.
3. Use the AddOutput method to set the output directory path for the images.
4. Call the Extract method and read the extracted images from ResultCollection.
    // Create ExtractImagesOptions to hold the instructions
    var options = new ExtractImagesOptions();
    // Add the input file path
    options.AddInput(new FileDataSource("path_to_your_pdf_file.pdf"));
    // Set the output directory path
    options.AddOutput(new DirectoryDataSource("path_to_results_directory"));
    // Run the extraction
    var results = PdfExtractor.Extract(options);
    // Get the path to the first extracted image
    var imageExtracted = results.ResultCollection[0].ToFile();
The PdfExtractor plugin can also write results to streams, so you can extract images from PDF files without using temporary folders.
    // Create ExtractImagesOptions to hold the instructions
    var options = new ExtractImagesOptions();
    // Add the input file path
    options.AddInput(new FileDataSource("path_to_your_pdf_file.pdf"));
    // No output is set, so the results are written to streams
    // Run the extraction
    var results = PdfExtractor.Extract(options);
    // Get the first result as a stream
    var ms = results.ResultCollection[0].ToStream();
    // Rewind the stream and copy it to a file for demonstration
    ms.Seek(0, SeekOrigin.Begin);
    using (var fs = File.Create("test_file.png"))
    {
        ms.CopyTo(fs);
    }
The Documentize PDF Extractor for .NET simplifies extracting metadata from PDF documents. Available properties include FileName, Title, Author, Subject, Keywords, Created, Modified, Application, PDF Producer, and Number of Pages.
This example shows how to extract properties (Title, Author, Subject, Keywords, Number of Pages) from a PDF file. To extract metadata from a PDF document, follow these steps:
1. Create an instance of ExtractPropertiesOptions to configure the extraction options and the input PDF file.
2. Call the Extract method of PdfExtractor to extract the metadata.
3. Read the extracted values from the returned PdfProperties.
    // Create an ExtractPropertiesOptions object to set the input file
    var options = new ExtractPropertiesOptions("path_to_your_pdf_file.pdf");
    // Run the extraction and get the properties
    var pdfProperties = PdfExtractor.Extract(options);
    var filename = pdfProperties.FileName;
    var title = pdfProperties.Title;
    var author = pdfProperties.Author;
    var subject = pdfProperties.Subject;
    var keywords = pdfProperties.Keywords;
    var created = pdfProperties.Created;
    var modified = pdfProperties.Modified;
    var application = pdfProperties.Application;
    var pdfProducer = pdfProperties.PdfProducer;
    var numberOfPages = pdfProperties.NumberOfPages;
You can also pass the PDF as a stream that you open yourself.
    // Open the PDF as a stream; the using declaration closes it when it goes out of scope
    using var stream = File.OpenRead("path_to_your_pdf_file.pdf");
    // Create an ExtractPropertiesOptions object to set the input stream
    var options = new ExtractPropertiesOptions(stream);
    // Run the extraction and get the properties
    var pdfProperties = PdfExtractor.Extract(options);
    var title = pdfProperties.Title;
    var author = pdfProperties.Author;
    var subject = pdfProperties.Subject;
    var keywords = pdfProperties.Keywords;
    var created = pdfProperties.Created;
    var modified = pdfProperties.Modified;
    var application = pdfProperties.Application;
    var pdfProducer = pdfProperties.PdfProducer;
    var numberOfPages = pdfProperties.NumberOfPages;
The same call can be written as a single statement:
    // Run the extraction and get the properties
    var pdfProperties = PdfExtractor.Extract(new ExtractPropertiesOptions("path_to_your_pdf_file.pdf"));
The Documentize PDF Extractor for .NET plugin extracts data from PDF forms (AcroForms) and exports it to CSV and other formats. It simplifies retrieving form field values for data management, transfer, and analysis.
To export PDF form data to CSV, follow these steps:
1. Create an instance of the ExtractFormDataToDsvOptions class to define the export options.
2. Use the AddInput and AddOutput methods to set the input PDF file and the output CSV file.
3. Call the Extract method to run the export.
    // Create an ExtractFormDataToDsvOptions object to hold the instructions
    var options = new ExtractFormDataToDsvOptions(',', true);
    // Add the input file path
    options.AddInput(new FileDataSource("path_to_your_pdf_file.pdf"));
    // Set the output file path
    options.AddOutput(new FileDataSource("path_to_result_csv_file.csv"));
    // Run the export
    PdfExtractor.Extract(options);
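Once the export has written the CSV file, the form values can be read back with plain .NET. The following is a minimal sketch, not part of Documentize: FormCsvReader is a hypothetical helper, and it assumes the exported file has field names in its first row and one row of values per form, which the source does not confirm. It also uses a naive split, so quoted values that contain the delimiter are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not a Documentize type) for reading an exported CSV file.
public static class FormCsvReader
{
    // Reads a delimiter-separated file and returns one dictionary per data row,
    // keyed by the names found in the first row.
    // Assumption: the first row contains field names.
    // Limitation: quoted fields containing the delimiter are not supported.
    public static List<Dictionary<string, string>> Read(string path, char delimiter = ',')
    {
        // Skip empty lines so a trailing newline does not produce an empty row
        var lines = File.ReadAllLines(path).Where(l => l.Length > 0).ToArray();
        var rows = new List<Dictionary<string, string>>();
        if (lines.Length == 0)
        {
            return rows;
        }

        var header = lines[0].Split(delimiter);
        foreach (var line in lines.Skip(1))
        {
            var cells = line.Split(delimiter);
            var row = new Dictionary<string, string>();
            for (int i = 0; i < header.Length; i++)
            {
                // Missing trailing cells are stored as empty strings
                row[header[i]] = i < cells.Length ? cells[i] : string.Empty;
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

Calling FormCsvReader.Read("path_to_result_csv_file.csv") would return a list in which each entry maps a field name to its value. For files with quoted or multi-line values, a dedicated CSV parser is the safer choice.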