
PDF Content As Text In .NET

WRITTEN BY Henrik Brinch - 04 March 2013

During a large project, I was asked to develop a tool that could compare data from text files with data from another system (though formatted differently, and in PDF files!).

I just needed the text of the individual PDF pages; then I could easily do the comparison with some regular expression magic.  To get the text out of the PDF files, I could write my own PDF parser.  Sigh!  That was not an option, as this was on a tight schedule.

Lucky for me, there is an open source project called iTextSharp, which is a PDF library for .NET.  Using this I was able to read in the PDF and quickly extract the text of the individual pages, simply by doing as follows (after referencing the appropriate assemblies, of course):

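using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Returns the text of every page in the PDF, one page after another.
private string PdfAsText(string file)
{
    var result = new StringBuilder();

    var reader = new PdfReader(file);
    try
    {
        var numPages = reader.NumberOfPages;

        // PDF page numbers are 1-based.
        for (var page = 1; page <= numPages; page++)
        {
            var text = PdfTextExtractor.GetTextFromPage(reader, page);
            result.AppendLine(text);
        }
    }
    finally
    {
        // Release the file handle held by the reader.
        reader.Close();
    }

    return result.ToString();
}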
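
With the page text in hand, the comparison itself could look roughly like the sketch below. It is not the tool from the project: the "Name: value" line format, the PageComparer class and the ParseFields and FindMismatches helpers are all assumptions made up for illustration. The idea is simply to pull fields out of the extracted text with a regular expression and check them against the values expected from the text files.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class PageComparer
{
    // Assumed format: each field sits on its own line as "Name: value".
    // Multiline makes ^ and $ match at line boundaries; \r? tolerates CRLF.
    private static readonly Regex FieldPattern = new Regex(
        @"^[ \t]*(?<key>[^:\r\n]+?)[ \t]*:[ \t]*(?<value>[^\r\n]*?)[ \t]*\r?$",
        RegexOptions.Multiline);

    // Parses the "Name: value" lines of one page into a dictionary.
    public static Dictionary<string, string> ParseFields(string pageText)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in FieldPattern.Matches(pageText))
        {
            fields[match.Groups["key"].Value] = match.Groups["value"].Value;
        }
        return fields;
    }

    // Returns the names of expected fields that are missing from,
    // or have a different value in, the page text.
    public static List<string> FindMismatches(
        IDictionary<string, string> expected, string pageText)
    {
        var actual = ParseFields(pageText);
        var mismatches = new List<string>();
        foreach (var pair in expected)
        {
            string value;
            if (!actual.TryGetValue(pair.Key, out value) || value != pair.Value)
            {
                mismatches.Add(pair.Key);
            }
        }
        return mismatches;
    }
}
```

Feeding FindMismatches the expected values and the text returned by PdfTextExtractor.GetTextFromPage gives back the names of the fields that need a closer look. How the text is laid out depends on how the PDF was produced, so the pattern would have to be adapted to the real documents.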



