PDF Content As Text In .NET

Posted by Henrik Brinch | Mar 4, 2013 | .NET, C#, Coding Tips, Software Development

During a large project, I was asked to develop a tool that could compare data from text files with data from another system, which was formatted differently and delivered as PDF files!

All I needed was the text of each individual PDF page; with that, I could easily do the comparison with some regular expression magic. To get the text out of the PDF files, I could have written my own PDF parser. Sigh! That was not an option, as the project was on a tight schedule.

Lucky for me, there is an open source project called iTextSharp, which is a PDF library for C#. Using it, I was able to read in the PDF and quickly extract the text of the individual pages, simply by doing as follows (after adding a reference to the iTextSharp assembly, of course):

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PdfAsText(string file)
{
    var result = new StringBuilder();

    var reader = new PdfReader(file);
    try
    {
        var numPages = reader.NumberOfPages;

        // PDF page numbers are 1-based.
        for (var page = 1; page <= numPages; page++)
        {
            var text = PdfTextExtractor.GetTextFromPage(reader, page);
            result.AppendLine(text);
        }
    }
    finally
    {
        // Release the reader's resources even if extraction throws.
        reader.Close();
    }

    return result.ToString();
}
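
Once the text is available, the comparison itself can be plain regular expression work. The following is only a minimal sketch of that idea, not the tool from the project: the class name PageTextComparer, the method FindMissing and the "Order: 12345" line format are all made up for illustration, so the pattern would have to be adapted to the actual PDF content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageTextComparer
{
    // Hypothetical pattern: matches text such as "Order: 12345".
    // The real pattern depends entirely on how the PDF text is laid out.
    private static readonly Regex OrderPattern =
        new Regex(@"Order:\s*(\d+)", RegexOptions.IgnoreCase);

    // Returns the expected values that were NOT found in the extracted text.
    public static List<string> FindMissing(string pdfText, IEnumerable<string> expected)
    {
        // Collect every value the pattern captures from the PDF text.
        var found = new HashSet<string>();
        foreach (Match match in OrderPattern.Matches(pdfText))
        {
            found.Add(match.Groups[1].Value);
        }

        // Anything expected but not captured is reported as missing.
        var missing = new List<string>();
        foreach (var value in expected)
        {
            if (!found.Contains(value))
            {
                missing.Add(value);
            }
        }

        return missing;
    }
}
```

From the same class that holds PdfAsText, it could be called as PageTextComparer.FindMissing(PdfAsText(file), expectedValues), where expectedValues would come from the text files being compared.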

About The Author

Henrik Brinch

Self-employed freelance consultant (Architect, Tech. Lead, Developer, Instructor and Mentor) with extensive knowledge in all areas of software development, though with primary focus on Microsoft related technologies. Has a pragmatic approach to life - and believes in "Everything is possible". Living in Denmark with his wife Henriette and four kids (a set of triplets and a kid-brother).