During a large project, I was asked to develop a tool that could compare data from text files with data from another system, which was formatted differently and, of all things, stored in PDF files!
I just needed to get the text of the individual PDF pages; then I could easily do the comparison with some regular expression magic. To get the text out of the PDF files, I could have written my own PDF parser. Sigh! That was not an option, as the project was on a tight schedule.
Lucky for me, there is an open source project called iTextSharp, which is a PDF library for C#. Using this I was able to read in the PDF and quickly extract the text of the individual pages, simply by doing as follows (of course linking to the appropriate assemblies):
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PdfAsText(string file)
{
    var result = new StringBuilder();
    var reader = new PdfReader(file);
    try
    {
        // PDF page numbers are 1-based.
        var numPages = reader.NumberOfPages;
        for (var page = 1; page <= numPages; page++)
        {
            var text = PdfTextExtractor.GetTextFromPage(reader, page);
            result.AppendLine(text);
        }
    }
    finally
    {
        // Release the underlying file handle.
        reader.Close();
    }
    return result.ToString();
}
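Once the page text is available as a plain string, the "regular expression magic" can be fairly small. The following is only a sketch of how such a comparison might look, not the tool from the project: it assumes a hypothetical layout in which each value appears as "Key: Value" on its own line, and the names PageComparer, ExtractFields and FindMismatches are made up for illustration. It uses only the standard .NET regex classes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageComparer
{
    // Assumed (hypothetical) layout: one "Key: Value" pair per line.
    // [ \t]* is used instead of \s* so a match never spills onto the next line.
    private static readonly Regex FieldPattern = new Regex(
        @"^[ \t]*(?<key>[^:\r\n]+?)[ \t]*:[ \t]*(?<value>[^\r\n]*?)[ \t]*\r?$",
        RegexOptions.Multiline);

    // Collects all "Key: Value" pairs found in the extracted page text.
    public static Dictionary<string, string> ExtractFields(string pageText)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in FieldPattern.Matches(pageText))
        {
            fields[match.Groups["key"].Value] = match.Groups["value"].Value;
        }
        return fields;
    }

    // Returns the keys whose value is missing from the page or differs from the expected one.
    public static List<string> FindMismatches(
        IDictionary<string, string> expected, string pageText)
    {
        var actual = ExtractFields(pageText);
        var mismatches = new List<string>();
        foreach (var pair in expected)
        {
            string value;
            if (!actual.TryGetValue(pair.Key, out value) || value != pair.Value)
            {
                mismatches.Add(pair.Key);
            }
        }
        return mismatches;
    }
}
```

The expected dictionary would be filled from the text files, and FindMismatches would then be called once per page with the string returned by PdfTextExtractor.GetTextFromPage. Text extracted from a PDF does not always preserve the visual line layout, so the pattern would likely need adjusting to the actual documents.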