Hi!
You can try Apache PDFBox, which is an open source library (http://pdfbox.apache.org/). With it you can extract the text of your PDF file into a text file.
I hope this will be helpful.
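In case it helps, here is a minimal sketch of that extraction. It assumes PDFBox 2.x, where the stripper lives in org.apache.pdfbox.text (in 1.x it is in org.apache.pdfbox.util, and 3.x loads documents through Loader.loadPDF instead of PDDocument.load). The class name PdfToText and the helper extractText are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, assuming PDFBox 2.x on the classpath.
class PdfToText {

    // Returns the text of pages startPage..endPage (1-based, inclusive).
    static String extractText(PDDocument document, int startPage, int endPage) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(startPage);
        stripper.setEndPage(endPage);
        return stripper.getText(document);
    }

    // Usage: java PdfToText input.pdf output.txt
    public static void main(String[] args) throws IOException {
        File input = new File(args[0]);
        File output = new File(args[1]);
        try (PDDocument document = PDDocument.load(input)) {
            // Extract every page and write the result as UTF-8 text.
            String text = extractText(document, 1, document.getNumberOfPages());
            Files.write(output.toPath(), text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Passing a narrower page range to extractText limits the output to those pages.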
I have used PDFBox myself to index PDFs through Lucene, and it works very well. I just hope you don't run into the disappointment I had to suffer: that the text is actually on an image :/
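One rough way to spot that situation before indexing is to check whether a page yields almost no visible characters. The sketch below is plain Java and only a heuristic. The class name, the method looksImageOnly and the threshold of 10 characters are all assumptions for illustration, and the input string is whatever your extraction library returned for the page.

```java
// Hypothetical heuristic: a page whose extracted text is (nearly) empty
// is probably a scanned image and would need OCR instead.
class ScannedPageCheck {

    // True when the text has fewer than minChars non-whitespace characters.
    static boolean looksImageOnly(String extractedText, int minChars) {
        if (extractedText == null) {
            return true;
        }
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < minChars;
    }

    public static void main(String[] args) {
        String pageText = args.length > 0 ? args[0] : "";
        if (looksImageOnly(pageText, 10)) {
            System.out.println("Little or no text found; the page may be an image.");
        } else {
            System.out.println("Page contains extractable text.");
        }
    }
}
```

A page with only a short header or page number can still fall under the threshold, so treat the result as a hint rather than a verdict.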
Thanks user1003432, in fact I'm using that one, seems quite good.
I'm using the PDFTextStripperByArea class: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripperByArea.html
Its getTextForRegion() method returns a String.
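For anyone landing here, a minimal sketch of how that class is typically used. It assumes PDFBox 2.x, where the class is in org.apache.pdfbox.text rather than the org.apache.pdfbox.util package linked above. The class RegionText, the helper textInRegion, the region name "target" and the rectangle values are made up for illustration. As far as I know, the rectangle is given in points with the origin at the top-left of the page.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Hypothetical helper class, assuming PDFBox 2.x on the classpath.
class RegionText {

    // Returns the text found inside the given rectangle of one page.
    static String textInRegion(PDPage page, Rectangle2D region) throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        stripper.addRegion("target", region);   // register a named region
        stripper.extractRegions(page);          // process the page once
        return stripper.getTextForRegion("target");
    }

    // Usage: java RegionText input.pdf
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            PDPage firstPage = document.getPage(0);
            // Illustrative region: a 300 x 100 pt box at the top-left corner.
            Rectangle2D region = new Rectangle2D.Double(0, 0, 300, 100);
            System.out.println(textInRegion(firstPage, region));
        }
    }
}
```

Several regions can be registered on the same stripper before calling extractRegions, each retrieved afterwards by its own name.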
For me, I usually extract text from the PDF using this method:
public void Pdf Processor Extract TextPage(string PDFInputFile, int PDFPageNumberStart,
int PDFPageNumberStop, string PDFOutputFile);
You can give it a try. Hope this helps.
Just a side observation:
If the PDF file is read-only (for example, its security settings disallow content extraction), you may not be able to extract the data you want.
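With PDFBox you can check this up front. The sketch below assumes PDFBox 2.x and that "read-only" here means the document's permissions forbid content extraction. The class and method names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;

// Hypothetical helper class, assuming PDFBox 2.x on the classpath.
class ExtractionPermissionCheck {

    // True when the document's current permissions allow text extraction.
    static boolean canExtractText(PDDocument document) {
        AccessPermission permission = document.getCurrentAccessPermission();
        return permission.canExtractContent();
    }

    // Usage: java ExtractionPermissionCheck input.pdf
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.println("Encrypted: " + document.isEncrypted());
            System.out.println("Extraction allowed: " + canExtractText(document));
        }
    }
}
```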