9 Replies Latest reply: Sep 19, 2007 12:23 AM by 807592

    Read from PDF document

    807587
      Is it possible to read from a PDF document?
      For example, a PDF form document with Name and Phone number fields.

      Can I get the values of Name and Phone number in my program? I could not find any help online on this topic. If possible, can somebody point me to some good articles or a piece of example code?
      Thanks
      ASB
        • 1. Re: Read from PDF document
          DrClap
          That's one of the questions here:

          http://www.lowagie.com/iText/faq.html
          • 2. Re: Read from PDF document
            807587
            Yes, I have used iText to create PDF documents. But now I want to extract information from PDFs, and I am looking for a library that can help me do that. I have found a commercial library (PDFTextStream), but I am looking for something that is free.

            Thanks
            ASB
            • 3. Re: Read from PDF document
              gimbal2
              I have had some success with PDFBox

              http://www.pdfbox.org/

              but it threw an exception on PDFs that were most likely created on a Mac and so contained a font that is not in the Windows font set (this is only a guess). Still, for most PDFs this API works like a charm. Text-only PDFs, of course.
              • 4. Re: Read from PDF document
                807587
                Thanks
                • 5. Re: Read from PDF document
                  807587
                  Hey, I am going to be using PDFBox very soon to search for text within PDF documents. Could you give me some sample code that would give me a head start on searching for text within PDF files? Looking forward to collaborating with you on this.
                  • 6. Re: Read from PDF document
                    807587
                    Here is an easy starting point for PDFBox.
                    import java.io.IOException;
                    import java.io.StringWriter;

                    import org.pdfbox.pdmodel.PDDocument;
                    import org.pdfbox.util.PDFTextStripper;

                    public class PdfReader {

                        // Extracts the text of every page of the given PDF file.
                        public String getPdfText(String fileName) throws IOException {
                            StringWriter sw = new StringWriter();
                            PDDocument doc = null;

                            try {
                                doc = PDDocument.load(fileName);

                                PDFTextStripper stripper = new PDFTextStripper();
                                // Pages are 1-based; MAX_VALUE means "through the last page".
                                stripper.setStartPage( 1 );
                                stripper.setEndPage( Integer.MAX_VALUE );

                                stripper.writeText(doc, sw);
                            } finally {
                                // Always release the document, even if extraction fails.
                                if (doc != null) {
                                    doc.close();
                                }
                            }

                            return sw.toString();
                        }
                    }
                    Eugenio
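
For the text search asked about in reply 5, one simple option is to search the string returned by getPdfText with plain Java string operations. The sketch below assumes the text has already been extracted; TextSearch and findAll are hypothetical names used for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox class): searches text that has already
// been extracted from a PDF, e.g. the string returned by getPdfText.
final class TextSearch {

    // Returns the start offset of every case-insensitive occurrence of term
    // in text. Overlapping occurrences are reported as well.
    static List<Integer> findAll(String text, String term) {
        List<Integer> hits = new ArrayList<Integer>();
        if (text == null || term == null || term.isEmpty()) {
            return hits;
        }
        int last = text.length() - term.length();
        for (int i = 0; i <= last; i++) {
            // regionMatches with ignoreCase=true compares without copying the text
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample text standing in for the output of getPdfText.
        String text = "Name: Alice\nPhone number: 555-0100\n";
        System.out.println(findAll(text, "phone")); // prints [12]
    }
}
```

This scans the whole extracted string in memory, which should be fine for ordinary documents; for very large PDFs it may be better to extract and search one page range at a time with setStartPage and setEndPage.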
                    • 7. Re: Read from PDF document
                      807587
                      PDF files are largely plain-text structures laid out in a specific way, although the content streams inside them are often compressed. You can find the spec at:

                      http://partners.adobe.com/public/developer/pdf/index_reference.html

                      It's a little convoluted, but if you take some time to read and digest the spec, and pick apart an existing PDF file along the way, you should be able to develop some classes to do what you need.

                      I know this is doable because I was able to create a PDF file containing text and some line graphics this way.
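
As a first step in picking apart an existing file, the header is easy to inspect: a PDF starts with the marker %PDF- followed by a version number such as 1.4. The following is only a minimal sketch of reading that header with the standard library; PdfHeader and version are hypothetical names, and it does not parse anything beyond the first line.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: inspects only the first bytes of a file.
final class PdfHeader {

    // Returns the version that follows "%PDF-" (for example "1.4"),
    // or null if the bytes do not start with a PDF header.
    static String version(byte[] head) {
        String s = new String(head, 0, Math.min(head.length, 16),
                StandardCharsets.ISO_8859_1);
        if (!s.startsWith("%PDF-")) {
            return null;
        }
        int end = 5; // first character after "%PDF-"
        while (end < s.length()
                && (Character.isDigit(s.charAt(end)) || s.charAt(end) == '.')) {
            end++;
        }
        return end > 5 ? s.substring(5, end) : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfHeader <file.pdf>");
            return;
        }
        byte[] head = new byte[16];
        try (InputStream in = new FileInputStream(args[0])) {
            int n = in.read(head); // may return fewer than 16 bytes
            System.out.println(version(n < 0 ? new byte[0] : Arrays.copyOf(head, n)));
        }
    }
}
```

Anything past the header (objects, cross-reference table, compressed streams) takes considerably more work, which is why a library such as PDFBox is usually the more practical route for text extraction.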
                      • 8. Re: Read from PDF document
                        807592
                        I can't run the above code.
                        I am getting the error shown below.
                        Please help me.
                        ---------- java ----------
                        Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
                             at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
                             at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
                             at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
                             at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
                             at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
                             at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
                             at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
                             at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
                             at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
                             at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
                             at PDFReader.getPdfText(PDFReader.java:20)
                             at Reader.main(Reader.java:16)

                        Output completed (0 sec consumed) - Normal Termination
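
The NoClassDefFoundError above names org/fontbox/afm/FontMetric, which most likely means the FontBox jar that this PDFBox build depends on is not on the runtime classpath. A small diagnostic like the one below can confirm whether the class is visible before running the extraction; ClasspathCheck and isAvailable are hypothetical names, and the only detail taken from the trace is the class name.

```java
// Diagnostic sketch: reports whether a class can be found on the classpath
// that loaded this program.
final class ClasspathCheck {

    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but could not be linked (e.g. a missing dependency)
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace above.
        String name = "org.fontbox.afm.FontMetric";
        System.out.println(name + " available: " + isAvailable(name));
    }
}
```

If this reports false, adding the FontBox jar next to the PDFBox jar on the classpath (for example with java -cp) would be the first thing to try.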