5 Replies Latest reply: Oct 3, 2007 5:08 PM by 807605

    extracting text from pdf file

    807605
      Hi All
      I want to extract only the text from a PDF file.
      I am trying to extract the text using PDFBox, but I am getting an error. My code is like this:
      /*
      * Main.java
      *
      * Created on den 10 september 2007, 23:01
      *
      * To change this template, choose Tools | Template Manager
      * and open the template in the editor.
      */

      package extracttext;
      import org.pdfbox.exceptions.InvalidPasswordException;
      import org.pdfbox.pdmodel.PDDocument;
      import org.pdfbox.util.PDFTextStripper;
      //import java.awt.Rectangle;
      //import java.util.List;
      import org.pdfbox.pdmodel.PDPage;


      /**
      *
      * @
      */
      public class Main {

          /** Creates a new instance of Main */
          public Main() {
          }

          /**
           * @param args the command line arguments
           */
          public static void main( String[] args ) throws Exception
          {
              int startPage = 1;
              int endPage = Integer.MAX_VALUE;

              PDDocument document = null;
              try
              {
                  document = PDDocument.load( "C:\\thesis\\fileread\\sim.pdf" );
                  if( document.isEncrypted() )
                  {
                      try
                      {
                          document.decrypt( "" );
                      }
                      catch( InvalidPasswordException e )
                      {
                          System.err.println( "Error: Document is encrypted with a password." );
                          System.exit( 1 );
                      }
                  }
                  PDFTextStripper stripper = new PDFTextStripper();
                  stripper.setSortByPosition( true );
                  stripper.setStartPage( startPage );
                  stripper.setEndPage( endPage );
                  System.out.println("Text: " + stripper.getText(document));
              }
              finally
              {
                  if( document != null )
                  {
                      document.close();
                  }
              }
          }
      }




      Can anybody please help me solve this problem?

      Regards,
      UK
        • 1. Re: extracting text from pdf file
          DrClap
          Sure. You should start by reading the error message. If it's a stack trace, it will tell you exactly which line of code had the error. If you want to ask a good question don't hesitate to do that.
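To illustrate the advice about reading a stack trace, here is a minimal sketch that walks the frames of an exception and picks out the first one that belongs to your own code. The class `StackTraceDemo` and the helpers `firstFrameIn` and `parsePageNumber` are hypothetical names made up for this example; they have nothing to do with PDFBox.

```java
public class StackTraceDemo {

    /**
     * Returns the first frame of the trace whose class name starts with the
     * given prefix, or null if there is none. Frames are ordered from the
     * point of failure outwards, so this is the innermost line in your own code.
     */
    public static StackTraceElement firstFrameIn(Throwable t, String classPrefix) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().startsWith(classPrefix)) {
                return frame;
            }
        }
        return null;
    }

    /** Hypothetical helper that fails on purpose so there is a trace to inspect. */
    public static int parsePageNumber(String text) {
        return Integer.parseInt(text); // throws NumberFormatException for non-numeric input
    }

    public static void main(String[] args) {
        try {
            parsePageNumber("not-a-number");
        } catch (NumberFormatException e) {
            StackTraceElement own = firstFrameIn(e, StackTraceDemo.class.getName());
            System.out.println("Error: " + e);
            System.out.println("First frame in our code: " + own);
        }
    }
}
```

The topmost frames of this trace sit inside the JDK; the frame returned by `firstFrameIn` is the one that points at the line you would actually look at first.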
          • 2. Re: extracting text from pdf file
            807605
            Hi
            Thanks for your reply.
            I have the code. If you could please tell me where to modify it, that would be helpful.
            • 3. Re: extracting text from pdf file
              807605
              > I have the code.

              Presumably, you also have the stack trace and exact error message. Please read it and try to understand it. If you find it confusing, post the full text of the error message and stack trace here, and we can point you in the right direction.

              Note that the "right direction" may well be to contact pdfbox.org if the problem lies with their classes.

              ~
              • 4. Re: extracting text from pdf file
                807605
                I get the following error message:

                Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
                at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
                at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
                at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
                at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
                at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
                at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
                at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
                at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
                at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
                at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
                at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
                at extracttext.Main.main(Main.java:55)
                Java Result: 1
                BUILD SUCCESSFUL (total time: 1 second)


                I would appreciate it if you could help me write a Java program that extracts only the text from a PDF file.
                • 5. Re: extracting text from pdf file
                  807605
                  Hi OAS,

                  Just download the FontBox project, available at http://sourceforge.net/projects/fontbox/, and add its JAR (fontbox-0.1.0.jar) to your project's classpath.

                  Héctor
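A `NoClassDefFoundError` like the one above generally means the class was there at compile time but cannot be found at run time. As a rough sketch, a check like the following can confirm whether the FontBox JAR is actually on the classpath before PDFBox is called. The class `ClasspathCheck` and its method `isClassAvailable` are hypothetical names for this example; the class name being looked up is the one from the stack trace in reply 4.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the NoClassDefFoundError in the stack trace.
        String needed = "org.fontbox.afm.FontMetric";
        if (isClassAvailable(needed)) {
            System.out.println(needed + " found on the classpath");
        } else {
            System.out.println(needed + " is missing; add the FontBox JAR to the classpath");
        }
        // Shows which JARs and directories the JVM was actually started with.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

If the check reports the class as missing even after the JAR has been added in the IDE, the printed `java.class.path` value is the place to see whether the run configuration really picked it up.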