5 Replies Latest reply: Oct 3, 2007 5:08 PM by 807605

extracting text from pdf file

807605 Newbie
Hi all,
I want to extract only the text from a PDF file.
I am trying to do this with PDFBox, but I am getting an error. My code is as follows:
/*
* Main.java
*
* Created on den 10 september 2007, 23:01
*
* To change this template, choose Tools | Template Manager
* and open the template in the editor.
*/

package extracttext;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
//import java.awt.Rectangle;
//import java.util.List;
import org.pdfbox.pdmodel.PDPage;


/**
*
* @
*/
public class Main {

    /** Creates a new instance of Main */
    public Main() {
    }

    /**
     * @param args the command line arguments
     */
    public static void main( String[] args ) throws Exception
    {
        int startPage = 1;
        int endPage = Integer.MAX_VALUE;


        PDDocument document = null;
        try
        {
            document = PDDocument.load( "C:\\thesis\\fileread\\sim.pdf" );
            if( document.isEncrypted() )
            {
                try
                {
                    document.decrypt( "" );
                }
                catch( InvalidPasswordException e )
                {
                    System.err.println( "Error: Document is encrypted with a password." );
                    System.exit( 1 );
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition( true );
            stripper.setStartPage( startPage );
            stripper.setEndPage( endPage );
            System.out.println("Text: " + stripper.getText(document));



        }
        finally
        {
            if( document != null )
            {
                document.close();
            }
        }
    }
}




Can anybody please help me solve this problem?

Regards,
UK
  • 1. Re: extracting text from pdf file
    DrClap Expert
    Sure. You should start by reading the error message. If it's a stack trace, it will tell you exactly which line of code had the error. If you want to ask a good question don't hesitate to do that.
  • 2. Re: extracting text from pdf file
    807605 Newbie
    Hi,
    Thanks for your reply.
    I have the code. If you can please tell me where to modify it, that would be helpful.
  • 3. Re: extracting text from pdf file
    807605 Newbie
    > I have the code.

    Presumably, you also have the stack trace and exact error message. Please read it and try to understand it. If you find it confusing, post the full text of the error message and stack trace here, and we can point you in the right direction.

    Note that the "right direction" may well be to contact pdfbox.org if the problem lies with their classes.

    ~
  • 4. Re: extracting text from pdf file
    807605 Newbie
    I get the following error message:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/afm/FontMetric
    at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:334)
    at org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(PDSimpleFont.java:104)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:336)
    at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:80)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at extracttext.Main.main(Main.java:55)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 1 second)


    I would appreciate it if you could please help me write a Java program that extracts only the text from a PDF file.
  • 5. Re: extracting text from pdf file
    807605 Newbie
    Hi OAS,

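    The NoClassDefFoundError for org/fontbox/afm/FontMetric means the FontBox classes that PDFBox relies on are not on your classpath at run time. Just download the FontBox project, available at http://sourceforge.net/projects/fontbox/, and add its JAR (fontbox-0.1.0.jar) to your project's classpath.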
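
    If you want to confirm that the JAR is really being picked up at run time, a small check along these lines may help. It is only a sketch: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and the only name taken from this thread is org.fontbox.afm.FontMetric, from the stack trace.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class (or something it needs in order to load) is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Class named in the NoClassDefFoundError from the stack trace
        String fontMetric = "org.fontbox.afm.FontMetric";

        // Show what the JVM was actually started with
        System.out.println("Classpath: " + System.getProperty("java.class.path"));
        System.out.println(fontMetric + " found: " + isOnClasspath(fontMetric));
    }
}
```

    Run it with the same classpath settings as your project. If the last line reports false, the FontBox JAR is still not visible to the JVM, and the classpath printed above it should show why.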

    Héctor