6 Replies Latest reply: Mar 26, 2012 5:58 AM by user12240205

    How to read a PDF from within PL/SQL?

    user12240205
      We have a requirement like this: we have a Time Tracking and Project Monitoring System whose database is Oracle 10g R2. We want to automate our "Meeting Minutes" processing.

      -- The project lead will write the minutes into a PDF form,
      -- i.e. a PDF where people can type information into fields.
      -- The PDF will be e-mailed to the project manager.
      -- The project manager will save the PDF in a directory on the hard disk.
      -- Then he will run a program.
      -- The program will pick up all the PDFs in the directory, one by one.
      -- For each PDF, the program should read the fields.
      -- The values read will be inserted into the DB.

      Now, my problem is, HOW ON EARTH TO READ A "PDF"???

      One option is to CONVERT the PDF to XML and then read the XML. BUT unfortunately, after googling for over 4 hours I came up empty on converting PDF to XML.

      How can we do this???

      Any help would be greatly appreciated.
        • 1. Re: How to read a PDF from within PL/SQL?
          Billy~Verreynne
          PDF is not structured data - it is formatted text (unstructured) data.

          So I would say that the basic concept of using PDF as a data entry format for structured data makes as much sense as fitting doors to a motorcycle.

          The basic process you describe is known as a workflow. Workflow systems have been around since the mid-1990s. Nothing new.

          So either a workflow system is needed, or an Apex web application that supports the basic processing steps (using database entry web forms instead of PDF) can be put together in a couple of hours with minimal effort (and little cost, as Oracle Apex is free).
          • 2. Re: How to read a PDF from within PL/SQL?
            rp0428
            >
            after googling for over 4 hours I came up empty on converting PDF to XML.
            >
            Glad I'm not paying you by the hour! You might need to switch to decaf.

            It only took me a few seconds, searching for 'convert pdf to xml'.
            Right there on the first page was a link on how to use Adobe Acrobat to do the conversion:

            http://www.ehow.com/how_5806567_convert-xml-using-adobe-acrobat.html

            So then you search for 'adobe acrobat convert pdf to xml' and it's amazing - there are links to the Adobe site. One of them is about an XML Plug-in for Windows:
            http://www.adobe.com/support/downloads/detail.jsp?ftpID=1209

            And an Adobe forum question about it:
            http://forums.adobe.com/thread/339814

            Maybe try searching again after the caffeine wears off a little?

            If it can be done with PDF, Adobe has the tools to do it.
            • 3. Re: How to read a PDF from within PL/SQL?
              MichaelS
              >
              Now, my problem is, HOW ON EARTH TO READ A "PDF"???
              >
              One option is Oracle Text, as shown e.g. in the thread "Re: Read from a file". Though again you'll be confronted with the problem of how to structure unstructured data, unless the resulting plain text file is quite easy to parse ...
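
              Whether the plain text is "quite easy to parse" depends on the form layout. A minimal sketch in Python, assuming (nothing in this thread confirms it) that the filtered text comes out as one `Label: value` pair per line. The sample text, the labels and the function name `parse_labelled_lines` are all hypothetical:

```python
import re

# Hypothetical sample of what a text filter might return for a filled-in form.
SAMPLE_TEXT = """Meeting Minutes
Project: Time Tracking
Date: 2012-03-26
Attendees: Lead, Manager
"""

# One "Label: value" pair per line; the label itself may not contain a colon.
LINE_PATTERN = re.compile(r"^\s*([^:\n]+?)\s*:\s*(.*?)\s*$")


def parse_labelled_lines(text):
    """Return a dict of label -> value for every line shaped like 'Label: value'."""
    fields = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    for label, value in parse_labelled_lines(SAMPLE_TEXT).items():
        print(label, "=", value)
```

              Lines without a colon (such as the title) are skipped. Real filter output is rarely this regular, since labels and values may land on separate lines, and that is exactly the structuring problem described above.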
              • 4. Re: How to read a PDF from within PL/SQL?
                Hans Forbrich
                user12240205 wrote:
                we have a requirement like this: We have a Time Tracking and Project Monitoring System whose DB is a Oracle 10g R2. We want to automate our ''Meeting Minute'' processing.
                I'd probably investigate Oracle Text, the free document/text indexing capability of the database. It is described at http://docs.oracle.com/cd/E11882_01/text.112/e24435/ind.htm#i1004902

                But that's just me ...
                • 5. Re: How to read a PDF from within PL/SQL?
                  MichaelS
                  >
                  But that's just me ...
                  >
                  Not just you ;). See my previous post.
                  • 6. Re: How to read a PDF from within PL/SQL?
                    user12240205
                    Oracle Text works. I tried it out. We can get the PDF content out as it is. The problem is that we get only raw text, without any structure, so we can't do anything with the data. The PDF is a fillable PDF form, so we need to get the answers to the questions in the PDF. Is there any way to do this?
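
                    One possible route, not confirmed by anyone in this thread: Acrobat can export the data of a fillable form as XFDF, an XML format that keeps each field name together with its value. Reading that XML is then straightforward. The Python sketch below assumes the export has already been done; the sample document, the field names and the function name `read_xfdf_fields` are hypothetical:

```python
import xml.etree.ElementTree as ET

# XFDF elements live in this XML namespace.
XFDF_NS = "{http://ns.adobe.com/xfdf/}"

# Hypothetical export of a filled-in meeting-minutes form.
SAMPLE_XFDF = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <fields>
    <field name="project"><value>Time Tracking</value></field>
    <field name="meeting_date"><value>2012-03-26</value></field>
    <field name="actions">
      <field name="item1"><value>Review schedule</value></field>
    </field>
  </fields>
</xfdf>"""


def read_xfdf_fields(xfdf_text):
    """Return a dict of field name -> value; nested fields get dotted names."""
    root = ET.fromstring(xfdf_text)
    result = {}

    def walk(parent, prefix):
        # Only direct <field> children, so nesting is preserved in the name.
        for field in parent.findall(XFDF_NS + "field"):
            name = prefix + field.get("name", "")
            value = field.find(XFDF_NS + "value")
            if value is not None:
                result[name] = value.text or ""
            walk(field, name + ".")

    fields = root.find(XFDF_NS + "fields")
    if fields is not None:
        walk(fields, "")
    return result


if __name__ == "__main__":
    for name, value in read_xfdf_fields(SAMPLE_XFDF).items():
        print(name, "=", value)
```

                    The resulting name/value pairs could then be inserted into the database by whatever program processes the directory.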