Extract data from a scanned (image-based) PDF file

Jun 17, 2025 5:44PM

Summary:

Use Case:

The user uploads a file from the front end.
Based on the user’s instructions, specific details are extracted from the document.
The extracted details are stored in corresponding table columns, where they can be reviewed and processed further by the user.

Current Approach:

| Step | Description |
|------|------------------------------------------------------------|
| 1 | **Start** |
| 2 | **Upload PDF file** |
| 3 | **Extract content and instructions** |
| 4 | **Convert to vector embeddings** |
| 5 | **Store in 23ai vector database** |
| 6 | **Perform vector search** |
| 7 | **Pass relevant vectors to LLM** |
| 8 | **LLM refines the result** |
| 9 | **Output final refined result** |
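
The flow in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` is a hypothetical stand-in for the 23ai vector database, and `build_prompt` only assembles the text that would be sent to the LLM (no LLM is called). All names here are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Step 3: split extracted text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Step 4 (toy stand-in): L2-normalized bag-of-words vector as a sparse dict.

    A real pipeline would call an embedding model here instead.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


class InMemoryVectorStore:
    """Steps 5 and 6 (hypothetical stand-in for the vector database)."""

    def __init__(self):
        self.rows = []  # list of (chunk, vector) pairs

    def add(self, chunk):
        self.rows.append((chunk, embed(chunk)))

    def search(self, query, top_k=2):
        """Return the top_k chunks ranked by cosine similarity to the query."""
        q = embed(query)
        scored = [
            (sum(w * vec.get(tok, 0.0) for tok, w in q.items()), chunk)
            for chunk, vec in self.rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:top_k]]


def build_prompt(instruction, chunks):
    """Step 7: combine the user's instruction with the retrieved chunks."""
    context = "\n---\n".join(chunks)
    return (
        f"Instruction: {instruction}\n\n"
        f"Context:\n{context}\n\n"
        "Answer using only the context."
    )


if __name__ == "__main__":
    store = InMemoryVectorStore()
    for chunk in chunk_text("Invoice number INV-1001 issued to Acme", max_words=40):
        store.add(chunk)
    hits = store.search("invoice number", top_k=1)
    print(build_prompt("Extract the invoice number", hits))
```

Note that every step after the upload depends on step 3 returning usable text; if extraction yields nothing, there is nothing to embed or search.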

This approach works well when the PDF is text-based.
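
One way to tell the two cases apart before running the pipeline is to check how much text the extractor actually returns per page. The sketch below is a rough heuristic, not part of our current code: `looks_scanned` is a hypothetical helper, `page_texts` is assumed to be the per-page output of whatever PDF text extractor is in use, and the 20-character and 50% thresholds are arbitrary assumptions that would need tuning.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-based from its per-page extracted text.

    page_texts: list of strings (or None), one entry per page, as returned
    by a PDF text extractor. Returns True when more than half of the pages
    yield fewer than min_chars_per_page non-whitespace-trimmed characters.
    Both thresholds are illustrative assumptions.
    """
    if not page_texts:
        return True  # nothing extracted at all
    sparse_pages = sum(
        1 for text in page_texts if len((text or "").strip()) < min_chars_per_page
    )
    return sparse_pages / len(page_texts) > 0.5
```

A document flagged this way has no usable text layer for step 3, so it would need different handling from a text-based PDF.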

Obstacle:

We’ve encountered a significant limitation in processing PDF documents, for the following reasons: