What is the optimal way to read huge amounts of data from an Oracle DB?

User_CK8A6 Member Posts: 5 Green Ribbon

In our production DB, there are txn tables that have more than 500 million records. I am trying to read those records into a Python pandas DataFrame using pandas.read_sql, and I also tried pandas.read_sql_query, but both options are very slow. Can you suggest an optimal way to do this task? The data is needed for further processing.
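
For context, here is a simplified sketch of the kind of read I am doing, together with the chunked variant I am considering (connection details and the table name are made up):

    import pandas as pd
    import oracledb  # python-oracledb; cx_Oracle would look almost the same

    # Connection details are placeholders, not the real ones.
    conn = oracledb.connect(user="app", password="***", dsn="dbhost/PRODPDB")

    # chunksize makes read_sql return an iterator of DataFrames instead of
    # one 500-million-row frame that has to fit in memory all at once.
    total = 0
    for chunk in pd.read_sql("SELECT * FROM txn_table", conn, chunksize=100_000):
        total += len(chunk)  # stand-in for the real per-chunk processing
    print(total, "rows read")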


Thanks,

Manoj Kumar

Answers

  • BluShadow Member, Moderator Posts: 41,609 Red Diamond

    Define "very slow".

    How big are the "records"? Are they being queried directly on the server, or is the data being pulled over a network (local or remote?), and if so, how fast is that network (a piece of string? optical fibre?)? What is the query - is it complex or simple? Does it have conditions on it? Are there appropriate indexes? 500 million rows is a lot of data, so what is considered a reasonable amount of time for querying it? In fact, why query it ALL into another structure/data store anyway? Why not perform whatever analysis, data processing etc. you want directly on the Oracle database?

    In other words, we need a better explanation of what you're trying to do, why you're doing it, and how you're going about it.

  • User_CK8A6 Member Posts: 5 Green Ribbon

    Hi,

    The records have ONLY 6 columns, some NUMBER and some VARCHAR2. This runs over a 1 Gbps LAN. There is no condition in the query; it is just a SELECT * FROM the table. I am trying to build a data pipeline, or you could say ETL, for which I need to read the raw data, perform some transformations such as calculating new columns based on business logic, and finally move the records into another Oracle DB for use by another system. As of now I am stuck at the extract step, which takes around 1.3 to 2 hours to finish, so I am wondering whether there is a better way to do this exercise.
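
    Roughly, the shape of the pipeline I am after is below (the driver choice, table and column names, and the transformation are all invented for illustration):

        import pandas as pd
        import oracledb  # assuming python-oracledb on both ends

        # Source and target connection details are placeholders.
        src = oracledb.connect(user="src", password="***", dsn="srchost/SRCPDB")
        tgt = oracledb.connect(user="tgt", password="***", dsn="tgthost/TGTPDB")

        insert_sql = """INSERT INTO txn_out (id, amount, category, amount_with_tax)
                        VALUES (:1, :2, :3, :4)"""

        for chunk in pd.read_sql("SELECT id, amount, category FROM txn_in",
                                 src, chunksize=100_000):
            # Transform: derive a new column from (made-up) business logic.
            chunk["amount_with_tax"] = chunk["amount"] * 1.18

            # Load: executemany sends the whole chunk in one round trip.
            with tgt.cursor() as cur:
                cur.executemany(insert_sql, chunk.values.tolist())
            tgt.commit()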


    Thanks,

    Manoj Kumar

  • User_H3J7U Member Posts: 810 Silver Trophy

    Check the statistics via SQL*Plus:

    -- fetch 1000 rows per round trip instead of the default 15
    set arraysize 1000
    -- run the query and show the plan and statistics, without printing the rows
    set autotrace traceonly
    -- report the elapsed time
    set timing on
    <... your query ...>
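
    The same knob exists on the driver side. A minimal sketch with python-oracledb / cx_Oracle (connection details are placeholders; the sizes are starting points, not tuned values):

        import oracledb

        conn = oracledb.connect(user="app", password="***", dsn="dbhost/PRODPDB")
        cur = conn.cursor()

        # arraysize = rows per fetch round trip (python-oracledb default: 100);
        # prefetchrows lets the execute round trip itself carry back rows.
        cur.arraysize = 5000
        cur.prefetchrows = 5001

        cur.execute("SELECT * FROM txn_table")
        while True:
            rows = cur.fetchmany()  # returns up to cur.arraysize rows per call
            if not rows:
                break
            # ... hand rows to the processing step ...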
    


  • Billy Verreynne Software Engineer Member Posts: 28,638 Red Diamond
    edited May 13, 2021 1:07PM

    Moving/reading/writing lots of data structures (rows) in serial is only as fast as the slowest moving part - which typically is reading/writing to spinning rust (disk) or copper (network).

    This speed can be improved by using SSDs, faster Ethernet with less jitter and latency, fewer and larger frames, and so on. Usually a limited set of options, some of them pretty costly.

    This is no fault of Oracle. It is directly related to the performance of the hardware and network infrastructure used.

  • Sergei Krasnoslobodtsev Member Posts: 495 Silver Badge
    edited May 14, 2021 9:41AM

    Hi.

    According to what you describe, you have an infrastructure problem.

    First of all, you need to determine the sizing (memory, CPU, disks, LAN) of your ETL server and upgrade it accordingly.

    As for processing the data stream programmatically, the steps to improve performance can be as follows: multithreaded source-sink reads/writes, tuning driver parameters, and so on - see the sketch at the end of this post.

    You also need to pay attention to the libraries that you use, with your tasks in mind.

    If you are building an industrial solution (production-grade, scalable, etc.), then an alternative to a self-written solution is to use one of the existing ETL tools, chosen according to your tasks.

    N.B. Your problem is complex, and it needs to be solved step by step, according to the problem points you identify.
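
    As a sketch of the multithreaded-read idea (assuming python-oracledb, a numeric key to partition on, and an even 0..500M key range - all invented for illustration):

        import concurrent.futures
        import oracledb

        # A pool, so every worker thread gets its own session;
        # credentials and DSN are placeholders.
        pool = oracledb.create_pool(user="app", password="***",
                                    dsn="dbhost/PRODPDB", min=4, max=4)

        def read_range(lo, hi):
            # Each worker scans its own disjoint key range in batches.
            with pool.acquire() as conn:
                with conn.cursor() as cur:
                    cur.arraysize = 10_000
                    cur.execute("SELECT id, amount, category FROM txn_in "
                                "WHERE id >= :lo AND id < :hi", lo=lo, hi=hi)
                    total = 0
                    while batch := cur.fetchmany():
                        total += len(batch)  # stand-in for transform/write of the batch
                    return total

        step = 125_000_000
        ranges = [(lo, lo + step) for lo in range(0, 500_000_000, step)]
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
            counts = list(ex.map(lambda r: read_range(*r), ranges))
        print(sum(counts), "rows read in parallel")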

  • BluShadow Member, Moderator Posts: 41,609 Red Diamond

    Even consider whether you need an ETL tool at all. If you're taking data from one Oracle database to another, and doing some processing on the way, can that processing be done directly within Oracle itself? Very likely. SQL is a powerful language. That would cut out the middle ETL server: you could pull the data directly from the target Oracle database and have it generate the additional columns etc. as you need. A sketch of that approach is below.
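
    For example, if a database link exists between the two databases (the link name, table and column names here are invented), the whole move becomes one set-based statement, runnable as plain SQL or driven from Python:

        import oracledb

        # Placeholders throughout; "src_link" is an assumed database link
        # from the target DB back to the source DB.
        conn = oracledb.connect(user="tgt", password="***", dsn="tgthost/TGTPDB")
        with conn.cursor() as cur:
            # One statement: Oracle reads, transforms and writes, and no
            # rows ever cross the network to a Python process.
            cur.execute("""
                INSERT /*+ APPEND */ INTO txn_out (id, amount, category, amount_with_tax)
                SELECT id, amount, category, amount * 1.18
                FROM   txn_in@src_link
            """)
        conn.commit()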