This discussion is archived
2 Replies Latest reply: Jun 18, 2012 8:39 PM by 574272 RSS

Can I use OLH from insideJava code ?

Xavi Newbie
Currently Being Moderated
Hello I have a first map reduce job which splits a file in <contracts> and each contract is stored in HDFS , but each contract have sub tags like this.

I have this XML..
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<!--contract 1 -->
<contract>
<src_cont_id>312</src_cont_id>
<cont_src>55787-3</cont_src>
<cpny_cd_id>A833</cpny_cd_id>
<sttl_stat>0</sttl_stat>
<start_dt>2011-09-16</start_dt>
<trade_dt>2011-09-16</trade_dt>
<global_pdt>10800</global_pdt>
<pdt_ifrs>10800</pdt_ifrs>
<cpty_l_cpty_id>0755</cpty_l_cpty_id>
<cpty_l_cpty_id_src>55787-3</cpty_l_cpty_id_src>
<booking_l_cpty_id>0755</booking_l_cpty_id>
<booking_l_cpty_id_src>55787-3</booking_l_cpty_id_src>
<ownership_cd>8453</ownership_cd>
<profit_centre>NLSL</profit_centre>
<!--Añado las 2 siguientes lineas para agregar fac_src y src_fac_id -->
<fac_src>55787-3</fac_src>
<src_fac_id>99999</src_fac_id>

<leg>

<leg_id>1</leg_id>

<reset_dt_roll_cnvn>1</reset_dt_roll_cnvn>
<curr_notl_amnt>6666.00000</curr_notl_amnt>
<curr_notl_amnt_ccy>EUR</curr_notl_amnt_ccy>

<leg_ccy>EUR</leg_ccy>
<orig_notl_amnt>6666.00000</orig_notl_amnt>
<orig_notl_amnt_ccy>EUR</orig_notl_amnt_ccy>
<pay_receive_ind>R</pay_receive_ind>
<cash_sttl_ind>N</cash_sttl_ind>
<sttl_ccy>EUR</sttl_ccy>
<sttl_dt>2011-09-20</sttl_dt>
<spot_prc>1.00000</spot_prc>
<spot_prc_ccy>EUR</spot_prc_ccy>
<itrst_rate>0.0000000</itrst_rate>
<itrst_rate_type_src>55787-3</itrst_rate_type_src>
<spread_rate>0.00000</spread_rate>
<ref_rate>0.0000000</ref_rate>
<itrl_funding_rate>0.0000000</itrl_funding_rate>
<itrl_fx_prc>1.00000</itrl_fx_prc>

</leg>
<leg>

<leg_id>2</leg_id>
<reset_dt_roll_cnvn>1</reset_dt_roll_cnvn>
<curr_notl_amnt>3333.00000</curr_notl_amnt>
..........................................................................................
..........................................................................................
</contract>
<!--contract 2 -->
<contract>

<src_cont_id>324</src_cont_id>
<cont_src>55787-3</cont_src>
<cpny_cd_id>A833</cpny_cd_id>
<sttl_stat>0</sttl_stat>
<start_dt>2011-09-19</start_dt>
<trade_dt>2011-09-19</trade_dt>
<global_pdt>10800</global_pdt>
<pdt_ifrs>10800</pdt_ifrs>
<cpty_l_cpty_id>0755</cpty_l_cpty_id>
<cpty_l_cpty_id_src>55787-3</cpty_l_cpty_id_src>
<booking_l_cpty_id>0755</booking_l_cpty_id>
<booking_l_cpty_id_src>55787-3</booking_l_cpty_id_src>
<ownership_cd>8451</ownership_cd>
<profit_centre>NLSL</profit_centre>

<!--Añado las 2 siguientes lineas para agregar fac_src y src_fac_id -->
<fac_src>55787-3</fac_src>
<src_fac_id>100000</src_fac_id>

<leg>

<leg_id>1</leg_id>
<reset_dt_roll_cnvn>1</reset_dt_roll_cnvn>
<curr_notl_amnt>5555.00000</curr_notl_amnt>
<curr_notl_amnt_ccy>EUR</curr_notl_amnt_ccy>
<leg_ccy>EUR</leg_ccy>
<orig_notl_amnt>5555.00000</orig_notl_amnt>
<orig_notl_amnt_ccy>EUR</orig_notl_amnt_ccy>
<pay_receive_ind>R</pay_receive_ind>
.......

Every subtag of the contract tag must be stored in different tables...(oracle database release 11). I'm looking for the best way to do this.Any suggestions?

As OLH only can load one table...I don't know how perform this action....Could I use OLH from inside Java Code sending one XML?
  • 1. Re: Can I use OLH from insideJava code ?
    251301 Explorer
    Currently Being Moderated
    There are two problems to solve here. The first is to process the input XML into a form that can be used by OLH. The second is to run multiple OLH jobs to load the different tables. I am not aware of any XML input formats that are provided in Hadoop. This is code that you will need to write yourself.

    The bulk of the code will be processing the XML elements to map them into appropriate rows for the destination tables. For example, you need a way to map a <contract> element into a row for the CONTRACT table and one or more rows for the LEG table. This code would be in the RecordReader of the InputFormat. This RecordReader would emit key, value pairs that might look like:

    contract 312, 55787-3, A833, ..., 99999
    leg 1, 1, 6666.00000, EUR, ..., 1.00000
    leg 2, 1, 3333.000....

    The mapper would be an identity mapper. In the shuffle/sort, all the contract keys and all the leg keys will be grouped in the reducers. The reducer can then write the values into files containing <contract> and <leg>rows. Furthermore, the row values are now in delimited text form.

    Then you can run two separate OLH jobs, one to load the CONTRACT table and another to load the LEG table.

    The important design element here is that there is code that produces key, values pairs where the key is the table to load and the value is a delimited text row for that table.
  • 2. Re: Can I use OLH from insideJava code ?
    574272 Journeyer
    Currently Being Moderated
    Recently I had to deal with it. if someone is still looking for a solution..

    Considering the above xml example.

    Here is one way to parse it using Java -
    1) Driver program using mahout's XMLINPUTFORMAT to get all the tags between start tag <contract> and end tag </contract>. Set the number of task reducer to 2.
    2) Mapper to parse xml and generate key-value as described by the above user "rabbott"

    The important design element here is that there is code that produces key, values pairs where the key is the table to load and the value is a delimited text row for that table.

    3) A partitioner class to partition on key - this will send all "contract" records to one reducer and all "leg" records to other reducer
    4) Reducer to just output the value

    You can also add custom output format to generate the filename as "contract" and "leg" instead what hadoop generates part-0000X, but then reducer would have output both key-value.

    I have blogged it which demonstrate parsing xml using hadoop streaming - http://odi-ee.blogspot.com/2012/06/hadoop-humongus-xml-files-and-odi.html

    Two OLH jobs will still be needed to move data from the two files into oracle tables.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points