Making Queries To Search Engines From SQL In Oracle Database

Version 2

    Making Queries To Search Engines From SQL In Oracle Database


    By  Yuli Vasiliev



    One of those things that make Oracle Database outstanding is a wide range of options when it comes to accessing and manipulating data of different types – located both inside and outside of the database. This article illustrates Oracle Database’s ability to derive data from the external sources that can dynamically generate datasets upon requests coming from the database, joining then those datasets with relational data, if necessary, of course. In particular, it explains how you might access a search engine’s API from within an Oracle database, obtaining search results programmatically – that is, outside of a browser context.


    How It Works


    Viewing the Web as a huge data container, you can find it similar to a database, in a sense that both can be queried for information of interest. The key difference lies in the way you write a query. When writing a query against a database, you explicitly specify the database objects holding the data you want to retrieve. In contrast, when querying the Web, you may not know exactly where the information you need can be found, relying on a search engine to find it – based on a search string you provide. Of course, from the standpoint of a database user making a programmatic query to a search engine from within a SQL query, this is a fact of minor importance that the database employs an external application’s retrieval mechanisms behind the scenes to get the desired data, rather than using its own retrieval system. What really may be interesting is that the data retrieved that way can be joined with any other data being retrieved within the SQL query.


    In fact, accessing Web content from within the database is nothing new in Oracle Database. With the introduction of Oracle XQuery in release 10.2, Oracle Database acquired the ability to query XML-expressible datasets with a URL, shredding the results into relational data if necessary. It is important to note that not only static datasets, such as HTML and XML documents, can be queried with XQuery, but also URL callable applications generating dynamic datasets, including search engines, of course. In other words, everything that is callable with a URL can be queried with XQuery. Moreover, since Oracle Database supports XQuery through SQL/XML functions, such as XMLTable and XMLQuery, you can access a search engine’s API from within a SQL query, which makes it possible to combine the search results with the other data the SQL query retrieves. Diagrammatically, this might look like the figure below:


    Figure 1: Accessing a search engine’s API from within an Oracle database with Oracle XQuery.


    As you no doubt have realized, the approach discussed here relies on programmatic ways of accessing search engines – those ways that provide the ability for an application to interact with a search engine outside of a browser context. To fulfill such needs, most search engines today provide APIs allowing to do web searches programmatically. Google, for example, provides more than one such API: Custom Search API and Google Web Search API. The first one, in turn, allows you to choose between two editions: Custom Search Engine (basic edition) and Google Site Search (business edition).


    Doing Google Searches From Within an Oracle Database


    As mentioned in the preceding section, Google provides several alternatives when it comes to programmatic searches. The Custom Search API is the newest one. Before you can use it, however, you have to take some preliminary steps, which include: creating a Google account (if you don’t have one), creating a Custom Search Engine, and obtaining an API key. Further details are beyond the scope of this article and can be found in the Custom Search API documentation.


    If you just want to get a taste of programmatic Web searches from within an Oracle database, you could try the Google Web Search API for a quick example. Although this API has been deprecated as for now, the good news is that it still works and its Developer's Guide is also still available. The main advantage to chose it over the newer alternatives is that you don’t have to sign up to use it – nor any other preliminary steps are required.


    To make a request to the Google Web Search API from your code, you have to open the URL for the API with the q URL argument supplying the search expression. For example, if you want to perform a Web search for Oracle XQuery, the URL to open should look like this:




    For simplicity, the URL in the above example includes only two arguments, namely v and q, which supply the protocol version number and the search expression, respectively. These two are not the only arguments you can append to a search URL, though. For example, you might use argument rsz to specify the number of results that you would like to receive. If not supplied, the value of this argument is assumed to be 4. To look at the entire list of available arguments, check out the Standard URL Arguments section in the Google Web Search API Class Reference manual.


    You can try to open a search URL like the one shown above in any environment that supports accessing resources through URLs. The problem you may face, though, is the inability to access a JSON encoded result set returned by the Google Web Search API. Thus, prior to Oracle Database 12c, you could not make a request to the Google Web Search API directly from within an XQuery query. The problem was that Oracle Database did not recognize JSON. You had to employ a middleman between XQuery and the API, transforming JSON into XML. In particular, you could use a script written in Python or PHP, which would initiate a programmatic Web search and then transform JSON into XML before returning the results to the calling XQuery query. The diagram in the figure below gives a graphical depiction of this scenario:


    Figure 2: Accessing Google Web Search API with Oracle XQuery through an external script as a mediator.


    Before you can write such a mediator script however, you have to know the structure of the JSON object that represents a result set returned by a Google Web Search API query. This information can be found in the Result Objects section in the Google Web Search API Class Reference manual. Another simple way to look at the structure of a JSON result set returned is through a browser. For that, you can make a programmatic call to the API by pointing your browser to an URL like the one discussed in the beginning of this section, and then examine the retrieved JSON document. Alternatively, you can obtain this same JSON document through the database by issuing the following query:




    Looking through the JSON retrieved, you may notice that it includes a number of key/value pairs describing each search result returned. Suppose you are interested only in the title and URL for each result. In that case, according to the document structure, you might want to extract the values of the titleNoFormatting and url keys.


    In Python, the code for the mediator script might look like this:


    def index(req):
      #make a request to the Google Web Search API
      import urllib2
      import urllib
      import simplejson
      from mod_python import util
      params = util.FieldStorage(req)
      searchstr = urllib.quote(params['q'].value)
      url = (''
           '?v=1.0&q=%s' % searchstr)
      request = urllib2.Request(url)
      response = urllib2.urlopen(request)
      search_results = simplejson.load(response)
      results = search_results['responseData']['results']
      #derive required info from the search results and wrap it in XML   
      from xml.dom.minidom import getDOMImplementation
      impl = getDOMImplementation()
      newdoc = impl.createDocument(None, "results", None)
      top_element = newdoc.documentElement
      for result in results:
        line = newdoc.createElement("line")
        title = newdoc.createElement("title")
        title_text = newdoc.createTextNode(urllib.unquote_plus(result['titleNoFormatting']))
        url = newdoc.createElement("url")
        url_text = newdoc.createTextNode(urllib.unquote_plus(result['url']))
      #return XML as a string 
      return newdoc.toxml()


    Assuming the above script is named and available through a Web server on the localhost, you might try it out by pointing your browser to the following URL:




    This should output a XML document containing the search results (titles and URLs only) retrieved by the Google Web Search API for Oracle XQuery. Now you might want to test it out from within an Oracle database. So, you might issue the following SQL query in SQL*Plus:


    VARIABLE search_url VARCHAR2(200);
    :search_url := 'http://localhost/';
    'for $j in $rslts/results/line
      return $j'
      PASSING xmlparse (document httpuritype
      (:search_url).getCLOB()) as "rslts"
      COLUMNS title VARCHAR2(100) PATH 'title',
              url VARCHAR2(100) PATH 'url');


    The output should look like this (the actual data may vary, of course, since Google periodically updates the search positions of web pages):


    Using XQuery with Oracle XML DB - Oracle Documentation
    XMLQUERY - Oracle Documentation
    XQuery and Oracle XML DB - Oracle Documentation
    Oracle XMLQuery XML Query -


    In the above example, note the use of the XMLTable SQL/XML function that shreds the XML returned by the XQuery query into relational data. The structure of the relational view being created over the XML is defined in the COLUMNS clause of XMLTable. In particular, in the COLUMNS clause you define the relational columns, specifying then in the PATH clause what portion of the XML is to be used as the column content.


    Native Support For JSON In Oracle Database 12c


    Now, with Oracle Database 12c understanding JSON natively, you can do without a mediator script like that one you saw in the preceding section, accessing the Google Web Search API directly from within a SQL query, as depicted in the figure below:


    Figure 3: Accessing Google Web Search API directly in Oracle Database 12c.


    Picking out the data of interest from a JSON output can be done with JSON_TABLE – an Oracle SQL function available in Oracle Database 12c. JSON_TABLE is similar to XMLTable in that it creates a relational view of non-relational data. Just like XMLTable, JSON_TABLE has the COLUMNS clause to define the columns of the relational view to be created. In the following example, you use JSON_TABLE to create a relational view of the JSON data returned by the Google Web Search API:




    The output should be the same or similar to the one the XMLTable gave you in the preceding section, provided that the search expression passed to the Google Web Search API remained the same:


    Using XQuery with Oracle XML DB - Oracle Documentation
    XMLQUERY - Oracle Documentation
    XQuery and Oracle XML DB - Oracle Documentation
    Oracle XMLQuery XML Query -


    Filtering And Joining


    Turning back to the example from the preceding section, suppose you want to output only those search results for Oracle XQuery that contain substring Oracle Documentation in the title. To achieve this, you might use the WHERE clause of SELECT like in a regular SQL query issued against relational data:




    In the above example, you filter the retrieved search results on the database side. Not surprisingly, the shown result set can be less than the usual result set (4 results by default), since WHERE cuts off irrelevant rows:


    Using XQuery with Oracle XML DB - Oracle Documentation
    XMLQUERY - Oracle Documentation
    XQuery and Oracle XML DB - Oracle Documentation


    However, if you want to apply filtering and have the maximum number of rows in the result set anyway, you can instruct the searcher to look for results on Oracle XQuery only on the Oracle Documentation website by appending the site: keyword followed by the domain:




    This time the output might look like this:


    Using XQuery with Oracle XML DB - Oracle Documentation
    XQuery and Oracle XML DB - Oracle Documentation
    XMLQUERY - Oracle Documentation
    XQUERY - Oracle Documentation


    Finally, let’s look at an example of joining search results retrieved by the Google Web Search API with the database data. As an OTN author, I hold information about my articles ever published on OTN in a simple table called articles, which actually has only two columns: pubDate and title. Googling my name within the domain should output the titles and urls of my most popular OTN articles. Joining that data with the data stored in the articles table could, for example, remind me when a particular article was written:





    Here’s the output:


    ------- ------------------------------------------------------------
    2011    Article: How to XQuery Non-JDBC Sources from JDBC - Oracle
    2012    How to Use Oracle VM VirtualBox Templates, by Yuli Vasiliev
    2013    Evaluating Oracle Solaris 11 from Inside Oracle VM VirtualBox
    2012    Tech Article: How to Use Oracle VM Templates, by Yuli Vasiliev




    In this article, you learned how easy it is to make a request to the Google Web Search API from within a SQL query issued against an Oracle database, and then shred the results retrieved into relational rows. First, you learned how the Google Web Search API can be accessed via an external script as a mediator called from within a SQL query. This approach can be used in Oracle Database starting with release 10.2. Then, you looked at an example of accessing a JSON output of the Google Web Search API directly, taking advantage of the native support for JSON in Oracle Database 12c.


    See Also



    About the Author


    Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, Java technologies, business intelligence (BI), databases, service-oriented architecture (SOA), and more recently virtualization. He is the author of a series of books on the Oracle technology, including Oracle Business Intelligence: An introduction to Business Analysis and Reporting (Packt) and PHP Oracle Web Development: Data processing, Security, Caching, XML, Web Services, and Ajax (Packt).