1 Reply. Latest reply: Feb 6, 2013 5:04 AM by Brett R-Oracle

    CAS Crawl - Identifying the Text color

      Hi All,

      We have a requirement to recognize the color of text in the source data (red / green). We are using CAS to crawl an unstructured data source (Word documents). Is there a way for CAS to recognize the color of the text? Does CAS provide this functionality at all?

        • 1. Re: CAS Crawl - Identifying the Text color
          Brett R-Oracle
          Hi Mahesh

          CAS performs text-extraction on binary files (such as regular Word documents) before it stores the content. The text-extraction process disregards and discards all formatting, leaving only the raw text, so you won't have access to any color-formatting in the CAS output or record store.

          If your files were in Word XML (and I'm guessing they aren't), you could process them as XML and handle the <w:color /> element. Otherwise, you'd have to build a custom component and use a third-party library capable of traversing the structure of the binary Word format. That's a lot of trouble to go to just to detect text color, and I'd expect throughput to be much lower than with standard text extraction.
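
          As a rough illustration of the XML route only, here is a minimal Python sketch that runs outside CAS. It assumes the WordprocessingML namespace used by .docx packages (older Word 2003 XML uses a different namespace URI), and the sample fragment and the colored_runs helper are made up for the example. It reads only the color set directly on a run, so color inherited from a style would not be detected.

```python
import xml.etree.ElementTree as ET

# Assumption: the WordprocessingML namespace used inside .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical sample fragment: one paragraph with a red run, a plain run, and a green run.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>Overdue</w:t></w:r>
      <w:r><w:t xml:space="preserve"> and </w:t></w:r>
      <w:r><w:rPr><w:color w:val="00B050"/></w:rPr><w:t>Paid</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def colored_runs(xml_text):
    """Return (text, color) pairs for each run; color is None when no w:color is set on the run."""
    root = ET.fromstring(xml_text)
    result = []
    for run in root.iter(f"{{{W_NS}}}r"):
        # Direct run formatting lives in w:rPr; the color value is the w:val attribute.
        color = run.find("w:rPr/w:color", NS)
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        value = color.get(f"{{{W_NS}}}val") if color is not None else None
        result.append((text, value))
    return result


if __name__ == "__main__":
    for text, color in colored_runs(SAMPLE):
        print(repr(text), color)
```

          The (text, color) pairs would still need to be mapped to whatever property you want on the record (for example, a flag for red or green text) before the content reaches the record store.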