We have a requirement to recognize the color of text in the source data (red / green). We are using CAS to crawl an unstructured data source (Word documents). Is there a way for CAS to recognize the color of the text? Does CAS provide this functionality at all?
CAS performs text-extraction on binary files (such as regular Word documents) before it stores the content. The text-extraction process disregards and discards all formatting, leaving only the raw text, so you won't have access to any color-formatting in the CAS output or record store.
If your files were in Word XML (and I'm guessing they aren't), you could process them as XML and handle the <w:color /> element. Otherwise, you'd have to build a custom component and use a third-party library capable of traversing the structure of the binary Word format. That's a fair bit of trouble just to detect text color, and I'd expect throughput to be very low compared with the standard text extraction.
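To give a rough idea of the XML route, here is a minimal Python sketch that pulls the text and color of each run out of a WordprocessingML string. It is not CAS code and does not use any CAS API. The function name colored_runs and the sample markup are made up for illustration, and the namespace shown is the one used by .docx files (older Word 2003 XML uses a different namespace URI, so pass that in if needed). It only sees color set directly on a run, not color inherited from a style.

```python
import xml.etree.ElementTree as ET

# Namespace used by .docx WordprocessingML. Assumption: your files use this one;
# Word 2003 XML uses a different URI, which can be passed in via `ns`.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def colored_runs(xml_text, ns=W_NS):
    """Return a list of (text, color) pairs, one per <w:r> run.

    color is the w:val attribute of <w:color /> (e.g. "FF0000"),
    or None when the run has no direct color formatting.
    """
    w = "{%s}" % ns
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.iter(w + "r"):
        # A run can hold several <w:t> pieces; join them in order.
        text = "".join(t.text or "" for t in run.iter(w + "t"))
        color_el = run.find(w + "rPr/" + w + "color")
        color = color_el.get(w + "val") if color_el is not None else None
        runs.append((text, color))
    return runs


# Hypothetical sample document: one red run followed by an unformatted run.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>overdue</w:t></w:r>'
    '<w:r><w:t> plain</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

if __name__ == "__main__":
    for text, color in colored_runs(SAMPLE):
        print(repr(text), color)
```

For a .docx file, the XML to feed in is the word/document.xml entry inside the zip archive, which Python's zipfile module can read. The resulting color values would then have to be written into a field of your own choosing before the records reach CAS.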