If you add "." as a searchable character then a search for ".net" should only return results for ".net" rather than "net" (unless .net doesn't exist and it spell-corrects). Given you've mentioned searchable characters in your post, I'm guessing you are aware of this and actually want to perform term extraction on your data (so domain.net would match a search for ".net") - is that correct?
If we add "." as a searchable character, then words ending with a dot will no longer be found without the dot: for example, where the text contains "java.", a search for "java" won't find it any more, but "java." will.
That is correct - if you make "." searchable, then it will treat all occurrences equally - "java" would match "java", but not "java." (unless there were no occurrences of "java", in which case it would spell-correct to "java."). If you want "." to be indexed for certain words but not others, do you know what those words are in advance, i.e. do you have a pre-defined list of these? I would look at doing term extraction if so - that is the only way I can think of to get what you want without creating issues elsewhere (depending on your data).
Yes, we know in advance the terms we want to be searchable. It's a very short list:
- With a dot, the only one we know of is ".net"; it is by far the most important to make searchable (customers ask for this every day)
- With other special characters, there may be "$U", "C++" or "C#" (we added "+" and "#" as searchable characters and it works fine, but given the choice we would prefer to specify just these terms separately and remove "+" and "#" as searchable characters)
Having control over this list, so that we can update it when needed (some brand names may be useful), would be perfect. But the basic need is only for ".net".
I did not understand what you mean by "term extraction".
Term extraction is when you monitor the data during ingest and pull out specific terms you are interested in. It is usually done to extract metadata from unstructured data (e.g. if "java" exists in the text of a CV upload, you would "extract" that term into a separate property, and map that separate property to your "Programming Languages" dimension).
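To make the idea concrete, here is a minimal sketch of term extraction in plain Python. It is not the pipeline's manipulator API; the term list, the `extract_terms` helper and the `cv_text` / `programming_languages` property names are all made up for illustration. It assumes a term only counts when it is not glued to other letters or digits, so "java." at the end of a sentence matches but "javascript" and "domain.net" do not.

```python
import re

# Hypothetical list of terms of interest (illustrative only).
TERMS = [".net", "C++", "C#", "java"]


def extract_terms(text, terms=TERMS):
    """Return the terms from `terms` that occur in `text` as standalone tokens."""
    found = []
    for term in terms:
        # The term must not be preceded or followed by a letter or digit.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical record: extract terms from the free text into a separate property.
record = {"cv_text": "Five years of Java and C# development. Some .NET experience."}
record["programming_languages"] = extract_terms(record["cv_text"])
```

The separate property is what you would then map to the dimension. If you did want "domain.net" to count as a match for ".net", you would relax the check in front of the term.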
I appreciate you want some solution that just works out of the box, but unfortunately given the use of "." as a sentence end, there is no way (or at least, none I can think of) that you can make "." a searchable character only if it appears at the start of the word "net" and not in any other cases. The way I would handle it would be to:
1) Add a manipulator (Java or Perl) to the pipeline that takes a pre-defined list of terms and loops through all (relevant) properties on the records
2) Tag any matches for those terms to a new property on the record
3) If a match contains a non-searchable, non-alphanumeric character, have these characters replaced with searchable ones, e.g. "." -> "~", so .net -> ~net
4) Repeat the transformation step in (3) in your web application (e.g. "." -> "~", so .net becomes ~net)
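The four steps above could be sketched roughly as follows. This is standalone Python meant only to show the logic, not a real manipulator (which would be Java or Perl) and not any product API. `PROTECTED_TERMS`, `CHAR_MAP`, `tag_record`, `encode_query` and the `extracted_terms` property are hypothetical names, and the sketch assumes "~" has been configured as a searchable character.

```python
import re

# Hypothetical pre-defined list of terms (step 1).
PROTECTED_TERMS = [".net"]
# Hypothetical mapping from non-searchable to searchable characters (step 3).
CHAR_MAP = {".": "~"}


def encode_term(term):
    """Replace each mapped character with its searchable stand-in."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in term)


def _term_pattern(term):
    # Match the term only when it is not glued to other letters or digits.
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def tag_record(record, source_props, target_prop="extracted_terms"):
    """Steps 1-3: scan the given properties and store encoded matches in a new property."""
    matches = []
    for prop in source_props:
        value = record.get(prop, "")
        for term in PROTECTED_TERMS:
            if _term_pattern(term).search(value):
                encoded = encode_term(term)
                if encoded not in matches:
                    matches.append(encoded)
    tagged = dict(record)  # leave the original record untouched
    tagged[target_prop] = matches
    return tagged


def encode_query(query):
    """Step 4: apply the same substitution to the search terms in the web application."""
    for term in PROTECTED_TERMS:
        query = _term_pattern(term).sub(lambda m: encode_term(m.group(0)), query)
    return query
```

With this sketch, a record whose title is "Senior .NET developer" gets "~net" in the new property, and a user query of ".net developer" is rewritten to "~net developer" before it is sent to the engine. A full stop at the end of a sentence is left alone because it is not part of a listed term.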
Thank you for your answer.
Unfortunately, as I explained at the beginning of my first post, we can't ask for development work in every web application that uses the same dgraph just to do a transformation on the fly.
We also use snippeting, with an indication of the related property. If we duplicate the data into a new property dedicated to .net, we will lose the originating property (we could imagine storing this information in the new property itself, as something like a JSON-formatted record, but that is far too complicated for the web apps using our dgraph).
Edited by: 984589 on 1 Feb 2013 02:52