8 Replies Latest reply: Feb 1, 2013 3:53 AM by Michael Peel

Record filters don't filter records before search?

paul s Newbie
The Endeca BasicDevGuide has a section "Search Query Processing Order", which lists record filtering as the first step in the search process.

I understood this to mean that the subsequent operations would run against the filtered data set, and that they would therefore be much faster than if they ran against the unfiltered data set. However, I've recently learned from Oracle support that this is not the case. The record filter and the search are each applied separately to the full data set, and the intersection of the two is returned as the result set.

I wanted to find out whether other users are relying on record filters under the assumption that they limit the number of records the search has to run against. I also wanted to ask the Endeca product team whether this is a feature that could be supported in the near future.

Thanks,
Paul
  • 1. Re: Record filters don't filter records before search?
    Branchbird - Pat Journeyer
    Hi Paul,

    Do you feel comfortable sharing exactly what you were told by support? I have always been under the impression that Record Filters act as a pre-filter just as you described.

    From a performance standpoint, I can certainly see that, depending on your filter and search criteria, performance could be the same, better or worse from query to query and that merely limiting the records in your corpus may not buy you anything in terms of query latency. For example, a really complex record filter with 100-200 predicates combined with a search might perform worse than just performing that same search against the full data set.

    However, in terms of logical evaluation, it seems like a bug if record filtering does not pre-filter or at least "behave like it's pre-filtering". Take the example of Did You Mean. If you consider an application with strict security requirements, there would be a risk of suggesting terms that a user may not "have access to" as part of a Did You Mean result. If I had an Endeca index sourced from a repository of sensitive documents and I were to search for some obscure term, there would be a risk of Endeca suggesting a close term that is contained in the index generally, but not in my filtered corpus of data that I have access to. This would only happen if search was getting evaluated across the entire index, not just the portion to which I have been granted access.

    Hope you're doing well, my impressions are the same as yours.

    Patrick Rafferty
    http://branchbird.com
  • 2. Re: Record filters don't filter records before search?
    paul s Newbie
    Hi Pat,

    What support told me is that the filtering and searches are both done on the full data set and then the intersection of the two sets is found. As you said, it behaves like a pre-filter and the security requirements you mentioned will be met. They said it was done this way to allow the record-filter and search to be cached separately.
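To make the distinction concrete, here is a minimal toy sketch of the two evaluation strategies being discussed. It is not how the Dgraph is implemented; the corpus, the `matches` helper and both function names are made up for illustration. It only shows that the two strategies return the same records while differing in how many records the search step has to look at.

```python
from typing import Dict, Set, Tuple

# Toy corpus: record id -> (record_type, text). Purely illustrative data.
RECORDS: Dict[int, Tuple[str, str]] = {
    1: ("article", "software programming methods"),
    2: ("article", "hardware design"),
    3: ("standard", "software testing"),
    4: ("standard", "programming languages"),
}


def matches(text: str, term: str) -> bool:
    """Hypothetical stand-in for a text search: exact word match."""
    return term in text.split()


def prefilter_then_search(record_type: str, term: str) -> Tuple[Set[int], int]:
    """Strategy A: search only the records that pass the filter."""
    candidates = {rid for rid, (rtype, _) in RECORDS.items() if rtype == record_type}
    hits = {rid for rid in candidates if matches(RECORDS[rid][1], term)}
    return hits, len(candidates)  # second value: records the search examined


def search_and_intersect(record_type: str, term: str) -> Tuple[Set[int], int]:
    """Strategy B: filter and search the full set independently, then intersect."""
    filter_set = {rid for rid, (rtype, _) in RECORDS.items() if rtype == record_type}
    search_set = {rid for rid, (_, text) in RECORDS.items() if matches(text, term)}
    return filter_set & search_set, len(RECORDS)


hits_a, scanned_a = prefilter_then_search("standard", "software")
hits_b, scanned_b = search_and_intersect("standard", "software")
print(hits_a == hits_b, scanned_a, scanned_b)
```

Both strategies give the same result set, which is why the filter still "behaves like" a pre-filter for security purposes. In strategy B the search cost does not shrink with the filter, but each of the two intermediate sets can be cached and reused on its own, which matches the reason support gave.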

    However, our scenario is that we have 3.5 million+ records of one type and 75K records of another type, and we use a record_type dimension as a filter to limit searches to one or the other. We designed the application assuming that searches on the 75K records would put far less load on the Dgraph than searches against the 3.5 million records. Specifically, we allow a single-character wildcard search on a subset (filtered on another dimension) of the smaller record set, because we assumed that the search was only being done against a single property on a data set of about 2000 records.

    We also have a requirement to highlight search words when going to the details page from a search results page, and we implemented this using a record filter on the record specifier combined with the original search, rather than do it as a record query, which will not return search terms or query expansions.

    But it now appears that having a record filter in these cases will not make the searches any faster than doing the search and then refining on the record_type.

    So we're now considering whether we should 1) create separate indexes for the two record types (a lot of overhead and workflow complexity), 2) name the properties on the two record types differently (assuming that performance would improve if the property we are searching on does not exist on the 3.5 million records), or maybe 3) create a separate property using the first letters of a word in the property we are searching on, so that we can simulate a single-character wildcard search.

    A much preferable option would be for Endeca to provide a query parameter that says "forget the caching, do the searches only on the filtered data set". I think there would be several use cases like ours where this would be immensely helpful.

    - Paul
  • 4. Re: Record filters don't filter records before search?
    Michael Peel Journeyer
    I agree with Pat on this - if I recall correctly record filter functionality was added specifically to implement security (i.e. simple ACLs), and that wouldn't work if they were somehow evaluated separately. The documentation (Advanced Developer's Guide) states: "If you specify a record filter, whether for security, custom catalogs, or any other reason, it is applied before any search processing. The result is that the search query is performed as if the data set only contained records allowed by the record filter."

    Can you run some tests with your data set just to confirm? I.e. try the filter+search combination, then repeat the same as independent operations. If filters are applied first, then the filter+search query should execute significantly faster than just the search (given 75k records vs 4.25m). If you don't have the full data, then either create a simple script to mock it up, or run against a subset with an equivalent ratio between filter+search and search only. Repeat this a few times with different search terms (but the same record filter).

    Cheers

    Michael
  • 5. Re: Record filters don't filter records before search?
    paul s Newbie
    Michael - I opened a ticket with support because we were seeing high response times for some queries that used record filters to reduce the data set. After checking with Engineering, the support consultant told me that record filters don't reduce the data set that the search operates on. They do "act" like pre-filters from the security perspective. Anyway, to check, I ran a wildcard search against our data set in two ways, using a refinement that has a record count of about 27K out of the total set of 3.5 million+ records.

    The first time, I ran the search with the refinement first, and then the search with a record filter on that refinement.
    The second query did run much faster, but I think that was because the request was cached. So I ran the same queries on another Dgraph in reverse order - record filter + search first, then search + refinement.

    Here is an extract from the log files for the two queries:

    1359141777668 (ip) - 147927 44044 1919.80 *1918.11* 200 24 0 10 /graph?node=0&offset=0&nbins=10&attrs=Search_All+soft*prog*meth*|mode+matchall&filter=AND%283405%29

    1359141783332 (ip) - 147931 271201 1327.89 *1321.98* 200 24 0 10 /graph?node=3405&offset=0&nbins=10&attrs=Search_All++soft*prog*meth*|mode+matchall

    The record filter query took 1918 ms while the refinement query only took 1321 ms.

    - Paul
  • 6. Re: Record filters don't filter records before search?
    Michael Peel Journeyer
    Hi Paul

    The test you are running compares a record filter against a navigation filter, though. I thought the question was whether a search executes against the data that remains after a filter is applied. That would look like:

    /graph?node=0&offset=0&nbins=10&attrs=Search_All+soft*prog*meth*|mode+matchall&filter=AND%283405%29
    /graph?node=0&offset=0&nbins=10&attrs=Search_All++soft*prog*meth*|mode+matchall

    I would expect query one to execute faster because the search is only against the restricted set of 27k records. Can you try that? You need to clear the cache between queries too: use "./control/runcommand.bat Dgraph1 cycle" between calls to restart the index and clear the cache.
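A minimal sketch of how such a comparison could be scripted, assuming you supply your own way of issuing a query and of clearing the cache. The function name `time_queries` and both callables are hypothetical; nothing here is part of an Endeca API. It measures client-side wall-clock time, so the numbers will not match the Dgraph request log exactly.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_queries(
    queries: Dict[str, str],
    run_query: Callable[[str], object],
    clear_cache: Callable[[], None],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the median wall-clock time in ms for each labelled query.

    run_query and clear_cache are supplied by the caller (assumptions):
    for example an HTTP GET against your Dgraph, and whatever step you
    use to restart the Dgraph or flush its cache.
    """
    results: Dict[str, float] = {}
    for label, url in queries.items():
        samples: List[float] = []
        for _ in range(repeats):
            clear_cache()  # start every measurement from a cold cache
            start = time.perf_counter()
            run_query(url)
            samples.append((time.perf_counter() - start) * 1000.0)
        results[label] = median(samples)
    return results
```

The `queries` dictionary would hold the two URLs above (with and without `&filter=AND%283405%29`), and `clear_cache` would wrap the cycle command. Repeating the run with different search terms but the same record filter helps rule out one-off effects.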

    Thanks

    Michael
  • 7. Re: Record filters don't filter records before search?
    paul s Newbie
    Hi Mike,

    I compared record filter vs. refinement with a search because record filters are supposed to be pre-filters and refinements are post-filters. Anyway, I ran a search only, a record filter only (limiting the data set on the record identifier), and the search combined with the record filter, with an op=flush after each to clear the cache. Here are the log records.

    Search only (on 3.5 million records)
    1359676860306 140.98.135.200 - 8692 8198 1048.81 *556.38* 200 6288 0 10 /graph?node=0&offset=0&nbins=10&attrs=p_Abstract+sec*ind*mag*|mode+matchall&dym=1&irversion=601&nbins=1

    Record-filter only (returns 1 record)
    1359676111250 140.98.135.200 - 8330 9974 186.23 *5.06* 200 1 0 10 /graph?node=0&offset=0&nbins=10&filter=AND%28p_Endeca_Id%3a92286%29&irversion=601

    Record-filter + search -
    1359676229906 140.98.135.200 - 8393 10377 553.43 *348.45* 200 1 0 10 /graph?node=0&offset=0&attrs=p_Abstract+sec*ind*mag*|mode+matchall&dym=1&filter=AND%28p_Endeca_Id%3a92286%29&irversion=601&nbins=1

    So the filter did reduce the search time from 556 ms to 348 ms, but considering that the record filter alone took only 5 ms and returned one result, I would have expected a much larger improvement in response time if the search were really running against a single record.
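For anyone who wants to repeat the comparison on their own logs, here is a small sketch that pulls the timing out of the three log lines above and computes the relative improvement. It assumes the value between asterisks is the per-query time in milliseconds that is quoted in this thread; the meaning of the other columns is not assumed. The names `starred_ms` and `LOG_LINES` are made up for this example.

```python
import re

# The three request-log lines quoted above, keyed by what each query did.
LOG_LINES = {
    "search only": "1359676860306 140.98.135.200 - 8692 8198 1048.81 *556.38* 200 6288 0 10 /graph?node=0&offset=0&nbins=10&attrs=p_Abstract+sec*ind*mag*|mode+matchall&dym=1&irversion=601&nbins=1",
    "filter only": "1359676111250 140.98.135.200 - 8330 9974 186.23 *5.06* 200 1 0 10 /graph?node=0&offset=0&nbins=10&filter=AND%28p_Endeca_Id%3a92286%29&irversion=601",
    "filter + search": "1359676229906 140.98.135.200 - 8393 10377 553.43 *348.45* 200 1 0 10 /graph?node=0&offset=0&attrs=p_Abstract+sec*ind*mag*|mode+matchall&dym=1&filter=AND%28p_Endeca_Id%3a92286%29&irversion=601&nbins=1",
}

# A whole token of the form *123.45* (the highlighted timing column).
STARRED = re.compile(r"^\*(\d+(?:\.\d+)?)\*$")


def starred_ms(line: str) -> float:
    """Return the asterisk-marked time from one log line, in milliseconds."""
    # Skip the last token (the URL), which contains wildcard asterisks of its own.
    for token in line.split()[:-1]:
        match = STARRED.match(token)
        if match:
            return float(match.group(1))
    raise ValueError("no starred timing field found")


times = {label: starred_ms(line) for label, line in LOG_LINES.items()}
reduction = (times["search only"] - times["filter + search"]) / times["search only"]

for label, ms in times.items():
    print(f"{label}: {ms:.2f} ms")
print(f"reduction from adding the filter: {reduction:.1%}")
```

On these three lines the reduction comes out at roughly 37%, even though the filter on its own resolves to a single record in about 5 ms.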

    Is there anyone from Endeca who can explain how record filters actually work?
  • 8. Re: Record filters don't filter records before search?
    Michael Peel Journeyer
    Hi Paul

    From the results of those tests it definitely looks like I was wrong and the explanation provided by engineering via support is right. Not sure how the dgraph resolves the intersection of the two result sets under the hood, but hopefully someone from Product Management/Engineering will chip in. In terms of application design though, if you've got wildcard searching against a subset of data it definitely looks like having the subset in a separate index is the best approach.

    Michael
