Percolator is much slower than in ES1, and pre-selecting do not work #16285

Open
garipovazamat opened this issue Oct 11, 2024 · 3 comments
Labels
bug Something isn't working Performance This is for any performance related enhancements or bugs Search:Performance Search Search query, autocomplete ...etc

Comments

garipovazamat commented Oct 11, 2024

What is the bug?

We have been trying to migrate from Elasticsearch 1.7.6 to the latest version (8.15) at our company and discovered that the latest version is much slower. To find the reason for this degradation, I conducted several experiments. During these experiments, I found that some of the claimed improvements likely do not work as expected. I have duplicated this issue from the Elasticsearch repository because I found the same problem in OpenSearch. I believe the problem was carried over when OpenSearch was forked.

Experiment details

I created the following index mapping:

{  
    "properties": {  
        "props": {  
            "properties": {  
                "entity_obj": {  
                    "properties": {  
                        "category": {"type": "keyword"},  
                        "id": {"type": "integer"},  
                        "priceTotal": {"type": "integer"},  
                        "totalArea": {"type": "double"}  
                    }  
                },  
                "price": {"type": "long"},  
                "room": {"type": "short", "store": True}  
            }  
        },  
        "query": {"type": "percolator"}  
    }  
}

I filled the index with 10,000 identical queries that contain only must, should, term and range conditions:

{  
    "query": {  
        "bool": {  
            "must": [  
                {"term": {"props.entity_obj.category": "flat1"}},  # first, simple condition
                {  # second, more complicated condition
                    "bool": {  
                        "must": [  
                            {"bool": {  
                                "should": [  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  
                                    {"range": {"props.price": {"gte": 1000}}},  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  
                                    {"range": {"props.price": {"gte": 1000}}},  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  
                                    {"range": {"props.price": {"gte": 1000}}},  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  
                                    {"range": {"props.price": {"gte": 1000}}},  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  
                                    {"range": {"props.price": {"gte": 1000}}},  
                                    {'term': {"props.entity_obj.category": 'flat2'}},  # the more such conditions, the longer the percolation
                                    {"range": {"props.price": {"gte": 1000}}},  
                                ]  
                            }},  
                        ]  
                    }  
                }  
            ]  
        }  
    }  
}
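For reference, the setup above can be reproduced along these lines. This is a minimal sketch, not the original script: the helper names `build_query` and `bulk_body` and the index name `percolator_test` are assumptions made for illustration. It only builds the NDJSON body for the `_bulk` API; sending it to the cluster is left out.

```python
import json


def build_query(n_pairs: int) -> dict:
    """Stored percolator query: one simple term clause plus a nested bool
    whose `should` list repeats the same term/range pair n_pairs times."""
    should = []
    for _ in range(n_pairs):
        should.append({"term": {"props.entity_obj.category": "flat2"}})
        should.append({"range": {"props.price": {"gte": 1000}}})
    return {
        "bool": {
            "must": [
                {"term": {"props.entity_obj.category": "flat1"}},
                {"bool": {"must": [{"bool": {"should": should}}]}},
            ]
        }
    }


def bulk_body(index: str, n_docs: int, n_pairs: int) -> str:
    """NDJSON body for the _bulk API: an action line, then the document source."""
    query = build_query(n_pairs)
    lines = []
    for doc_id in range(n_docs):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"query": query}))
    return "\n".join(lines) + "\n"


# "percolator_test" is a hypothetical index name; 6 pairs matches the query above.
body = bulk_body("percolator_test", n_docs=10_000, n_pairs=6)
```

Raising `n_pairs` is the knob that made percolation slower in my experiments.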

As you can see, there are two main conditions inside must: the first is simple, and the second is somewhat more complex. Logically, there is no reason to check the second condition if the first one is false. However, my experiments showed that even when the first condition is false for a document, adding more conditions inside should (the second condition) increases the percolation time. I therefore conclude that the improvements claimed in this article, https://www.elastic.co/blog/elasticsearch-percolator-continues-to-evolve, do not work as described:
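To make the expectation concrete, here is a toy model of short-circuit evaluation. It is only an illustration of the reasoning, not how Lucene or the percolator actually evaluates queries. The `matches` function is hypothetical, supports only term, range (gte) and bool with must/should, and works on a flattened document.

```python
def matches(query: dict, doc: dict, counter: list) -> bool:
    """Evaluate a tiny subset of the query DSL against a flat document,
    counting every leaf clause (term/range) that actually gets evaluated."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        counter[0] += 1
        field, value = next(iter(body.items()))
        actual = doc.get(field)
        return value in actual if isinstance(actual, list) else actual == value
    if kind == "range":
        counter[0] += 1
        field, bounds = next(iter(body.items()))
        return field in doc and doc[field] >= bounds["gte"]
    if kind == "bool":
        must = body.get("must", [])
        should = body.get("should", [])
        # all() stops at the first failing must clause (short-circuit).
        if not all(matches(q, doc, counter) for q in must):
            return False
        # With no must clauses, at least one should clause has to match.
        if should and not must:
            return any(matches(q, doc, counter) for q in should)
        return True
    raise ValueError(f"unsupported clause: {kind}")


doc = {"props.entity_obj.category": ["flat2"], "props.price": 10001}
pair = [
    {"term": {"props.entity_obj.category": "flat2"}},
    {"range": {"props.price": {"gte": 1000}}},
]
query = {"bool": {"must": [
    {"term": {"props.entity_obj.category": "flat1"}},
    {"bool": {"must": [{"bool": {"should": pair * 6}}]}},
]}}

evaluated = [0]
result = matches(query, doc, evaluated)
```

In this model the document does not match and only one leaf clause is evaluated, no matter how many pairs the should list contains. That is the behaviour I expected from the percolator.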

Also the percolator will no longer load the percolator queries as Lucene queries into memory as they are instead read from disk. Pre 5.0 if you had thousands of percolator queries they’d take up megabytes of precious JVM heap space, putting pressure on jvm garbage collecting and if not being careful lead to an infamous jvm out of memory error. Back then loading the percolator queries into memory made sense because all the percolator queries were evaluated all the time so we made executing each one as fast as possible. Now with pre-selecting, only percolator queries that are likely to match. We decided to trade speed for stability, removing the caching to free up memory. The speed loss is more than paid for by skipping most queries in most cases.

I ran the percolation with the following request:

{  
  "constant_score": {  
    "filter": {  
      "percolate": {  
        "field": "query",  
        "document": {  
          "props": {  
            "entity_obj": {  
              "category": ["flat2"],  
              "id": 1,  
              "priceTotal": 10001,  
              "totalArea": 100  
            },  
            "price": 10001,  
            "room": 1  
          }  
        }  
      }  
    }  
  }
}

As a result, percolating one document took ~0.157 seconds.
I conducted the same experiment on Elasticsearch 1.7.6 with identical data, and the result was ~0.008 seconds, which is about 20x faster.
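The comparison can be checked with a small timing helper. This is a sketch under assumptions: `median_latency` is a hypothetical helper, and the callable passed to it would wrap whatever client call sends the percolate request. Only the two figures come from the measurements above.

```python
import statistics
import time


def median_latency(fn, repeats: int = 20) -> float:
    """Median wall-clock time of fn() in seconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Figures reported above, in seconds per percolated document.
current_version = 0.157
es_1_7_6 = 0.008
slowdown = current_version / es_1_7_6  # a little under 20
```

Using the median rather than a single run keeps one slow request (for example a cold cache) from skewing the result.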

We also tried percolating with real production data. The only improvement we saw came from combining the percolate query with additional filters on metadata that we extracted from the stored query. For example, we took the query mentioned above and added metadata (the meta_data.category field):

{  
  "query": {  
    "bool": {  
      "must": [  
        {"term": {"props.entity_obj.category": "flat1"}},  
        {  
          "bool": {  
            "must": [  
              {  
                "bool": {  
                  "should": [  
                    {"term": {"props.entity_obj.category": "flat2"}},  
                    {"range": {"props.price": {"gte": 1000}}}  
                  ]  
                }  
              }  
            ]  
          }  
        }  
      ]  
    }  
  },  
  "meta_data": {  
    "category": "flat1"  
  }  
}
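The extraction step could look roughly like this. It is a simplified sketch, not our production code: the function names are made up for illustration, and it only looks at term clauses directly inside the top-level must.

```python
CATEGORY_FIELD = "props.entity_obj.category"


def extract_category(query: dict):
    """Return the category required by a top-level must/term clause, or None."""
    for clause in query.get("bool", {}).get("must", []):
        term = clause.get("term", {})
        if CATEGORY_FIELD in term:
            return term[CATEGORY_FIELD]
    return None


def with_meta_data(query: dict) -> dict:
    """Document to index: the percolator query plus extracted metadata, if any."""
    stored = {"query": query}
    category = extract_category(query)
    if category is not None:
        stored["meta_data"] = {"category": category}
    return stored


query = {"bool": {"must": [
    {"term": {"props.entity_obj.category": "flat1"}},
    {"bool": {"must": [{"bool": {"should": [
        {"term": {"props.entity_obj.category": "flat2"}},
        {"range": {"props.price": {"gte": 1000}}},
    ]}}]}},
]}}
stored = with_meta_data(query)
```

Queries without a top-level category condition get no meta_data field at all, which is why the search request has to handle the missing-field case.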

Then I sent the following request:

{    
  "constant_score": {    
    "filter": {    
      "bool": {    
        "must": [  
          {  # additional filter
            "bool": {  
              "should": [  
                {"term": {"meta_data.category": "flat1"}},  
                {"bool": {"must_not":  {"exists": {"field": "meta_data.category"}}}}  # matches stored queries that have no category filter
              ]  
            }  
          },  
          {"percolate": {"field": "query", "document": document}}  # document: the document being percolated
        ]    
      }    
    }    
  }    
}

But this approach has a disadvantage: it becomes harder to percolate a large batch. If I need to percolate many documents, I have to split them by the category field, which results in smaller batches and negates the benefit of percolating many documents in one query. I also tried named percolation (the name field in the percolate query) and built a request with several percolate queries inside (one per category), but this had no advantage over separate requests (the percolation time was the same).
In general, extracting metadata and filtering on it seems like unnecessary work that forces us to maintain those extra filters. The search engine should handle such optimizations itself. I suspect this is what the "pre-selecting" feature is supposed to do.
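The batching problem can be sketched as follows. The helper names are hypothetical and this is not the code from the linked scripts. It groups documents by their category values and builds one request per group, using the `documents` array of the percolate query and a terms filter on meta_data.category.

```python
from collections import defaultdict


def category_key(doc: dict) -> tuple:
    """Category values of a document as a sorted tuple, usable as a dict key."""
    value = doc["props"]["entity_obj"]["category"]
    values = value if isinstance(value, list) else [value]
    return tuple(sorted(values))


def batched_requests(docs: list) -> list:
    """One filtered percolate request per distinct category key."""
    groups = defaultdict(list)
    for doc in docs:
        groups[category_key(doc)].append(doc)
    requests = []
    for categories, group in groups.items():
        requests.append({"constant_score": {"filter": {"bool": {"must": [
            {"bool": {"should": [
                {"terms": {"meta_data.category": list(categories)}},
                # stored queries that have no category filter at all
                {"bool": {"must_not": {"exists": {"field": "meta_data.category"}}}},
            ]}},
            {"percolate": {"field": "query", "documents": group}},
        ]}}}})
    return requests


docs = [
    {"props": {"entity_obj": {"category": ["flat1"]}, "price": 500}},
    {"props": {"entity_obj": {"category": ["flat2"]}, "price": 10001}},
    {"props": {"entity_obj": {"category": ["flat1"]}, "price": 2000}},
]
requests = batched_requests(docs)
```

With many distinct categories, each request carries only a few documents, so one large batch degrades into many small ones.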

Python scripts for experiments (python 3.12): scripts

Conclusion

Currently, percolation, even with stored queries made of simple filters, performs significantly slower than in older versions of Elasticsearch. It seems likely that the latest version lacks the pre-selecting optimization, or that it is not functioning correctly. Alternatively, I might have missed something and it has to be enabled explicitly. I would appreciate any help you can provide to resolve this problem.

@garipovazamat garipovazamat added bug Something isn't working untriaged labels Oct 11, 2024
@peternied peternied transferred this issue from opensearch-project/.github Oct 11, 2024
@peternied peternied added Performance This is for any performance related enhancements or bugs Search Search query, autocomplete ...etc Search:Performance and removed untriaged labels Oct 11, 2024
@peternied
Member

@garipovazamat Moving this issue to the OpenSearch core repo where it would be addressed, thanks for creating this issue.

Member

dblock commented Oct 11, 2024

This is very well detailed, thank you @garipovazamat. Since you have a repro, have you tried bisecting this to a change via ./gradlew run?

Author

garipovazamat commented Oct 14, 2024

have you tried bisecting this to a change

@dblock Most likely not. What do you mean? What do you suggest I bisect?
I did everything in a Docker container.
