Germplasm search optimizations #7

jloux-brapi · 2025-03-17T21:29:12Z

A couple of optimizations, bug fixes and workarounds have been provided to improve the performance and usability of the germplasm search endpoint.

Hibernate was issuing a warning: firstResult/maxResults specified with collection fetch; applying in memory which was a big clue to why this search endpoint was performing so poorly. Tim essentially tried to implement as vlad described here, but there was a critical error, we passed through all of the query logic to the search query builder, which today runs paginated queries no matter what. This means that while we did the grunt work of fetching the IDs separately and giving them to another query, we ended up with the same applying in memory error as before. This code has been changed so that not only does the SearchQueryBuilder support non-paginated queries, there is also more support for the kind of double query that the paginated fetches require. (See GermplasmService.findGermplasmEntities()). The performance improvement is orders of magnitude. I went from about 4.5 seconds 100 records to about 500ms, 15 seconds on 1000 to 1 second on a dataset of 550k germs on a program.
A workaround was created for BI to send either a null page and pageSize or neither attributed be present in the request to kick off some new logic on the germplasm search POST endpoint which will utilize the new non-paginated query that SearchQueryBuilder supports to return all data at once, non-paginated. This has a breaking point, however, as at about 250k germ records per program java completely exhausts its heap trying to load all the data in entity objects and converting them to Json. It should be noted this also isn't particularly fast, as this is a large amount of data to transmit. 125k records takes about 30 seconds on average to get back. But this should work as a stop gap in the meantime.
While testing both of these features I also noticed that when more pages are requested than are currently available, the system can act in peculiar and sometimes disastrous behaviors, such as fetching all records for a query instead of paginating at all, which, if performed on a large data set can completely bring the server to its knees. To avoid this, additional logic was built into the BrAPIRepostitory impl to prevent these kinds of situations from arriving entirely by doing a max count first before fetching the query to compare the pageSize requested and the page number requested to the amount of available data. These situations were populated up the stack for all service methods that call them to send back 400 responses when they occur with information on how to avoid these situations (this explains how many files were touched). Additional optimizatons were made here to short-circuit the query lookup code if max count produced 0 results to avoid unnecessary long-running queries.
To totally eliminated the Could not prepare SQL statement error, we needed a way to completely refuse lookups that could produce more than 65k sql params, as this is the limit. These occurred mostly in my testing when I tried to paginate germplasms large than 65k records, because in order to fetch these records we need to pass IDs of found records in inital query to later join-fetch queries. I suppose there are other ways around this, like we could break up the queries into more queries, but the right solution feels like to incentivize the requester to actually request data from the search endpoint in a meaningful and more performant way. That is, we have configured a way to control the maximum allowable page size for page requests on the server. For now, it is 65k, and this applies to all entities, not just Germplasm. Specifically for BI, this will be a problem for the cache, which we have addressed for the germplasm entity but not for other larger entities they might have, like observations and observation units. I may revert this commit and put it somewhere else separately if it is a problem loading a cache for large datasets, or I might add to this body of code.

If not specified, this sort will be used to keep the endpoints idempotent.

…tion Added utility methods to SearchQueryBuilder and BrAPIRepositoryImpl to allow for proper paginating for hibernate fetch queries that don't suffocate memory. Also added methods to run queries without pagination entirely using the SearchQueryBuilder to prevent the use of pagination when it's not required, an issue that specifically had to be addressed for the BI cache, but one that introduced code that is reusable for other use cases. Modified the GermplasmApiController's searchGermplasmPost endpoint to accomodate two code paths: - One where no page and pageSize are supplied. In this scenario the code will grab all germplasm without the use of pagination. Good for large data grabs, but gets dangerous with excessively large amounts of data. This is entirely to meet BI's current use case, which we have strongly advised they move off of. - When page and/or pageSize are supplied, paginate as requested, default page size of 1000 if not requested.

…errors

Additionally make these configurable vars consistent and usable across BrAPIController and PagingUtility, which both utilize them.

jloux-brapi · 2025-03-24T16:07:06Z

Closing in favor of the MR for the BI fork. When this is merged I will merge these changes into this repo.

jloux-brapi added 7 commits March 14, 2025 14:36

Prevent pagination from occurring while join fetching on germs

7c67aac

Add SecurityUtils

8366f0c

Add default sort to all SearchQueryBuilder queries on entity id.

5bd7ca6

If not specified, this sort will be used to keep the endpoints idempotent.

Fix paging response metadata bug, patch UUID issue

64c1629

Add error handling for bad pagination RQs to avoid giant queries and …

e00e0f2

…errors

Add configurable default page size and max allowed page size

d7c0478

Additionally make these configurable vars consistent and usable across BrAPIController and PagingUtility, which both utilize them.

jloux-brapi mentioned this pull request Mar 17, 2025

[BI-2578][BI-2489] - Optimize BrAPI Germplasm Search Breeding-Insight/bi-api#447

Merged

jloux-brapi assigned BrapiCoordinatorSelby and mlm483 Mar 18, 2025

jloux-brapi closed this Mar 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Germplasm search optimizations #7

Germplasm search optimizations #7

Uh oh!

jloux-brapi commented Mar 17, 2025

Uh oh!

jloux-brapi commented Mar 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Germplasm search optimizations #7

Germplasm search optimizations #7

Uh oh!

Conversation

jloux-brapi commented Mar 17, 2025

Uh oh!

jloux-brapi commented Mar 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants