Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP geodetic distance search #1086

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Conversation

leoger
Copy link
Contributor

@leoger leoger commented Dec 14, 2024

Why

With some additional work, this branch is intended to fix issue #1079. In its current state, there is still an unresolved issue where both NearFilter and WithinFilter actually just checking the minimal bounding box and never test the actual geometry.

My intended next step is to discuss the best path forward. I have left detailed notes in the form of a comment in the code with my understanding of options for expanding the fix.

What

  • Added NearFilter unit test with real lat-long data. The current implementation of NearFilter doesn't return the expected results.
  • Added a dependency on net.sf.geographiclib : GeographicLib-Java : 2.0 in nitrite-spatial/pom.xml
  • Added an alternative "geometry factory" that creates a geodetic "small circle" instead of a cartesian circle as the existing JTS geometry factory does.
    • I'm happy to refactor this in any number of ways. This current implementation is just what I felt was the most straightforward replacement for existing functionality, serving as a proof of concept without forcing a ripple of changed type signatures.
  • Added another unit test to check the behavior of WithinFilter once I stepped through in the debugger and saw that SpatialIndex.java is mistakenly treating the filter work as done after the initial RTree check. Sure enough, the same problem occurs for WithinFilter. I added an ASCII diagram in a comment within the new unit test, testWithinTriangleNotJustTestingBoundingBox.

Added NearFilter unit test with real lat-long data. The current implementation of NearFilter doesn't return the expected results.

See this shared custom map on Google Maps for an illustration of the Oslo test case: https://www.google.com/maps/d/viewer?mid=1WXlEa5nBOSvBej3HSUNhsh_LLahoab8
Copy link
Contributor

coderabbitai bot commented Dec 14, 2024

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@@ -0,0 +1,45 @@
package org.locationtech.jts.util;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use nitrite package name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought this needed to be in the same package as the existing GeometricShapeFactory in order to access the protected fields, but I see now that I was misremembering my Java inheritance visibility rules! 😄 I did a quick refactor and it works just fine. Pushing an update...

@anidotnet
Copy link
Contributor

Your changes looks great to me.

@leoger
Copy link
Contributor Author

leoger commented Dec 14, 2024

Thanks for the quick feedback Anindya!

Since these changes don't actually work yet, due to the bounding box problem, please advise how you'd recommend proceeding. There seem to be some simple options, e.g. we can use the set of NitriteIds that come back from SpatialIndex::findNitriteIds to look up the exact geometry and do the exact test as a second pass...

My concern here is that I don't want to start pulling threads and turning this into a larger change because I haven't taken the time to understand the philosophy of how you split the work of the query up across these different layers. (Hence why this PR is a Draft.)

@anidotnet
Copy link
Contributor

anidotnet commented Dec 14, 2024

After a quick review, I see one way we can move forward:

  1. Change the NitriteRTree interface and add Geometry as the argument in add, remove and find* methods.
  2. Store the Geometry along with NitriteId as the value of the RTree
  3. Filter by BoundingBox as it happens today. We will now get Geometry along with NitriteIds. Here we can do finer filtering and return the exact NitriteIds corresponding to the right Geometry values.

If you find a better way don't hesitate to share also.

@leoger
Copy link
Contributor Author

leoger commented Dec 14, 2024

Thanks, I'll give it a try and see how the code shapes up!

@leoger
Copy link
Contributor Author

leoger commented Dec 14, 2024

Also, how much weight should we be putting on avoiding breaking interface changes? Theoretically, someone out there may have taken a dependency on Nitrite 4.x and created their own implementation class for the NitriteRTree interface.

We could just mark the existing methods as @Deprecated and add additional methods. That way bumping the Nitrite dependency version from 4.3.0 to 4.4.0 would produce warnings rather than compile errors. :-\

@anidotnet
Copy link
Contributor

Ofcourse we should keep them backward compatible for next minor version upgrade and create overloaded methods to add Geometry as argument. And please put proper warning message in @Deprecated annotations.

@leoger
Copy link
Contributor Author

leoger commented Dec 15, 2024

As I started looking at the four implementations of NitriteRTree, and in particular the NitriteMVRTreeMap for the MVStore adapter and the RocksDBRTree for the RocksDB adapter, I realized that if we added Geometry to the signature of add and actually added it to the data within the RTreeMap, there are multiple negatives, which I'll enumerate momentarily. (I haven't tested any of these yet, I'm just going on experience and intuition, so let me know if you see any of it differently.)

Considerations

Concrete technical considerations

If we add the Geometry data itself into the RTreeMap, then the actual bytes representing the (serialized) geometry will be written into either the RocksDBMap field backingMap or the MVRTreeMap field mvMap. Depending on the geometry, this could be a significant increase in either the in-memory or on-disk size of the DB.

More importantly, it is my understanding that this would be redundant data. The geometry data could be both in the main collection/repository map as well in the RTree map. (e.g. I see that both the openMap(...) and the openRTree(...) method in RocksDBStore.java contain calls to new RocksDBMap<>(...).) I expect this means that if I spent enough time looking through a hex-dump of a RocksDB Nitrite DB file, I would be able to spot two copies of each Points/Coordinates that makes up each geometry object. Beyond the simple desirability of using less RAM or storage on behalf of Nitrite users, I would expect any such increase to come with its own CPU performance penalty in the form of increased rate of L1/L2 cache misses.

Intuitive considerations based on application of general software-design principles

I tried to peel the next layer of the onion to figure out how to judge whether this is an appropriate trade-off in context. This led me to (a) look at the implementation of SingleFieldIndex and CompoundIndex; and (b) think about what I know of how RDBMS's typically implement indexes. In both cases, I believe the pattern is that the additional data taken up by the index itself consists of only the covered fields and some primary key or equivalent. Furthermore, thinking about the unique design of the R-Tree data structure and associated algorithms, it feels like we'd actually be working "against the grain". I may not have deep experience specifically with designing DB indices and being bitten by their trade-offs first hand, but this feels very much like a situation where we need a stronger motivation to "swim upstream" than we currently have.

Admittedly, on the one hand, for every other type of Index we have, full data of the field being indexed becomes the content of the index alongside the ids. As such, there's an expectation that those indices would enable a wide range of filter operations because those operations would have access to the "complete" data of the indexed field. On the other hand, it seems that a spatial index is necessarily different by it's nature. The R-Tree is built on the idea that using only the bounding boxes this comes with significant benefits that outweigh its significant limitations. If we extend it, it's not really just an R-Tree anymore.

Conclusion

On balance, I think this adds up to strong reason to prefer the approach where we treat each WithinFilter( ..., geometry) as an and(within(boundingBox(geometry)), within(geometry)), where the first will always be powered by the index and the second will go through non-indexed apply.

Sorry for the novella-length analysis. 😄 I look forward to hearing what you think.

@anidotnet
Copy link
Contributor

Thanks for the detailed analysis. While filtering via bounding box (using the current algorithm) the resulting set of NitriteIds will always be a super set of the correct NitriteIds. So our job is - to further narrow down the set of NitriteIds we get from the index search by pulling the Geometry fields corresponding those ids and perform the calculations on those values in final pass and return the NitriteIds from there. In this approach you only need to work on the SpatialIndex.findNitriteIds without modifying the NitriteRTree interface.

@anidotnet
Copy link
Contributor

Hi any update on this or it has been abandoned?

@leoger
Copy link
Contributor Author

leoger commented Jan 13, 2025

Hi Anindya, thanks for checking in. Things got busy around the holidays at home but I'm planning to pick this back up next week. I appreciate your help and patience.

@leoger
Copy link
Contributor Author

leoger commented Jan 20, 2025

@anidotnet I've got a working implementation 🎉 but it only solves NearFilter for cases where fieldValue instanceof Point. I'll clean it up and post it to this branch tomorrow so you can take a look at the approach and give your feedback.

After reading ReadOperations and FindOptimizer carefully to understand the intention, I concluded that it make sense to usethe existing mechanism where AndFilter is flattened into its constituent plans. There are multiple ways to get multiple types of filter to share this behavior, and I think some of the nicest options involve breaking the existing type hierarchy. (e.g. NearFilter is not really an IndexOnlyFilter anymore, but we can make this a non-breaking change if we're willing to sacrifice some clarity of the inheritance graph...) Anyway, this will be more clear once you've got the code to refer to.

Cheers,
Leo

@anidotnet
Copy link
Contributor

While I am revisiting this discussion, I am thinking may be we should put a clear distinction between geo-spatial and spatial data/queries. A user while only dealing with spatial queries (like a surface in a game world) does not need to be concerned about geodesic, where as a user dealing with lat long data does. Let me know your thought on this.

@leoger
Copy link
Contributor Author

leoger commented Jan 21, 2025

While I am revisiting this discussion, I am thinking may be we should put a clear distinction between geo-spatial and spatial data/queries.

😄 This is precisely what I was going to suggest, as well. I discovered as I was modifying some of my unit tests that I was mixing up coordinate systems and this seems like the natural solution.

EDIT: I missed that you included "data". Yes, that too.

@leoger
Copy link
Contributor Author

leoger commented Jan 21, 2025

As for the new commit I pushed, I'd like to clarify that I changed a bunch of names but those are all just placeholders meant to highlight how I'm thinking about it in this intermediate state.

The important thing, as I see it, is the work so far has shown me we need to clarify:

  • Do we need to separate GeoSpatial from other Spatial operations? Should we add a GeoCoordinate type that extends Geometry and has get/set for latitude and longitude rather than x and y. I think this is just asking for trouble to leave it as x and y because there are conflicting standards in the GIS world for what order lat/lon go in, and the ISO standard actually goes opposite to how we normally think of x and y axes base on cartesian coordinates.
  • Is the intent of NearFilter actually "contains" or "intersects"? It seems to me that if we try to do NearFilter with generalized geometry, e.g. a triangle, then it's not immediately clear whether the query should return a triangle that has one corner within the given distance of the given point. Is that "near"? Is that distinct from "within"?
  • Should we break the type hierarchy so that Near, Within, and Intersects are no longer IndexOnlyFilters? If we keep them as IndexOnlyFilters and don't change the semantics of IndexOnlyFilter, it ends up being a little awkward that it's both a "flattenable" (aka "and-like") filter and needs to rely on another spatial filter that can't be "index only".
    • To get the unit tests to pass, I took a shortcut of making some strange changes to the type hierarchy, just to save myself from having to create a half-dozen new classes with the right characteristics. I figure we can discuss what's the right approach before making such a large change, and we can align on naming while we're at it.

I'm sure I've forgotten a couple of minor things, e.g. the fact that there's throw-away work happening to compute the polygon of the circle and it's not really being utilized. We can do some micro-benchmarking later to see if using the Inverse geodesic calculation is faster than relying on the JTS implementation of intersects or contains. For now we have a proof of concept that has illuminated design questions.

Thanks again. I look forward to your thoughts.

@anidotnet
Copy link
Contributor

The first and foremost things we have to do is - separate the filter/data for spatial and geo-spatial queries. We can create respective Geo-* filters to separate the concerns. Your idea of GeoCorodinate is brilliant. we should follow this path.

Regarding theNearFilter, we should decide based on some test results. If NitriteRTree.findContainedKeys gives correct results, then it should inherit WithinFilter otherwise inherit IntersectsFilter.

Regarding the hierarchy, I think we need a complete redesign of spatial index scanning - SpatialIndex.findNitriteIds() and make it inline with other index scanning - SingleFieldIndex and CompoundIndex. That would solve both this problem and as well as handling of 2 pass scanning algorithm.

@leoger
Copy link
Contributor Author

leoger commented Jan 24, 2025

1. Defining the behavior of NearFilter

Regarding theNearFilter, we should decide based on some test results.

Ah, but that's going to be difficult because as of right now, there is no definition of what is the correct behavior. Nothing in the documentation or in the existing unit tests tells what the expected behavior is for this scenario. Given the existing test geometry defined in BaseSpatialTest.java, the next nearest point would be 525 512 in the LINESTRING.

So, you tell me. When you wrote it, did you have some idea of what "near" means for non-point geometry or were you just expecting "near" would be used for point-only data-sets?

If we increase the search distance in the test case to 42, such that it catches one point of the LINESTRING, but not all of it, should "near" return object2 or not?

2. Clarification about redesign of spatial index scanning

... complete redesign of spatial index scanning - SpatialIndex.findNitriteIds() and make it inline with other index scanning - SingleFieldIndex and CompoundIndex. That would solve both this problem and as well as handling of 2 pass scanning algorithm.

I don't understand what you are suggesting here. In particular, what does "make it inline" mean? Can you spell out what would be different?

Are you suggesting that in ReadOperations::findSuitableStream(FindPlan),we change the general behavior like this?

  • CURRENT: any given FindPlan only has a single IndexDescriptor->NitriteIndexer->NitriteIndex and we call findNitriteIds() on that SINGLE index, and then FILTER that stream of ids using apply(Pair<...>) of the remaining filters in the plan
  • REDESIGN: any given FindPlan can leverage multiple indexes, each of which would return a separate RecordStream<...> and we could use something like a hash join to efficiently find the intersection of the RecordStreams. That minimized rawStream is only then passed into new FilteredStream(rawStream, findPlan.getCollectionScanFilter())?

@anidotnet
Copy link
Contributor

When you wrote it, did you have some idea of what "near" means for non-point geometry or were you just expecting "near" would be used for point-only data-sets?

My initial thought was with point only data, so I went with the mental picture of - "points within a circle", but when I think of it now, Intersects makes much more sense for any kind of geometry. The correct logic should be - if the circle intersects with any portion of the geometry, the value should be selected.

Clarification about redesign of spatial index scanning

Current spatial filter and its index scanning does not handle multiple spatial filters in and filter. There is a bug in the SpatialIndexTest.testAndSpatialQuery.

Are you suggesting that in ReadOperations::findSuitableStream(FindPlan),we change the general behavior like this?

I am not planning to change current ReadOperations. Instead, we need to change the spatial index scanning so that it can handle multiple spatial filter scanning. This should solve your initial problem also of fine tuning the result using 2 pass scanning.

@anidotnet
Copy link
Contributor

One way to solve the multiple spatial filters scanning or 2 pass scanning is to modify FindPlan supplied in public LinkedHashSet<NitriteId> findNitriteIds(FindPlan findPlan). We can add filters in the collectionScanFilter field of FindPlan; ReadOperation will take care of the rest.

@leoger
Copy link
Contributor Author

leoger commented Jan 30, 2025

... We can add filters in the collectionScanFilter field of FindPlan; ReadOperation will take care of the rest.

So you're suggesting modifying the FindPlan within findNitriteIds, which takes the FindPlan as input? I would generally be against modifying an input parameter without a compelling reason because it tends to make the code harder to debug and reason about.

Can you tell me more about why you would rather do it there than to change anything in ReadOperations? It sounds like you've got some reason, but it is not obvious to me from the outside.

@anidotnet
Copy link
Contributor

For general purpose filters, the plan is calculated before the query operation. For spatial filters, the plan is being calculated during the search operation as it progresses. As for modifying the input object, we can return a new type containing NitriteIds and FindPlan for findNitriteIds()

I am trying to avoid larger changes in the code, that's why I am skeptical about changing ReadOperations. And also this problem is not generic and very specific to spatial filters. But if you have some elegant solutions regarding ReadOperations to handle this scenario, I am all ears.

@leoger
Copy link
Contributor Author

leoger commented Jan 31, 2025

For spatial filters, the plan is being calculated during the search operation as it progresses.

Is that true? I had convinced myself that we already know what the plan will be ahead of time. i.e. as soon as we start optimizing the plan, we already know we can (1) break it down into an initial bounding box index-filter that only has access to SpatialKey, not the full Geometry object, and (2) a scanning filter on the Geometry field.

How is the plan "being calculated during the search operation"? I don't understand.


I am trying to avoid larger changes in the code, that's why I am skeptical about changing ReadOperations. And also this problem is not generic and very specific to spatial filters.

Okay, that makes sense. I have some intuition that this can apply more broadly, but it makes sense to keep it isolated until we actually have those other concrete use-cases at hand.


But if you have some elegant solutions regarding ReadOperations to handle this scenario ...

I don't know how elegant any of it would be, but it feels like now is a good time to take another iteration at the code itself rather than keep talking about the possibilities for the code. 😁 I'll try keeping it isolated in findNitriteIds to start and whatever I find in the process will give us some additional clarity about the problem...

@anidotnet
Copy link
Contributor

For spatial filters, the plan is being calculated during the search operation as it progresses.

My bad. It's a typo. It should be - the plan will be calculated, as we are discussing about modifying FindPlan.

@leoger
Copy link
Contributor Author

leoger commented Feb 9, 2025

FYI, my work has gotten busy so it might take another 2 weeks before I have more commits to share on this PR.

@anidotnet
Copy link
Contributor

Thanks for the update. No worries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants