List strings without dedupe. #233

Open
StevenCurran opened this issue Jan 27, 2025 · 8 comments

Comments

@StevenCurran

Hi there, I was trying to use the tool to determine whether our application would benefit from string dedupe. Looking at a dump with -L shows a lot of strings, but it seems to be deduping them when printing. Can you confirm this is the case? If so, it would perhaps be useful to have a way to run without deduplication, to determine whether running the JVM with string dedupe enabled would help. Many thanks.

@agourlay
Owner

Hey thanks for your interest in hprof-slurp 👋

You are correct, the strings are deduplicated.

When parsing the test HPROF dumps I own, strings are defined only once.

TBH I am not sure how to provide an instance count per string.
Is this something other profiling tools are providing?

@StevenCurran
Author

The tool is fantastic btw, it has been a great help on large hprof files that usually kill my local IntelliJ.

I guess I'm wondering whether you dedupe in your tool. I'm not a Rust dev, but looking at your code you're parsing them into a Vec and not a set, so does that mean they are already deduped in the hprof file itself?

If they are duplicated in the hprof file, then even if they were just printed out multiple times in the slurp output, I could pass that to uniq to see the counts of dupe strings. But if they're not even duplicated in the hprof file, then that clearly wouldn't be possible. I'm not familiar at all with the hprof format, however.

@agourlay
Owner

agourlay commented Jan 28, 2025

Thanks for the kind words, happy you are able to analyze large hprof files 👍

AFAIK the strings are not duplicated in the hprof format.

I am accumulating those into a hashmap string_id -> string_value.

It may be possible to compute the number of instances per string based on the full instance graph, but it would require two passes over the dump file.
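For illustration only, the two-pass idea could be sketched as below. This is a simplified, hypothetical model and not hprof-slurp's code or the real HPROF record layout: the class `TwoPassStringCount`, its methods, and the flat arrays of ids are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model: NOT the real HPROF record layout or hprof-slurp's code.
class TwoPassStringCount {

    // Pass 1: accumulate string_id -> string_value.
    static Map<Long, String> collectStrings(long[] ids, String[] values) {
        Map<Long, String> byId = new HashMap<>();
        for (int i = 0; i < ids.length; i++) {
            byId.put(ids[i], values[i]);
        }
        return byId;
    }

    // Pass 2: walk the (assumed) list of referenced ids and count hits per string value.
    // Ids that were not seen in pass 1 are ignored.
    static Map<String, Long> countReferences(Map<Long, String> byId, long[] referencedIds) {
        Map<String, Long> counts = new HashMap<>();
        for (long id : referencedIds) {
            String value = byId.get(id);
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, String> byId = collectStrings(new long[] {1L, 2L}, new String[] {"foo", "bar"});
        Map<String, Long> counts = countReferences(byId, new long[] {1L, 1L, 2L, 1L, 99L});
        System.out.println("foo=" + counts.get("foo") + " bar=" + counts.get("bar"));
    }
}
```

The second pass is needed in this model because references can only be resolved once every id has been collected.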

@StevenCurran
Author

StevenCurran commented Jan 29, 2025

got it, how would the second pass help if they are using the same object ids?

I wonder what a heap dump would look like with say 10 dupe strings (but different string instances in code) with "-XX:+UseStringDeduplication" vs "-XX:-UseStringDeduplication"

I guess I would have thought the ids would be shared in the case where string dedupe is enabled, but not where it is disabled?
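A minimal sketch of a program that could be used for that experiment, assuming a hypothetical class name `DistinctEqualStrings`. It only builds ten String objects that are equal by value but distinct by identity; keeping the process alive and taking the heap dump under each flag setting is left out. As far as JEP 192 describes it, string deduplication makes equal String objects share their backing array while the String objects themselves stay separate, so the dump would need to be checked to see what actually changes.

```java
// Hypothetical test program, not part of hprof-slurp.
class DistinctEqualStrings {

    // Build `count` String objects with equal contents but distinct identities.
    static String[] makeDuplicates(String value, int count) {
        String[] out = new String[count];
        for (int i = 0; i < count; i++) {
            out[i] = new String(value); // always allocates a new String object
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dupes = makeDuplicates("duplicate-me", 10);
        System.out.println(dupes[0].equals(dupes[1])); // equal by value
        System.out.println(dupes[0] == dupes[1]);      // but not the same object
    }
}
```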

@agourlay
Owner

> how would the second pass help if they are using the same object ids?

In that case it would not help either 👍

> I guess I would have thought the ids would be shared in the case where string dedupe is enabled, but not where it is disabled?

This sounds like a logical explanation.
If you send me two test hprof files with and without the dedup option enabled, I could double check if the string definition changes.

Sound good to you?

@agourlay
Owner

agourlay commented Feb 1, 2025

I ended up cutting a new release, 0.5.4, with a basic String duplication detector.

This will find Strings with the same value but registered under different ids.
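A rough sketch of what detecting "same value, different ids" can look like, assuming a hypothetical `DuplicateStringFinder` class working on an in-memory `string_id -> string_value` map. It is not the detector shipped in 0.5.4, only an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the implementation released in hprof-slurp.
class DuplicateStringFinder {

    // Invert id -> value into value -> ids, then keep only the values
    // that are registered under more than one id.
    static Map<String, List<Long>> findDuplicates(Map<Long, String> byId) {
        Map<String, List<Long>> idsByValue = new TreeMap<>();
        for (Map.Entry<Long, String> e : byId.entrySet()) {
            idsByValue.computeIfAbsent(e.getValue(), v -> new ArrayList<>()).add(e.getKey());
        }
        idsByValue.values().removeIf(ids -> ids.size() < 2);
        return idsByValue;
    }

    public static void main(String[] args) {
        Map<Long, String> byId = Map.of(1L, "foo", 2L, "bar", 3L, "foo", 4L, "foo");
        // Only "foo" is registered under several ids.
        System.out.println(findDuplicates(byId).keySet());
    }
}
```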

@StevenCurran
Author

StevenCurran commented Feb 1, 2025

oh amazing, sorry i was not able to get you the heap dumps before then. Let me give it a try on some real heaps though. Many thanks for this!

EDIT:
Took a look at the PR, is there any way that we could print out the string with the count instead of just the number of dupe entries?

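        // Requires java.util.List, java.util.Map, java.util.function.Function and java.util.stream.Collectors.
        List<String> names = List.of("Liam", "Sophia", "Noah", "Liam", "Sophia", "Emma");

        // Count the occurrences of each distinct value: Liam=2, Sophia=2, Noah=1, Emma=1.
        Map<String, Long> nameCount = names.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

Similar to this in Java.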

@agourlay
Owner

agourlay commented Feb 2, 2025

> oh amazing, sorry i was not able to get you the heap dumps before then. Let me give it a try on some real heaps though.

No worries, I figured it would be really inconvenient to send large heap dump files over the Internet anyway.

> Took a look at the PR, is there any way that we could print out the string with the count instead of just the number of dupe entries?

Yes, the specifics of the report can be changed easily.
My goal was first to validate the hypothesis that duplicated strings can be detected by value.
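One possible shape for such a report, sketched under assumptions: the class `DuplicateReport` and the `count value` line format (similar to what `uniq -c` prints) are hypothetical and do not reflect hprof-slurp's actual output.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical report format, not hprof-slurp's actual output.
class DuplicateReport {

    // Returns one "count value" line per duplicated value,
    // most frequent first, ties broken alphabetically.
    static List<String> report(Collection<String> values) {
        Map<String, Long> counts = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .map(e -> e.getValue() + " " + e.getKey())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of("Liam", "Sophia", "Noah", "Liam", "Sophia", "Emma", "Liam");
        report(names).forEach(System.out::println);
    }
}
```

Values that appear only once are dropped, so the report stays short on heaps where most strings are unique.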
