Correctly close database connection after initialization #179
Conversation
Force-pushed from 253c0b8 to 7c11ebf
Codecov Report. Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #179      +/-   ##
==========================================
- Coverage   99.90%   99.57%   -0.33%
==========================================
  Files          10       10
  Lines        2099     2118      +19
==========================================
+ Hits         2097     2109      +12
- Misses          2        9       +7
```

View full report in Codecov by Sentry.
Force-pushed from 911d210 to 6f1e2e2
Force-pushed from b1f67e2 to 3474a15
Force-pushed from e082ce2 to 21a7c2c
Force-pushed from 21a7c2c to 2ecde9c
Two comments:
- Using `_operation_session` and `_container_session` to distinguish the two seems to work, but it increases the complexity. Why not a function `init_container` that creates the container and handles the DB within the function scope?
- Please also update the docs: when using the container, a context manager is required.
Again, I think there is some design flaw in the API and resource management. In my Rust implementation, rsdos, these issues simply never arise in the first place. I'd strongly argue for considering giving it a try at some point. @giovannipizzi
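The suggested `init_container` pattern could be sketched like this (a minimal illustration with hypothetical names — `init_container`, the `index.sqlite` filename — not the actual disk-objectstore API): the DB connection exists only inside the function scope, so it cannot outlive initialisation.

```python
import sqlite3
import tempfile
from pathlib import Path


def init_container(folder: str) -> None:
    """Hypothetical helper: create the container layout and initialise its
    index DB, closing the connection before returning."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(root / "index.sqlite")
    try:
        # The session exists only within this scope; no caller can leak it.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS db_object (hashkey TEXT PRIMARY KEY)"
        )
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    init_container(tmp)  # the connection is already closed on return
    assert (Path(tmp) / "index.sqlite").exists()
```

Because the connection is closed in the `finally` block, there is no session object left for the caller to track (or forget to close), sidestepping the `_container_session` bookkeeping entirely.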
That would require a more significant refactor, as the Container is not enforced to be used within a context. That means almost every test in …
The usage of the Container has not changed, so using the context manager is optional, as it was before. Whatever is in the docs is still valid.
I agree, this PR is more a temporary fix than a real fix of the underlying problem: the container is not well designed to manage resources correctly (acquire and free them). I think any usage that opens a DB connection or leaves file handles open should be enforced to be within a context. Any function that leaves a resource open should not be a public function.
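A sketch of what enforcing context-managed usage could look like (a hypothetical simplified `Container`, with sqlite3 standing in for the real session machinery — not the actual class): every public operation refuses to run outside a `with` block, so no resource can stay open past `__exit__`.

```python
import sqlite3
import tempfile
from pathlib import Path


class Container:
    """Hypothetical sketch: resource-holding operations are only legal
    inside a 'with' block, so nothing can stay open after __exit__."""

    def __init__(self, folder: str) -> None:
        self._folder = Path(folder)
        self._session = None  # opened in __enter__, closed in __exit__

    def __enter__(self):
        self._folder.mkdir(parents=True, exist_ok=True)
        self._session = sqlite3.connect(self._folder / "index.sqlite")
        self._session.execute(
            "CREATE TABLE IF NOT EXISTS db_object (hashkey TEXT PRIMARY KEY)"
        )
        return self

    def __exit__(self, exc_type, exc, tb):
        self._session.close()
        self._session = None

    def count_objects(self) -> int:
        # Public API guards itself: no open context, no operation.
        if self._session is None:
            raise RuntimeError("Container must be used inside a 'with' block")
        return self._session.execute(
            "SELECT COUNT(*) FROM db_object"
        ).fetchone()[0]
```

With this design a leaked session is impossible by construction; the cost, as noted above, is that every existing call site (and test) would need to be wrapped in a `with` statement.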
Checked the docs: they never mention closing the container... I will update ...
Force-pushed from 2ecde9c to 7fe06d7
Yes, I think it is mandatory to close the resource after using it; otherwise resources will certainly leak. I encountered quite a lot of "too many open files" errors in this Python implementation when benchmarking.
Makes sense. Let's just do another workaround :(
@giovannipizzi regarding the benchmarks, this PR seems to make it slightly slower, but nothing very significant.
Maybe this fix also solves issue #178.
The pack-write benchmark also exposes another problem. From the results @agoscinski showed, the min/max times of this benchmark are similar, but they should not be, since the benchmarked function runs multiple times. The reason is that when adding the same content to the pack, the code does not check for duplicates but writes it anyway. (Maybe there is some consideration behind going this way, but from the performance side it is not good if you want to write directly to the pack; or maybe I am wrong and there is a check on the hash somewhere before writing to the pack?) This problem is out of the scope of this PR; I just mention it since I encountered it in my implementation.

```python
import hashlib

import pytest


@pytest.mark.benchmark(group="write", min_rounds=3)
def test_pack_write(temp_container, benchmark):
    """Add 10'000 objects to the container in packed form, and benchmark write and read speed."""
    num_files = 10000
    data_content = [str(i).encode("ascii") for i in range(num_files)]
    expected_hashkeys = [
        hashlib.sha256(content).hexdigest() for content in data_content
    ]
    hashkeys = benchmark(
        temp_container.add_objects_to_pack, data_content, compress=False
    )
    assert len(hashkeys) == len(data_content)
    assert expected_hashkeys == hashkeys
```
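The duplication check being discussed could look like this (a toy in-memory sketch with a hypothetical `TinyPack` class, not the disk-objectstore code): look the hash up in the index before appending the bytes, so repeated writes of identical content become no-ops.

```python
import hashlib


class TinyPack:
    """Toy in-memory pack: an append-only byte buffer plus a hash index."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._index = {}  # hashkey -> (offset, length)

    def add(self, content: bytes) -> str:
        hashkey = hashlib.sha256(content).hexdigest()
        if hashkey in self._index:  # duplicate: skip the write entirely
            return hashkey
        self._index[hashkey] = (len(self._data), len(content))
        self._data.extend(content)
        return hashkey


pack = TinyPack()
k1 = pack.add(b"hello")
k2 = pack.add(b"hello")  # second add of identical content
assert k1 == k2
assert len(pack._data) == len(b"hello")  # stored only once
```

The trade-off is that the full content (and thus its hash) must be known before writing — trivial for in-memory bytes, but exactly the difficulty for a one-pass stream.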
My design consideration here was: if you write directly to packs and want a truly streaming approach, then after reading the stream once to compute the hash it's not obvious you can seek back and read it again to actually store it, so I don't think there is an efficient way to do it. An alternative would be to stream to the sandbox first anyway, and then copy to the pack — but then you would write most of the data twice... I don't remember how much I benchmarked this, and of course it depends on how much duplication you expect, but my intuition was (and is) that it's better to store everything and then do a single full repack at the end; this is the fastest option.
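The one-pass constraint can be illustrated with a toy sketch (a hypothetical `add_streamed` helper over an in-memory `bytearray` pack, not the real implementation): the hash is computed while appending, and if the object turns out to be a duplicate the append must be undone. In a real append-only pack file such a rollback is not cheap, which is why storing everything and repacking once at the end is attractive.

```python
import hashlib
import io


def add_streamed(stream, pack: bytearray, index: dict) -> str:
    """One-pass write: hash while appending. If the object turns out to be
    a duplicate, the append must be rolled back — the cost discussed above."""
    start = len(pack)
    hasher = hashlib.sha256()
    for chunk in iter(lambda: stream.read(8192), b""):
        hasher.update(chunk)
        pack.extend(chunk)  # written before the full hash is known
    hashkey = hasher.hexdigest()
    if hashkey in index:  # duplicate: undo the speculative write
        del pack[start:]
    else:
        index[hashkey] = (start, len(pack) - start)
    return hashkey
```

The `del pack[start:]` rollback only works here because the toy pack is the last thing written; with concurrent writers or an on-disk pack you would instead leave the duplicate bytes in place and reclaim them during repacking.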
For loose-to-pack this is not a problem, since the hash is already there. For writing directly to the pack, I think it would be a good feature to have an option key that supports writing to the sandbox first. Users could then choose between a more disk-space-optimised solution and a more performance-oriented one.
The session created during initialisation of the container was never properly closed. Up to Python 3.12 this unclosed session was garbage collected, since it was unreferenced. With Python 3.13, however, these sessions are no longer garbage collected and thus remain open, resulting in an open file descriptor on the `pack.idx` for each initialisation of the container. This commit fixes it by keeping track of the session that initialises the container in `_container_session`. We rename `_session` to `_operation_session` for a clearer distinction between the two session types.
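The bookkeeping described in this commit message could be sketched as follows (simplified, with sqlite3 standing in for the real session objects and a hypothetical `index.sqlite` filename): the initialisation session is stored on the instance so it can be closed explicitly, instead of relying on garbage collection that no longer happens under Python 3.13.

```python
import sqlite3
import tempfile
from pathlib import Path


class Container:
    """Sketch of the session bookkeeping described in the commit message."""

    def __init__(self, folder: str) -> None:
        path = Path(folder)
        path.mkdir(parents=True, exist_ok=True)
        # Keep a reference to the session opened at initialisation, so it
        # can be closed explicitly instead of relying on the GC.
        self._container_session = sqlite3.connect(path / "index.sqlite")
        self._operation_session = None  # opened lazily for regular operations

    def close(self) -> None:
        """Release every session this container holds."""
        if self._operation_session is not None:
            self._operation_session.close()
            self._operation_session = None
        self._container_session.close()
```

Without the `_container_session` attribute, the initialisation connection would be unreferenced the moment `__init__` returns — exactly the file descriptor that stays open on Python 3.13.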
Force-pushed from 7fe06d7 to 5ea4ab2
@nmounet and @giovannipizzi agreed on merging, so I bypass the approval.
No description provided.