Auto File Management Part 1: Introducing a Datafile Management Tool #235
Open: mabruzzo wants to merge 11 commits into grackle-project:main from mabruzzo:auto-data
+2,872 −40
I also added documentation and integrated the tool into the testing framework.
We plan to eventually install the grdata tool as a standalone command line program. Essentially, the build system will perform some substitutions (the CMake build system uses CMake's built-in ``configure_file`` command, while the classic build system uses the analogous ``configure_file.py`` script).
This commit introduces a few minor tweaks to grdata.py so that it can more easily be consumed by the ``configure_file.py`` script.
- The ``configure_file.py`` script itself will ultimately require a few more tweaks so that it doesn't report occurrences of Python's decorator syntax as errors.
- However, this commit minimizes the number of required changes.
This PR introduces a datafile management tool. A lot of this information is also stated in the documentation.
Brief Description
Currently, this management tool is only accessible through `pygrackle`. To invoke it, you would pass a `COMMAND`, where `COMMAND` is usually one of the following:
- `fetch`: fetch all data files
- `ls-versions`: list the Grackle versions for which we have local copies of the data files
- `rm`: remove the data files associated with a version (or all data files)
- `getpath`: print the path to the directory containing all of the Grackle data
- `calcreg`: calculate an updated registry of file names and checksums to be used in future versions of this tool (basically, you would call it any time you change the data files)

The ultimate goal is to make it possible for the Grackle library itself to directly access these files.[1] I'm happy to talk a little more about what this might look like down below.
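As a rough illustration of what a `calcreg`-style registry could contain, the sketch below maps each file name in a directory to its SHA-1 checksum. The function names (`sha1_hexdigest`, `calc_registry`) and the dictionary layout are assumptions made for illustration; the registry format the tool actually writes is not shown in this description.

```python
import hashlib
import os


def sha1_hexdigest(path, chunk_size=65536):
    """Return the SHA-1 checksum of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def calc_registry(data_dir):
    """Map each regular file directly inside data_dir to its checksum.

    Hypothetical helper: the real registry layout may differ.
    """
    registry = {}
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):  # skip subdirectories
            registry[name] = sha1_hexdigest(path)
    return registry
```

Rerunning such a function after any change to the data files would yield new checksums for the changed files, which matches the advice to call `calcreg` whenever the data files change.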
Motivation
Why does this tool exist? Datafiles are required by ANY non-trivial program (e.g. a simulation-code or python script) that invokes Grackle.
It is instructive to consider the historic experience of an end-user of one of these programs. To build Grackle, they would typically clone the git repository for Grackle (including the data files). To invoke their program, they would manually specify the path to the downloaded data file. Frankly, this doesn't seem so bad; the manual intervention is a minor inconvenience, at worst. While it would be nice to eliminate the manual intervention, it doesn't seem to warrant development of a special tool.
Indeed, this is all true. Users who like this workflow can continue using it. However, this manual management of datafiles becomes problematic in any use-case that is marginally more complex. There are 3 considerations worth highlighting:
1. Portability: Currently, there is no out-of-the-box way for a program using Grackle that was configured to run on one computer to run on another machine without manual intervention.
   - If there are differences in how the machines are set up (e.g. where the data files are placed), the paths to the Grackle data file(s) need to be updated. This is relevant if you want to use a Pygrackle script on a different machine or if you want to use a configuration script to rerun a simulation (involving Grackle) on a different machine.
   - This is particularly noteworthy when it comes to automated testing!
     - For example, Pygrackle currently assumes that it was installed as an editable installation in order to run some of the test suite. After this PR, this is no longer a necessary assumption. The introduction of this tool also makes it easier to run the python examples.[2]
     - The test suite of Enzo-E is another example where extra bookkeeping is required for all test problems that invoke Grackle.
2. If the Grackle repository isn't present:
   - This includes the case where a user deletes the repository after installing Grackle. (We don't really care about this case.)
   - It is more important to consider the case where users are installing programs that use Grackle without downloading the repository (or, even if the repository is downloaded, it is done without the user's knowledge). This will become increasingly common as we make Pygrackle easier to install.[3] This is also plausible for CMake builds of downstream projects that embed Grackle compilation as part of their build.
3. Having multiple Grackle versions installed: This is going to be increasingly common as we make Pygrackle easier to install. Users have 2 existing options in this case: (i) they maintain separate repositories of data files for each version or (ii) they assume that they can just use the newest version of the data-file repository. The latter assumption has historically held (and will probably continue to hold). But it could conceivably lead to cases where people unintentionally use a data file created for a newer version of Grackle. (While this likely won't be a problem, users should probably be explicitly aware that they are doing this, on the off-chance that problems do arise.)
This tool is a first step to addressing these cases.
Currently the tool just works for Pygrackle. But, as I noted before, my plan is to introduce functionality to let the Grackle C layer take advantage of how this tool organizes data files.
How it works
Fundamentally, the data management system manages a data store. We will return to that in a moment.
Protocol Version
This internal logic has an associated protocol version (you can query it via the `--version-protocol` flag). The logic may change between protocol versions. The protocol version will change very rarely (if it ever changes at all).
Data Directory
This is simply the data directory that includes all Grackle data. Its path is given by the `GRACKLE_DATA_DIR` environment variable, if it exists. Otherwise, it defaults to the operating system's recommendation for user-site-data.
This contains several entries, including:
- a user-data directory. This directory isn't used yet, but it is reserved for users to put custom data files in the future.
- a tmp directory (used by the data-management tool)
- sometimes, a lockfile (used to ensure that multiple instances of this tool aren't running at once)
- the data store directory(ies). Each is named `data-store-v<PROTOCOL-VERSION>` so that earlier versions of this tool will continue to function if we ever change the protocol. (Each of these directories is completely independent of the others.)

Outside of the user-data directory, users should not modify/create/delete any files within the Data Directory (unless the tool instructs them to).
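The lookup order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's own `_get_data_dir`: the function names, the `grackle` subdirectory name, the protocol-version value, and the per-platform fallbacks (XDG conventions on Linux, `~/Library/Application Support` on macOS, no Windows handling) are all hypothetical.

```python
import os
import sys

PROTOCOL_VERSION = 1  # hypothetical value, for illustration only


def guess_data_dir(env=None, platform=None):
    """Return the data directory: env-var override first, OS default second."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform

    override = env.get("GRACKLE_DATA_DIR")
    if override:  # assumption: an empty value is treated as unset
        return override

    home = env.get("HOME") or os.path.expanduser("~")
    if platform == "darwin":
        base = os.path.join(home, "Library", "Application Support")
    else:
        # XDG Base Directory convention used on most Linux systems
        base = env.get("XDG_DATA_HOME") or os.path.join(home, ".local", "share")
    return os.path.join(base, "grackle")  # subdirectory name is assumed


def data_store_dir(data_dir, protocol_version=PROTOCOL_VERSION):
    """Build the versioned data-store path inside the data directory."""
    return os.path.join(data_dir, f"data-store-v{protocol_version}")
```

Keeping the protocol version in the directory name means that a tool using a different protocol would simply look in a different directory, which is how older versions of the tool can keep working.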
Data Store
This is where we track the data files managed by this system. This holds a directory called object-store and 1 or more "version-directories".
- The primary representation of each file is tracked within the `object-store` subdirectory.
  - The name of each item in this directory is a unique key. This key is the file's SHA-1 checksum.
  - Git internally tracks objects in a very similar way (it has historically used SHA-1 checksums as unique keys). The chance of an accidental checksum collision in a large Git repository is extremely small.[4]
- Each version-directory is named after a Grackle version (NOT a Pygrackle version).
  - It holds references to files in the `object-store`.
  - When a program outside of this tool accesses a data file, it will ONLY access the references in the version-directory that shares its name with the version of Grackle that the program is linked against.
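
To make the checksum-as-key idea concrete, here is a minimal sketch of adding a file to an object store. The helper names and the choice to copy the file are assumptions for illustration; the tool's own implementation may fetch, verify, and place files differently.

```python
import hashlib
import os
import shutil


def file_sha1(path):
    """Return the SHA-1 checksum of a file (this serves as its key)."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_object_store(src_path, object_store_dir):
    """Store src_path under its checksum; return the stored path.

    Hypothetical helper: files with identical contents share one entry.
    """
    os.makedirs(object_store_dir, exist_ok=True)
    dest = os.path.join(object_store_dir, file_sha1(src_path))
    if not os.path.exists(dest):  # identical content is stored only once
        shutil.copyfile(src_path, dest)
    return dest
```

Because the key depends only on the contents, a data file that is unchanged between two Grackle versions maps to the same entry no matter what it is called in each version.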
This tool makes use of references and the `object-store` to effectively deduplicate data. Whenever this tool deletes a "data-file" reference, it will also delete the corresponding file from the `object-store` if it has no other references. I chose to implement references as "hard links" in order to make it easy to determine when a file in `object-store` has no references.
Closing Thoughts
I'm totally open to any feedback!
A lot of the complexity here comes from the fact that I made this tool deduplicate files. For reference, we currently ship 25 MB of datafiles with Grackle. That starts to add up if you have 3 or 4 copies of the files floating around (presumably, the number of datafiles that we ship will only increase in the future). If you don't think it's necessary, we can drop this functionality.
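
For a sense of how the hard-link approach keeps the bookkeeping small, here is a minimal sketch. The function names are hypothetical, and it assumes a filesystem that supports hard links and reports link counts through `os.stat`; it is not the tool's actual code.

```python
import os


def link_reference(object_path, ref_path):
    """Create a data-file reference as a hard link to a stored object."""
    os.makedirs(os.path.dirname(ref_path), exist_ok=True)
    os.link(object_path, ref_path)


def remove_reference(ref_path, object_path):
    """Delete a reference; also delete the object if nothing else links to it.

    Returns True if the object itself was removed.
    """
    os.remove(ref_path)
    # A link count of 1 means only the object-store entry itself remains.
    if os.stat(object_path).st_nlink == 1:
        os.remove(object_path)
        return True
    return False
```

The link count maintained by the filesystem acts as the reference counter, so no separate index of "which versions use which file" has to be stored or kept consistent.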
One thing to think about is the choice of hash algorithm. Currently we use SHA-1. But maybe we should use SHA-256 or something else instead? (The `git` developers plan to transition `git` to SHA-256 since it is cryptographically secure.)
One relevant point here is the functionality I plan to add to the Grackle library, which involves `data_file_handling`:
- One value will have Grackle treat `grackle_data_file` as it always has.
- Another value (`1`) will instruct Grackle to search for the file specified by `grackle_data_file` within the directories managed by this grdata tool.

Footnotes
[1] I actually got quite far and plan to introduce that as a followup PR, but I underestimated just how many lines of code that would take (specifically, writing a C version of the 15 line `_get_data_dir` function turned into a few hundred lines).
[2] In the future, I hope to also make it easier to run the code examples.
[3] For example, once #208 (Transition `Pygrackle` from `setuptools` to `scikit-build-core`) is merged, you will be able to instruct pip to install pygrackle by just specifying the URL of the GitHub repository.
[4] It was only 10 or 12 years after Git was created that the developers started worrying about collisions (and they are primarily concerned with intentional collisions from malicious actors).