-
-
Notifications
You must be signed in to change notification settings - Fork 369
Optionally disable mapset locking and cleaning #602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
(I took liberty to make the PR title more understandable) |
Thanks. There was a reason why this was marked as draft. In fact I'll keep it that way. It needs at least documentation. I'm not sure if I can create test for this. However, you can go ahead and have a look. |
All good but as far as I see a later editing of the title doesn't make it into the changelog. Only writing the PR title right from the beginning will help (I edited anyway, for the PR list). So, please please use a reasonable title when opening the PR. |
Sorry, meant to include a smiley face to the message :-) but about the changelog: The PR title is in no way the final title/subject line for the commit. Commit message can and should be edited when doing Squash and merge which is another advantage of Squash and merge I think. Besides tailoring it towards what is actually being merged/committed at the end, you can spell check it in your browser, and clean up the commit messages from individual commits on the branch which are usually messy. The problem is that GitHub suggestion for commit message title is based on the branch name or the only the commit if it is the only commit. It does not seem to me that a PR title plays any role in that. See, e.g., #544 which has no PR title edit, but the suggested commit message (if you click is Squash and merge) is a more raw version of the text coming from the commit. Hence, the only way how to get good commit titles/subject lines for changelog is editing it when doing Squash and merge unless we want to resort to suggesting just one commit per branch/PR and getting it right the first time on the command line like in the Subversion times. |
... ok, then all pls do that. The creation of https://trac.osgeo.org/grass/wiki/Release/7.8.3-News took too much of my lifetime. But that's OT here :-) For reviewing this PR, is it best to locally test it? |
If anybody can think about some cases (use cases or more importantly potential failures) which I did not consider in the description, that would great. I did not figure out a useful way of testing this with limited resources for automated tests in CI. The test in the description will take a lot of memory, but I had to use a lot of processes (1000 rather than 100) to be sure to get the errors without --no-clean consistently. |
Just out of curiosity: could you explain why this is needed ? mapset creation is so easy, and you can always access data from other mapsets, that I've always been able to avoid any locking issues by just working with temporary mapsets. |
@mlennert I have added documentation which may answer some of your questions. As for the comparison with the workflow involving many temporary mapsets: This is similar to
Here the only part which is here multiple times is the one which needs to be multiple times:
Isn't this more dangerous, less robust, and won't work for some workflows anyway (e.g., writing to vector attributes to SQLite)? Yes, but even now people can do this 1) within one session or 2) when setting up the session manually.
This partially relates to the splitting of a session into runtime needed to get programs running and connection to a mapset or more precisely this relates to separating the session concept into smaller pieces. |
This is already used for system tmp dir and gislock. Renames and better documents the mapset cleaning functions.
…gis env vars (gisrc)
This allows to skip locking of a mapset which in combination with no cleaning of a mapset allows for multiple jobs to be executed in one mapset using --exec. This should be the same as running multiple jobs together right in a single session (having the same limitations, but also not touching any general files). To account for the use case when the mapset is not used for any interactive session, a separate flag for cleaning is provided for convenience (same as something like --exec g.region -p). Similarly to --tmp-location and --tmp-mapset, usage of --no-lock and --no-clean without --exec is not disallowed, but it is not documented either before its usefulness is evaluated. One use case is starting an interactive session with --no-lock --no-clean to examine results of jobs running with these. The combining the new flags with the old ones and between each other is not limited except for few cases which clearly exclude each other (--clean-only means only cleaning and no cleaning and cleaning only don't make sense either). Although --no-lock and --no-clean are meant to be used together, saying what is the correct or allowed use is hard because the purpose of these flags is to disable the standard session behavior.
Using Bash for the parallel and lock use case. It is little simpler than using multiprocessing and even with that in place, it would be a good alternative test in the future because it is a likely use case. It covers well the no locking part. The no cleaning part is covered by in the Python test using two different approximations of a real case. The current cleaning functionality is tested as well, but not in a standalone test case nor with --clean-only. The most relevant command line combinations are tested in a separate test.
142df27
to
3c3bf67
Compare
If you allow skip locking of a mapset in order to run multiple jobs in the same mapset at the same time, it is entirely up to the user to ensure that the multiple jobs do not interfere with each other, e.g overwriting each other's results. That means it is up to the user to design multiple parallel jobs correctly. |
Anybody can run multiple jobs in one mapset even now. If you are on one machine and start one session, you can launch multiple jobs in it. The user even doesn't have to do any scripting. Run one from command line and another from GUI. Or multiple ones from GUI. There is nothing in GRASS GIS which is stopping user from doing (or at least trying) this. User can run parallel jobs with the --exec (old GRASS_BATCH_JOB) interface when you supply a parallel script. However, when the jobs are running through What is this PR doing is just optionally lifting the parallel execution restrictions due to mapset locking from a particular use of the --exec interface.
Designing parallel jobs is complex and user needs to have at least some understanding to run these successfully. So, although I would agree that it is not generally desirable to have more ways how users can shoot themselves in the foot, in this case, I think it is more desirable to remove obstacles for those who know what they are doing.
For (too many) years a typical (the main?) way of running GRASS GIS in batch execution scenarios was setting up bunch of environmental variables. This is still possible and documented and locking and cleaning is entirely optional and totally up to the user. I think --exec is a great way for doing the same thing the right way, but I don't want the --exec impose restrictions which are not present when setting up a session manually, running an interactive one, or even running one --exec job with a parallel script. |
On 16/05/20 04:28, Vaclav Petras wrote:
If you allow skip locking of a mapset in order to run multiple jobs
in the same mapset at the same time...
Anybody can run multiple jobs in one mapset even now. If you are on one
machine and start one session, you can launch multiple jobs in it. The
user even doesn't have to do any scripting. Run one from command line
and another from GUI. Or multiple ones from GUI. There is nothing in
GRASS GIS which is stopping user from doing (or at least trying) this.
User can run parallel jobs with the --exec (old GRASS_BATCH_JOB)
interface when you supply a parallel script. However, when the jobs are
running through |grass ... --exec|, they cannot run in parallel in one
mapset unlike the jobs in all the other cases.
What is this PR doing is just optionally lifting the parallel execution
restrictions due to mapset locking from a particular use of the --exec
interface.
...it is entirely up to the user to ensure that the multiple jobs do
not interfere with each other, e.g overwriting each other's results.
That means it is up to the user to design multiple parallel jobs
correctly.
Designing parallel jobs is complex and user needs to have at least some
understanding to run these successfully. So, although I would agree that
it is not generally desirable to have more ways how users can shoot
themselves in the foot, in this case, I think it is more desirable to
remove obstacles for those who know what they are doing.
I understand your arguments, Vaclav, and I agree that those who run
parallel jobs will probably be those who know what they are doing. The
thing is that once you create such parameters, you can be sure that
sooner or later people who don't know what they are doing will use them
as well.
A classical issue I've seen many times is that people try to run
parallel jobs on different regions, but don't realize that by changing
the region for the second job, they also do it for the first. Working in
separate mapsets is part of the protection against such errors.
I guess, some of the issue comes from the different ways of viewing
GRASS GIS: for some its an environment, for others its a program. In the
environment paradigm it's up to you to set up the environment to then be
able to run GRASS modules on GRASS data. You can do that either through
classical startup, and that includes protecting you from a series of
errors, or you can do it by setting up the environment yourself (setting
the variables, etc). Doing the latter is more challenging, which is
somewhat of a protection against errors in itself (although the
increasing number of public Jupyter notebooks doing this might change
this a bit).
Your --exec mechanism is more in line with a GRASS GIS as a program with
different ways to launch that program. In that context, including ways
to remove the failsafe mechanisms as straightforward parameters might
open the door to more bad manipulation by users.
Moritz
|
@mlennert There is nothing which prevents users even now to run parallel jobs within one mapset as long as they get a session somehow, e.g., by starting standard GRASS GIS with command line or with --exec:
You can do the same thing even with GUI. The
|
I agree with @mlennert that we should not open additional doors for potential misuse. We do have modules that do controlled parallel processing. Users can use these modules as inspiration to create their own scripts. |
@wenzeslaus This is very interesting discussion. I wanted to add my 2 cents here. I understand that even now we can run parallel jobs from the GUI or command line that can cause issues, but that is not the nature or a designed feature of GRASS GIS. Rather, it's a feature of the GUI or UN*X shell and GRASS GIS just happened to run inside that environment. I personally believe that the solution to this current parallel feature should be to put even more restrictions to disallow parallel outputs to the same map, if possible at all. I see this parallelism as an issue, not as a feature. From this perspective, this PR would not be ideal because it would open the door to more misuse cases and issues like @mlennert and @metzm mentioned. Why don't we instead implement a feature that can seamlessly merge mapsets? Then, we would be able to keep the current locking paradigm and allow multiple separate processes on different mapsets, and later just merge the results into one mapset? Just an idea. |
GRASS is a powerful tool for big data processing and I think we should promote that. Part of that is I think reducing the barriers to using it in HPCs and similar environments. Like with everything, that unfortunately does open ways to 'misuse'. Since the ability to run parallel jobs in single mapset is already there, I think it's better to expose it and properly document rather than to go with the old way of environment variables, which seem rather error-prone and cumbersome to me. I think disabling parallel processing within mapset completely is not very practical idea at this point and we would have to compensate for that with tools which we don't have now. I think the potential misuse could be very well addressed by documenting the cases when this is perfectly safe, for example, don't write to db, don't modify region, for everything else use separate mapsets. Running parallel processes is tricky, everybody knows that, so having good documentation is important and if user doesn't read it, it's their problem. Disclaimer, I am biased here, I would like to use this workflow on HPC. |
Interesting discussion indeed ;-) I wonder whether some generational differences are at work here... One of the underlying postulates that I see here is that the current functioning is a "barrier" to using GRASS in HPC and that the use of GRASS_BATCH_JOB is "old" (should I hear "to be deprecated"). Personally I find GRASS in its current state very easy to use in an HPC environment, and when I say that I mean without using --exec which I find a very interesting addition, but haven't really felt the need for up to now. So maybe it would be helpful to get some more information about what exactly you find these barriers to be ? Is it just the fact that environment variables are used ? Are environment variables a thing of the past ? |
No, but to me there are too low-level and just less convenient, I prefer using command line parameters. Environment variables are error prone (e.g. misspelling), setting them depends on the particular shell. Parameters are typically well defined and they seem like a natural way of controlling a program. I think using --exec is convenient also outside of HPC, for some external programs for example. |
Env variables are so much easier in my eyes because you only have to set them once and not worry about them again. You have to repeat parameters every single time you call the executable... Just a question of perspective I guess. And if you don't want to worry about setting env variables, you can just let the grass startup script do it for you (although admittedly you do have to set GRASS_BATCH_JOB).
I actually saw --exec usage essentially fir third-party software, e.g. making GRASS calls easier for the QGIS processing toolbox, or possibly a workflow where you just have one or two calls to GRASS amongst many calls to other tools. As already mentioned, I haven't found the itch it would scratch for me in my work, but that's maybe mostly the stubbornness of age playing out ;-) |
@mlennert As your comment suggests there is no clear distinction between a third-party software and some user's script/workflow. There is no reason why GRASS GIS power users should be bared of the opportunity to use --exec when they need more flexibility with cleaning and locking. This PR is aiming at leveling the playing field among the different interfaces and environments. As for third-party software providing interface to GRASS GIS, this PR is not aiming at this use case, but these still benefit because this allows to do operations in a mapset while managing the locking and cleaning differently, for example, there is no need to clean a temporary mapset or the application runs a computation using one To give an additional perspective, to make --exec even more useful for tools which manage the GRASS database such as the GRASS plugin for QGIS, another, more radical flag would be needed to, e.g., peek into a locked mapset. This --ignore-lock would be something in between the existing -f (force lock removal) and --no-lock (do not lock mapset) as it would not break into the mapset like the current -f does, but it would not require all processes to agree on locking or not locking (which is what --no-lock does). |
Is the discussion on this PR still relevant today? Have the positions changed? I converted this as a draft as it can’t be updated as is to the current code here, but I’d like to hear what you think of this today: something that needs to be decided on? Or just leave the current code as is and close the PR, as it isn’t planned to integrate this? |
This allows to skip locking of a mapset which in combination with no cleaning of a mapset allows for multiple jobs to be executed in one mapset using --exec. This should be the same as running multiple jobs together right in a single session (having the same limitations, but also not touching any general files).
To account for the use case when the mapset is not used for any interactive session, a separate flag for cleaning is provided for convenience (same as something like --exec g.region -p).
Similarly to --tmp-location and --tmp-mapset, usage of --no-lock and --no-clean without --exec is not disallowed, but it is not documented either before its usefulness is evaluated. One use case is starting an interactive session with --no-lock --no-clean to examine results of jobs running with these.
The combining the new flags with the old ones and between each other is not limited except for few cases which clearly exclude each other (--clean-only means only cleaning and no cleaning and cleaning only don't make sense either). Although --no-lock and --no-clean are meant to be used together, saying what is the correct or allowed use is hard because the purpose of these flags is to disable the standard session behavior.
Additional changes:
This is a much better version of (no closed) Trac 2685 which suggested flag to ignore the lock. This achieves the same more formally by not creating the unwanted lock in the first place which is a standard situation on Windows and when managing the session manually directly through environmental variables. Additionally, this provides an important addition to avoid removal temporary files of another session during a clean up and a convenient way to do that afterwards.
As for the importance of --no-clean, as far as I understand, clean_temp program responsible for removing the files looks at pid of a process which supposedly created the file based on the file name and if the process is still running, it does not remove the file. (For other files it looks at the date and it works within the machine's/node's subdirectory.) However, in my test I do get some problems (maps not created) and warnings from clean_temp:
WARNING: Can't stat file ...: No such file or directory,skipping
. Additionally current clean up at the end of the session is vacuuming SQLite, so multiple vacuums are requested. (Perhaps this is a moderate issue because parallel jobs running in one mapset can practically use a single SQLite database only for reading.) The test follows.This should improve parallel workflows, e.g. with GNU parallel or on HPC. Expected usage: