Restore backup not working #404

Closed
SiHoll opened this issue Feb 4, 2025 · 11 comments
SiHoll commented Feb 4, 2025

I just wanted to test the SysReptor backup flow and ran into a problem: I cannot restore the created backup.
Can you please assist and check why backups are not working properly?

To create a backup, I followed the instructions on your backup wiki page. Running the command docker compose run --rm app python3 manage.py backup > backup.zip produces the following output. It does not print any message like "Successfully created backup":

[+] Creating 1/1
 ✔ Container sysreptor-db  Running    0.0s
2025-02-04 16:11:44,200 [INFO] root: Backup requested

When restoring the backup with the command cat backup.zip | docker compose run --rm --no-TTY app python3 manage.py restorebackup on a different machine with a freshly installed SysReptor instance, I get the following output:

[+] Creating 1/1
 ✔ Container sysreptor-db  Running    0.0s
Traceback (most recent call last):
  File "/app/api/manage.py", line 22, in <module>
    main()
  File "/app/api/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.12/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/base.py", line 413, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.12/site-packages/django/core/management/base.py", line 459, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/api/reportcreator_api/management/commands/restorebackup.py", line 36, in handle
    with ZipFile(file=f, mode='r') as z:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/zipfile/__init__.py", line 1349, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.12/zipfile/__init__.py", line 1416, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
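For context, this failure mode matches a truncated archive: ZipFile locates the central directory at the end of the file, so a backup whose final chunk was never written fails exactly this way even though it begins with a valid local file header. A minimal stdlib sketch (illustrative, not SysReptor code) reproducing the symptom:

```python
import io
import zipfile

# Build a small zip in memory, then truncate it to simulate a backup
# whose final chunk (holding the central directory) was never flushed.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as z:
    z.writestr("VERSION", "2025.1")

data = buf.getvalue()
truncated = data[: len(data) // 2]  # drop the tail, including the central directory

# The local file header at the start is still valid...
assert truncated.startswith(b"PK\x03\x04")

# ...but ZipFile cannot open it, because the end-of-central-directory
# record is missing, exactly like the error reported in this issue.
try:
    zipfile.ZipFile(io.BytesIO(truncated))
except zipfile.BadZipFile as e:
    print(e)  # "File is not a zip file"
```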

If I check the created backup.zip file with unzip -l backup.zip I get the following error:

Archive:  backup.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of backup.zip or
        backup.zip.zip, and cannot find backup.zip.ZIP, period.
@SiHoll SiHoll changed the title Restore Backup not working Restore backup not working Feb 4, 2025
aronmolnar (Contributor) commented Feb 6, 2025

Hi, thank you for reporting.

We were, unfortunately, not able to reproduce the issue. We tried on three different installations of different sizes and the command produced a valid zip file (there is no success message, btw, but we can add this for more clarity).

  • Does the command return any content, or is the file empty (zero size)?
  • Do you have sufficient space on your hard disk/partition?
  • Does the procedure of generating backups via the web UI work?

The root cause seems to be the backup creation, not the restore.

SiHoll (Author) commented Feb 6, 2025

Yes, you are right. The issue is most likely the creation of the backup, not restoring it.

  1. The command creates a backup file that is roughly 6 GB in size and runs for around 5 minutes. Around 80 GB of our disk are used; SysReptor is the only application running on that system.
  2. We have around 60 GB of disk space available, so that should not be the problem.
  3. Generating the backup via the web UI does not work either. The download just stops after around 2 GB.

Are there any logs or other info I can take a look at?

aronmolnar (Contributor) commented:

Let us check internally; maybe we can troubleshoot this in detail next week, if that's okay with you.

/cc @MWedl

MWedl (Contributor) commented Feb 10, 2025

Hi,
did you encrypt the backups via the --key CLI option? Could you also check whether the file header looks like a valid zip header?

The backup request via web UI / API is most likely cancelled by request timeouts of the reverse proxy because the request took too long. For the CLI command, there are no timeouts.

Maybe the problem is related to docker stdout forwarding. Please try if writing directly to a file in a bind-mounted directory produces a valid backup zip:

docker compose run --rm --volume=${PWD}:/backup app python3 manage.py backup /backup/backup.zip

SiHoll (Author) commented Feb 10, 2025

No, I do not use the --key option to encrypt the backup.

The file header is a valid zip file as far as I can tell:

~$ file backup.zip
backup.zip: Zip archive data, at least v2.0 to extract, compression method=deflate

~$ head -n 1 backup.zip | xxd
00000000: 504b 0304 1400 0800 0800 1b36 4a5a 0000  PK.........6JZ..
00000010: 0000 0000 0000 0000 0000 0700 0000 5645  ..............VE
00000020: 5253 494f 4e33 3230 32d5 3334 0200 504b  RSION3202.34..PK
00000030: 0708 05b2 a786 0900 0000 0700 0000 504b  ..............PK
00000040: 0304 1400 0800 0800 1b36 4a5a 0000 0000  .........6JZ....
00000050: 0000 0000 0000 0000 0f00 0000 6d69 6772  ............migr
00000060: 6174 696f 6e73 2e6a 736f 6ebd 5ad9 8ea3  ations.json.Z...
00000070: 3814 fd95 529e 5b1a d62c f32b ad96 e580  8...R.[..,.+....
...

The command that writes the backup directly to a file produces the same error. The file is approximately the same size, and unzip -l backup.zip reports the same error as before.

MWedl (Contributor) commented Feb 10, 2025

I think I found the problem: in the backup CLI command, the output file is not closed. Because of that, there might be a race condition where the last chunk of the zip file (containing the end of the central directory) is not flushed before the process exits. Unfortunately, I cannot confirm that this is actually the root cause, because I could not reproduce generating corrupt backup.zip files.
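A minimal sketch of the suspected bug class (the names below are illustrative, not SysReptor's actual code): if the output file is never closed, the final buffered chunk, which holds the zip's central directory, may never reach disk before the process exits.

```python
import sys

# Illustrative sketch, not SysReptor's actual backup command: write streamed
# zip chunks either to a file or to stdout, making sure the final chunk is
# flushed before the process exits.
def write_backup(chunks, path=None):
    if path:
        # Fix for the suspected bug: the context manager guarantees
        # flush + close, so the central directory is fully written.
        with open(path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
    else:
        # When streaming to stdout (e.g. through docker), flush explicitly.
        for chunk in chunks:
            sys.stdout.buffer.write(chunk)
        sys.stdout.buffer.flush()
```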

SiHoll (Author) commented Feb 10, 2025

Okay, I will try the CLI backup after the next SysReptor update and will let you know whether it works.

In the meantime, I tried increasing the timeout for the nginx proxy that sits in front of SysReptor in our case. It was configured with your suggested configuration file (deploy/nginx/sysreptor.nginx). I updated the following line:

proxy_read_timeout 15m; # changed from "5m"

However, the web backup functionality still stops after around 3 minutes and ~3 GB. Can you provide more details about which reverse proxy timeout you mean? It would also be nice if the documentation and/or the default server configuration contained appropriate timeout values.

MWedl (Contributor) commented Feb 27, 2025

After some debugging, I found the root cause of why backup requests sometimes fail. It's not related to nginx's proxy_read_timeout, because that timeout does not apply to the whole request, only to the gap between two consecutive chunks of data.

The gunicorn server regularly restarts worker processes to prevent them from consuming too many resources (Python garbage collection does not always return memory to the OS). When a worker process is restarted, it has to finish all currently running requests within a timeout (300 s). When requests take longer and the timeout is exceeded, the worker process is killed and running requests are aborted.

A backup request therefore gets aborted if it is still in flight when a restarting worker's timeout expires. This can happen because the backup contains a lot of data, the client has a slow network connection, or the server uses slow file storage.

To solve this problem, we can increase the timeout to have more time to finish the download. I'm not yet sure what a sufficient timeout should look like.
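The knobs involved are gunicorn's worker-recycling and graceful-shutdown settings. A hedged sketch of a gunicorn.conf.py with illustrative values (SysReptor's actual configuration may differ):

```python
# gunicorn.conf.py -- illustrative values, not SysReptor's shipped config.

# Recycle workers after a number of requests so memory held by Python's
# garbage collector is returned to the OS; jitter staggers the restarts.
max_requests = 500
max_requests_jitter = 100

# How long a restarting worker may keep finishing in-flight requests.
# This is the timeout that aborts long-running backup downloads if set
# too low; raising it gives large backups time to complete.
graceful_timeout = 1800

# Per-request worker timeout for normal operation.
timeout = 300
```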

SiHoll (Author) commented Mar 3, 2025

Small update from my side: after your new release, the CLI backup command returned exit code 137. This exit code indicates that our system was running out of RAM. After I upgraded our SysReptor VM from 8 to 16 GB of RAM, the command ran until our disk ran out of free space, so I am dealing with a new problem now 😄 As a hint: your documentation states that 8 GB of RAM is sufficient. Maybe 8 GB of RAM makes my error reproducible, and you might want to consider this limitation when working on the backup functionality.

The web backup functionality would (hopefully) solve the problem of not having enough free disk space. In my opinion, timeouts should be independent of backup size and download speed. Is there a way to implement multiple workers that process smaller parts of the backup and can finish within the regular timeout of 300s?

MWedl (Contributor) commented Mar 4, 2025

Thanks for the update. You are right, backup creation consumes a lot of memory. This should not be the case, because backup creation is streamed and should only use minimal memory. We will investigate and try to reduce the memory load.

Splitting the backup into small chunks served by multiple requests is not that easy with the current backup function. Backups are generated in memory as a streamed response using the zipstream-ng library, and saving the backup state mid-stream is unfortunately not supported by the library. We would have to store the full backup file on the filesystem (or another file storage, e.g. S3) and then download this file in chunks. The downside is that this approach consumes a lot of disk space.
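The disk-backed alternative described above can be sketched as follows (illustrative names, not SysReptor's API): materialize the backup once, then let clients download it in resumable fixed-size chunks, each small enough to finish well within a worker's request timeout.

```python
import os

# Illustrative sketch: serve a file that already exists on disk in
# fixed-size chunks, e.g. to back HTTP Range-style resumable downloads.
def read_chunk(path: str, offset: int, length: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def iter_chunks(path: str, chunk_size: int = 8 * 1024 * 1024):
    """Yield the whole file as consecutive chunks of at most chunk_size."""
    size = os.path.getsize(path)
    for offset in range(0, size, chunk_size):
        yield read_chunk(path, offset, chunk_size)
```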

MWedl (Contributor) commented Mar 5, 2025

Fixed multiple backup bugs in https://github.com/Syslifters/sysreptor/releases/tag/2025.20

  • Fix corrupt backup.zip because output file is not closed in backup CLI command
  • Add log message when backup is finished
  • Increase gunicorn worker restart timeout to prevent aborted backup requests
  • Do not fully load files into memory in S3 storage to reduce memory usage during backups

@MWedl MWedl closed this as completed Mar 5, 2025