
Consider mounting volumes outside of the container for data persistence #3

Open
machawk1 opened this issue Aug 30, 2018 · 6 comments

@machawk1

You can then stop the container using docker stop wasp and start it again with docker start wasp. Note that your archive is stored in the container. If you remove the container, your archive is gone.

Docker allows one to mount directories outside of the container as volumes. Doing so would prevent the above scenario of the data disappearing when the container is gone.

@ibnesayeed

Persisting data from containers is a well-known and well-documented topic so we can assume those familiar with Docker would know how to do it, be it using volumes, bind mounts, or third-party storage drivers. However, the documentation of this repo should at least describe all the places (or declare them as volumes) where data of different services are being stored so that users know where to mount drives for persistence. Also, a simple bind mount example command will not hurt either.
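For illustration, such a bind mount might look like the following; the image name and both paths here are placeholders rather than the actual ones used by this project:

# Bind-mount a host directory into the container so the data survives `docker rm`.
# Image name and both paths are illustrative placeholders.
docker run -d --name wasp -v /srv/wasp/data:/path/to/data/in/container example/wasp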

That said, I am not a big fan of monolithic containers that run too many services in a single container. This might work well when things are used as a portable desktop application, but for any serious, scalable setup every service should have its own container and be orchestrated using a stack file (or docker-compose).

@arjenpdevries
Collaborator

Mastodon solves this with docker-compose, and that indeed works quite nicely, so we can "borrow" their setup.

@johanneskiesel
Member

That would be useful indeed. Currently we have the following places:

  • /home/user/srv/warcprox/archive contains the WARC files
  • /home/user/srv/pywb/collections/archive contains a link to the WARC files and the pywb indexes + templates
  • /home/user/srv/elasticsearch/index contains the Elasticsearch index

pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently. This is (currently) not the case for the Elasticsearch index, but that could be changed relatively easily.

It would then be enough to store the WARC files persistently, which would make sense to me. This would also allow you to just add WARC files you recorded with another system.
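As a sketch of that workflow: assuming the WARC directory /home/user/srv/warcprox/archive is bind-mounted to a (hypothetical) host directory /srv/wasp/warcs, WARCs recorded with another system could simply be copied there and picked up by pywb's auto-indexing:

# /srv/wasp/warcs is a hypothetical host directory bind-mounted to
# /home/user/srv/warcprox/archive inside the container.
cp other-crawl.warc.gz /srv/wasp/warcs/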

What do you think?

(As the different services are currently "talking" to each other by the file system, separating them into different services would take some effort. I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.)

@johanneskiesel added the enhancement label on Aug 30, 2018
@johanneskiesel self-assigned this on Aug 30, 2018
@ibnesayeed

I think we don't need to put applications in deeper directories when running in containers because of the file system isolation. I would perhaps suggest placing all the individual apps directly under the / of the container file system, or making a directory at /wasp and placing everything under that. This way, unnecessary repetition of the /home/user/srv path prefix can be avoided when dealing with volumes.

Alternatively, we should be able to modify the data directories of all these applications and place them under something like /data/{warcprox,pywb,elasticsearch}. This way, the code is isolated from the data and one is not a sub-directory of the other. This structure would also allow mounting either just that one directory (if the sub-directory structure on the host is the same) or each application's data directory separately.
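A rough sketch of those two options (host paths and image name are placeholders):

# Option 1: mount a single host directory that mirrors the proposed /data layout.
docker run -d --name wasp -v /srv/wasp-data:/data example/wasp

# Option 2: mount each application's data directory separately.
docker run -d --name wasp \
  -v /srv/wasp/warcprox:/data/warcprox \
  -v /srv/wasp/pywb:/data/pywb \
  -v /srv/wasp/elasticsearch:/data/elasticsearch \
  example/wasp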

pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently.

If I understand it correctly, PyWB automatically indexes WARC files that are not indexed already (i.e., their CDXJ records are missing). @ikreymer correct me if I am wrong here. If so, then persisting the PyWB index is also important; otherwise, each time a container is started, CDXJ indexing needs to happen all over again. This might not be a big deal for small collections, but it will become significant for larger ones.

As the different services are currently "talking" to each other by the file system, separating them into different services would take some effort.

If a stack/compose file is provided, it can define the necessary volumes and make them available to each service so that they share the file system, which deals with this.

I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.

If this project is only intended for small, single-user setups, then this assumption is fair enough.

@arjenpdevries
Collaborator

arjenpdevries commented Aug 31, 2018

Mastodon uses external services such as Postgres and Redis, which run using their own images. This is, for example, how Postgres is used in Mastodon.

In my docker-compose.yml file for idf.social, the Postgres storage directories are connected to directories in the host file system:

db:
  restart: always
  image: postgres:9.6-alpine
  networks:
    - internal_network
  volumes:
    - /data/mastodon/postgres/postgres:/var/lib/postgresql/data:z

The volumes directive causes the data for the Postgres database to reside outside the container, in the host directory /data/mastodon/postgres/postgres. (The :z option is necessary for SELinux.)

See also the docker-compose.yml for the full, more complex definition, as Mastodon uses more services, and also defines some additional volumes to store its own dynamic data in the host filesystem.

@ibnesayeed

There are many ways to achieve this. We can even declare volumes and networks as top-level objects in the compose file, then use those to deploy with docker-compose for quick testing or with the built-in docker stack for a more robust long-running system. Shared volumes will allow file-based communication, and shared networks will allow container services to reach each other using service names.
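A minimal sketch of such a compose file, using placeholder image names and mount paths (not the project's actual ones), could look like this:

version: "3.7"

services:
  warcprox:
    image: example/warcprox        # placeholder image name
    networks:
      - wasp_net
    volumes:
      - warc_data:/data/warcprox

  pywb:
    image: example/pywb            # placeholder image name
    networks:
      - wasp_net
    volumes:
      - warc_data:/data/warcprox   # shared volume for file-based communication

networks:
  wasp_net:

volumes:
  warc_data:

The same file could then be deployed with docker-compose up -d for quick testing, or with docker stack deploy -c docker-compose.yml wasp on a Swarm.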
