
Consider mounting volumes outside of the container for data persistence #3

Open
machawk1 opened this issue Aug 30, 2018 · 6 comments

@machawk1

You can then stop the container using docker stop wasp and start it again with docker start wasp. Note that your archive is stored in the container. If you remove the container, your archive is gone.

Docker allows one to mount directories outside of the container as volumes. Doing so would prevent the above scenario of the data disappearing when the container is gone.

@ibnesayeed

Persisting data from containers is a well-known and well-documented topic so we can assume those familiar with Docker would know how to do it, be it using volumes, bind mounts, or third-party storage drivers. However, the documentation of this repo should at least describe all the places (or declare them as volumes) where data of different services are being stored so that users know where to mount drives for persistence. Also, a simple bind mount example command will not hurt either.
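For illustration, such a bind mount might look like the following; the image name and both paths here are placeholders rather than the actual ones used by this project:

# Bind-mount a host directory into the container so the data survives `docker rm`.
# Image name and both paths are illustrative placeholders.
docker run -d --name wasp -v /srv/wasp/data:/path/to/data/in/container example/wasp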

That said, I am not a big fan of monolithic containers that run too many services in a single container. This might work well when things are used as a portable desktop application, but for any serious, scalable setup every service should have its own container and be orchestrated using a stack file (or docker-compose).

@arjenpdevries
Collaborator

Mastodon solves this with docker-compose, and that indeed works quite nicely, so we can "borrow" their setup.

@johanneskiesel
Member

That would be useful indeed. Currently we have the following places:

  • /home/user/srv/warcprox/archive contains the WARC files
  • /home/user/srv/pywb/collections/archive contains a link to the WARC files and the pywb indexes + templates
  • /home/user/srv/elasticsearch/index contains the Elasticsearch index

pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently. This is (currently) not the case for the Elasticsearch index, but that could be changed relatively easily.

It would then be enough to store the WARC files persistently, which would make sense to me. This would also allow you to just add WARC files you recorded with another system.
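As a sketch of that workflow: assuming the WARC directory /home/user/srv/warcprox/archive is bind-mounted to a (hypothetical) host directory /srv/wasp/warcs, WARCs recorded with another system could simply be copied there and picked up by pywb's auto-indexing:

# /srv/wasp/warcs is a hypothetical host directory bind-mounted to
# /home/user/srv/warcprox/archive inside the container.
cp other-crawl.warc.gz /srv/wasp/warcs/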

What do you think?

(As the different services are currently "talking" to each other by the file system, separating them into different services would take some effort. I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.)

@johanneskiesel added the enhancement label on Aug 30, 2018
@johanneskiesel self-assigned this on Aug 30, 2018
@ibnesayeed

I think we don't need to put applications in deeper directories when running in containers because of the file system isolation. I would perhaps suggest placing all the individual apps directly under the / of the container file system, or making a directory at /wasp and placing everything under that. This way, unnecessary repetition of the /home/user/srv path prefix can be avoided when dealing with volumes.

Alternatively, we should be able to modify the data directories of all these applications and place them under something like /data/{warcprox,pywb,elasticsearch}. This way, the code is isolated from the data and one is not a sub-directory of the other. This structure would also allow mounting either just that one directory (if the sub-directory structure on the host is the same) or each application's data directory separately.
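A rough sketch of those two options (host paths and image name are placeholders):

# Option 1: mount a single host directory that mirrors the proposed /data layout.
docker run -d --name wasp -v /srv/wasp-data:/data example/wasp

# Option 2: mount each application's data directory separately.
docker run -d --name wasp \
  -v /srv/wasp/warcprox:/data/warcprox \
  -v /srv/wasp/pywb:/data/pywb \
  -v /srv/wasp/elasticsearch:/data/elasticsearch \
  example/wasp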

pywb automatically indexes what it finds in the linked directory, so the pywb index would not need to be stored persistently.

If I understand it correctly, PyWB automatically indexes WARC files that are not indexed already (i.e., their CDXJ records are missing). @ikreymer correct me if I am wrong here. If so, then persisting the PyWB index is also important; otherwise, each time a container is started, CDXJ indexing needs to happen all over again. This might not be a big deal for small collections, but it will become significant for larger ones.

As the different services are currently "talking" to each other by the file system, separating them into different services would take some effort.

If a stack/compose file is provided, it can define the necessary volumes and make them available to each service so that they share the file system, which deals with this.

I agree that this is the way to go for scalable setups, but a scalable setup is probably not needed for a one-person archiver.

If this project is only intended for small, single-user setups, then this assumption is fair enough.

@arjenpdevries
Collaborator

arjenpdevries commented Aug 31, 2018

Mastodon uses external services such as Postgres and Redis, which run using their own images. This is, for example, how Postgres is used in Mastodon.

In my docker-compose.yml file for idf.social, the Postgres storage directories are connected to directories in the host file system:

db:
  restart: always
  image: postgres:9.6-alpine
  networks:
    - internal_network
  volumes:
    - /data/mastodon/postgres/postgres:/var/lib/postgresql/data:z

The volumes directive causes the data for the Postgres database to reside outside the container, in the host directory /data/mastodon/postgres/postgres. (The :z option is necessary for SELinux.)

See also the docker-compose.yml for the full, more complex definition, as Mastodon uses more services, and also defines some additional volumes to store its own dynamic data in the host filesystem.

@ibnesayeed

There are many ways to achieve this. We can even declare volumes and networks as top-level objects in the compose file, then use those to deploy with docker-compose for quick testing or with the built-in docker stack for a more robust long-running system. Shared volumes will allow file-based communication, and shared networks will allow container services to reach each other using service names.
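A minimal sketch of such a compose file, using placeholder image names and mount paths (not the project's actual ones), could look like this:

version: "3.7"

services:
  warcprox:
    image: example/warcprox        # placeholder image name
    networks:
      - wasp_net
    volumes:
      - warc_data:/data/warcprox

  pywb:
    image: example/pywb            # placeholder image name
    networks:
      - wasp_net
    volumes:
      - warc_data:/data/warcprox   # shared volume for file-based communication

networks:
  wasp_net:

volumes:
  warc_data:

The same file could then be deployed with docker-compose up -d for quick testing, or with docker stack deploy -c docker-compose.yml wasp on a Swarm.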
