suggestion: consider using record markers in the packs #124

I have a suggestion for improving the long-term resilience of the pack files: we should consider including record markers in the pack file (e.g. similar to Fortran sequential files).

The main advantage is that this would allow extraction of the data without the SQLite database (the index) at all, for example when a disk storage failure leaves the database corrupted.

My understanding is that the current implementation simply appends the stream to the *pack* file and records the offset and length in the SQLite database. Instead, a slight modification can be made when writing the stream, something like this:

1. Write an *N*-byte integer marker whose value is the record length
2. Write the actual data
3. Write the same *N*-byte integer marker again, indicating the end of the record
4. Record the offset and the length of the object, taking the marker sizes into account
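
The four steps above could look roughly like the sketch below. It is only an illustration: the 8-byte little-endian marker and the `write_record` helper are my assumptions, not anything that exists in `disk-objectstore`.

```python
import io
import struct
from typing import BinaryIO, Tuple

# Assumption: N = 8, little-endian, unsigned. The proposal leaves N open.
MARKER = struct.Struct("<Q")


def write_record(pack_handle: BinaryIO, data: bytes) -> Tuple[int, int]:
    """Append one framed record (hypothetical helper).

    Returns (offset, length) of the payload itself, i.e. the values that
    would go into the SQLite index. The handle must be positioned at the
    end of the pack.
    """
    marker = MARKER.pack(len(data))
    pack_handle.write(marker)      # step 1: leading marker with the length
    offset = pack_handle.tell()    # the payload starts right after the marker
    pack_handle.write(data)        # step 2: the actual data
    pack_handle.write(marker)      # step 3: the same marker repeated
    return offset, len(data)       # step 4: what the index should record


pack = io.BytesIO()
first = write_record(pack, b"hello")
second = write_record(pack, b"world!!")
```

Because the stored offset points at the payload rather than at the marker, existing readers that use the index would not need to know about the markers at all.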
 
In Fortran, *N* is actually compiler dependent. With N=4 the maximum record length is about 2GB, although there is support for "sub-records"; this is mainly for backward compatibility with F77. One can also use an 8-byte marker so the maximum size is greatly increased, at the cost of some storage overhead. The sign of the integer may be used as a flag for whether the record is compressed, or one can include dedicated flag bytes etc.
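
To make the size limits and the sign-as-flag idea concrete, here is a small sketch using `struct`. The helper names and the choice of a signed 8-byte little-endian marker are assumptions for illustration only.

```python
import struct

# Largest length a signed marker can hold for each width.
MAX_LEN_4 = 2**31 - 1   # 4-byte marker: just under 2 GiB
MAX_LEN_8 = 2**63 - 1   # 8-byte marker: 4 extra bytes per marker

# Assumption: signed 8-byte little-endian marker.
SIGNED_MARKER = struct.Struct("<q")


def encode_marker(length: int, compressed: bool) -> bytes:
    """Pack the length, using a negative sign to flag a compressed record."""
    if not 0 <= length <= MAX_LEN_8:
        raise ValueError("length does not fit in the marker")
    return SIGNED_MARKER.pack(-length if compressed else length)


def decode_marker(raw: bytes) -> tuple:
    """Return (length, compressed) from a packed marker."""
    (value,) = SIGNED_MARKER.unpack(raw)
    return abs(value), value < 0


roundtrip = decode_marker(encode_marker(1024, compressed=True))
# Caveat: -0 == 0, so a zero-length record cannot carry the flag this way.
empty = decode_marker(encode_marker(0, compressed=True))
```

The zero-length caveat is one argument for a dedicated flag byte instead of overloading the sign.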

Now, I realised that a BIG difference here (compared to Fortran WRITE) is that the size of the object may be unknown before writing it, as it can just be a stream when writing data directly to the pack. But we can just omit step 1 above, or make it a fixed value (for integrity checks), and to reconstruct the index, we read the pack file backwards and find each object using the trailing markers.
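
A sketch of this streaming variant, with only a trailing marker and a backwards scan, is given below. The function names, the chunk size and the 8-byte marker are assumptions; this is not existing `disk-objectstore` code.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

# Assumption: 8-byte little-endian unsigned trailing marker.
TRAILER = struct.Struct("<Q")


def write_stream_record(pack: BinaryIO, source: BinaryIO,
                        chunk_size: int = 65536) -> Tuple[int, int]:
    """Copy a stream of unknown size into the pack, then append its length."""
    offset = pack.tell()           # the handle must be at the end of the pack
    length = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        pack.write(chunk)
        length += len(chunk)       # the size is only known once the stream ends
    pack.write(TRAILER.pack(length))
    return offset, length


def scan_backwards(pack: BinaryIO) -> List[Tuple[int, int]]:
    """Recover (offset, length) of every record by walking from the end."""
    records = []
    end = pack.seek(0, io.SEEK_END)
    while end > 0:
        if end < TRAILER.size:
            raise ValueError("truncated trailing marker")
        pack.seek(end - TRAILER.size)
        (length,) = TRAILER.unpack(pack.read(TRAILER.size))
        offset = end - TRAILER.size - length
        if offset < 0:
            raise ValueError("marker points before the start of the pack")
        records.append((offset, length))
        end = offset               # the previous record ends where this one starts
    records.reverse()
    return records


pack = io.BytesIO()
written = [write_stream_record(pack, io.BytesIO(payload))
           for payload in (b"abc", b"", b"defgh")]
recovered = scan_backwards(pack)
```

One weakness of the trailing-only layout is that a single corrupted marker makes every earlier record unreachable from the end, which is where a fixed-value leading marker could help as a resynchronisation point.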

This would allow reconstruction of the SQLite index purely from the pack file. One first reads the marker, then uses the length to read the content (object), followed by the ending marker as an (optional) verification. Then the hash can be computed for this object (uncompressing it first if needed), and the offset, size and length can be recorded in the new SQLite database.
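
As a rough sketch of that reconstruction (for the variant with both leading and trailing markers), the loop could look like this. The table name, its columns, the SHA-256 hash and the 8-byte marker are all assumptions made for the example and do not reflect the real `disk-objectstore` schema; compression is ignored for brevity.

```python
import hashlib
import io
import sqlite3
import struct
from typing import BinaryIO

# Assumption: 8-byte little-endian unsigned markers before and after each record.
MARKER = struct.Struct("<Q")


def rebuild_index(pack: BinaryIO, conn: sqlite3.Connection) -> int:
    """Scan a marker-framed pack and fill a fresh (hypothetical) index table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recovered_obj "
        "(hashkey TEXT PRIMARY KEY, pack_offset INTEGER, length INTEGER)"
    )
    pack.seek(0)
    count = 0
    while True:
        head = pack.read(MARKER.size)
        if not head:
            break                              # clean end of the pack
        if len(head) != MARKER.size:
            raise ValueError("truncated leading marker")
        (length,) = MARKER.unpack(head)
        offset = pack.tell()
        data = pack.read(length)               # whole object in memory: fine for a sketch
        tail = pack.read(MARKER.size)
        if len(data) != length or tail != head:
            raise ValueError(f"marker mismatch for record at offset {offset}")
        hashkey = hashlib.sha256(data).hexdigest()
        conn.execute(
            "INSERT OR IGNORE INTO recovered_obj VALUES (?, ?, ?)",
            (hashkey, offset, length),
        )
        count += 1
    conn.commit()
    return count


# Build a tiny framed pack by hand and rebuild its index in memory.
payloads = [b"foo", b"barbaz"]
raw = b"".join(MARKER.pack(len(p)) + p + MARKER.pack(len(p)) for p in payloads)
conn = sqlite3.connect(":memory:")
n_found = rebuild_index(io.BytesIO(raw), conn)
rows = conn.execute(
    "SELECT hashkey, pack_offset, length FROM recovered_obj ORDER BY pack_offset"
).fetchall()
```

A real implementation would hash in chunks rather than reading each object fully, and would decompress before hashing where the record is flagged as compressed.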

I think the only part of the code that would need to be updated is the writing routine (`_write_data_to_packfile`), to include the record mark(s):
https://github.com/aiidateam/disk-objectstore/blob/7a09ea2a953f0b0dfa79a6688306c51a501f874b/disk_objectstore/container.py#L1245-L1251

and also the `repack` methods.

This approach should maintain full backward compatibility, as the precise size, length and offsets are stored in the SQLite database; it is just that the old pack format would not be supported for reconstruction without the SQLite database.

References:
- Pack format used by the `restic` backup software, which allows index reconstruction (we don't need to be as complicated as this, as they also need the data to be encrypted): https://restic.readthedocs.io/en/stable/100_references.html#pack-format
- Some details of the Fortran record format: https://traktofon.github.io/FortranFiles.jl/stable/files.html
- Some Python code for reading Fortran records: https://github.com/zhubonan/castepxbin/blob/3b57e0b9470d4217afc599c96bbef4c132026263/castepxbin/castep_bin.py#L702-L740

	def _write_data_to_packfile(
	    self,
	    pack_handle: StreamWriteBytesType,
	    read_handle: StreamReadBytesType,
	    compress: bool,
	    hash_type: Optional[str] = None,
	) -> Union[Tuple[int, None], Tuple[int, str]]:
