Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zip_entries_delete fails for in memory stream zip #330

Closed
prot0man opened this issue Jan 12, 2024 · 12 comments
Closed

zip_entries_delete fails for in memory stream zip #330

prot0man opened this issue Jan 12, 2024 · 12 comments

Comments

@prot0man
Copy link
Contributor

prot0man commented Jan 12, 2024

It looks like zip_entries_delete fails to delete files for in memory stream zips because pState->m_pFile is always NULL for them. Some example code to trigger the issue is:

    int result = 0;
    const char outfile_path[] = "myoutfile";
    char file1_name[] = "file1";
    char file1_data[] = "sometestdataforfile1";
    char file2_name[] = "file2";
    char file2_data[] = "sometestdataforfile2";
    char file3_name[] = "file3";
    char file3_data[] = "sometestdataforfile3";
    char *del_entries[] = {file2_name};
    uint8_t *zdata = NULL;
    size_t zdata_len = 0;
    struct zip_t *pzip = NULL;
    ssize_t num_del_entries = 0;

    pzip = zip_stream_open(NULL, 0, ZIP_DEFAULT_COMPRESSION_LEVEL, 'w');
    if(!pzip) {
        printf("Failed to init in-memory zip\n");
        result = TZIP_ERROR;
        goto cleanup;
    }

    zip_entry_open(pzip, file1_name);
    zip_entry_write(pzip, file1_data, strlen(file1_data));
    zip_entry_close(pzip);

    zip_entry_open(pzip, file2_name);
    zip_entry_write(pzip, file2_data, strlen(file2_data));
    zip_entry_close(pzip);

    zip_entry_open(pzip, file3_name);
    zip_entry_write(pzip, file3_data, strlen(file3_data));
    zip_entry_close(pzip);

    num_del_entries = zip_entries_delete(pzip, del_entries, 1);
    if(num_del_entries < 0) {
        printf("Failed to delete entry\n");
        result = TZIP_ERROR;
        goto cleanup;
    }

    zip_stream_copy(pzip, (void **)&zdata, &zdata_len);
    if(!zdata) {
        printf("Failed to get compressed data");
        result = TZIP_ERROR;
        goto cleanup;
    }

    result = write_to_disk(outfile_path, zdata, zdata_len);
    if(result != TZIP_OK) {
        printf("Failed to write to disk: %d\n", result);
        return result;
    }
    printf("Wrote zip to %s\n", outfile_path);

cleanup:
    free(zdata);

    if(pzip) {
        zip_stream_close(pzip);
    }

    return result;
@prot0man
Copy link
Contributor Author

I've created an MR into my local fork of this repo to illustrate some changes that may help fix the issue at prot0man@2287565

Though those changes result in zip_entries_delete successfully deleting the file, I get CRC warnings when I
unzip:

Archive:  myoutfile
  inflating: file1
file3:  mismatching "local" filename (file2),
         continuing with "central" filename version
  inflating: file3                    bad CRC a0c3e8fc  (should be d7c4d86a)

@prot0man
Copy link
Contributor Author

prot0man commented Jan 12, 2024

Actually, with the changes in my fork, it doesn't actually successfully delete the correct file (file2) based on the hexdump of the resulting zip:

proto@D:~/repos/zip/src/scratch$ xxd myoutfile
00000000: 504b 0304 1400 0808 0800 db64 2c58 0000  PK.........d,X..
00000010: 0000 0000 0000 0000 0000 0500 0400 6669  ..............fi
00000020: 6c65 3101 0000 0001 1400 ebff 736f 6d65  le1.........some
00000030: 7465 7374 6461 7461 666f 7266 696c 6531  testdataforfile1
00000040: 504b 0708 46b9 ca39 1900 0000 0000 0000  PK..F..9........
00000050: 1400 0000 0000 0000 504b 0304 1400 0808  ........PK......
00000060: 0800 db64 2c58 0000 0000 0000 0000 0000  ...d,X..........
00000070: 0000 0500 0400 6669 6c65 3201 0000 0001  ......file2.....
00000080: 1400 ebff 736f 6d65 7465 7374 6461 7461  ....sometestdata
00000090: 666f 7266 696c 6532 504b 0708 fce8 c3a0  forfile2PK......
000000a0: 1900 0000 0000 0000 1400 0000 0000 0000  ................
000000b0: 504b 0102 0000 1400 0808 0800 db64 2c58  PK...........d,X
000000c0: 46b9 ca39 1900 0000 1400 0000 0500 0400  F..9............
000000d0: 0000 0000 0000 0000 0000 0000 0000 6669  ..............fi
000000e0: 6c65 3101 0000 0050 4b01 0200 0014 0008  le1....PK.......
000000f0: 0808 00db 642c 586a d8c4 d719 0000 0014  ....d,Xj........
00000100: 0000 0005 0004 0000 0000 0000 0000 0000  ................
00000110: 0058 0000 0066 696c 6533 0100 0000 504b  .X...file3....PK
00000120: 0506 0000 0000 0200 0200 6e00 0000 b000  ..........n.....
00000130: 0000 0000                                ....

@kuba--
Copy link
Owner

kuba-- commented Jan 12, 2024

Unfortunately, deleting entries has some limitations. I means, you cannot open the archive for writing w and delete entries.
You have to close it, first (zip_close) and then open for deleting (there is also, just for consistency a special d mode for deleting).

In other words, the workflow should be like this:

zip_open
{
    zip_entry_open
    zip_entry_write
    zip_entry_close
}
zip_close
zip_open("foo.zip", 0, 'd');
{
    zip_entries_delete

    // you can also delete by index, instead of by name
    // zip_entries_deletebyindex
}
zip_close

look at readme example.

@prot0man
Copy link
Contributor Author

prot0man commented Jan 12, 2024 via email

@kuba--
Copy link
Owner

kuba-- commented Jan 12, 2024

It's all about calculating index and offsets of entries. It's all has to be written to the central directory, at the end.

@prot0man
Copy link
Contributor Author

prot0man commented Jan 12, 2024

Yeah, I was following through zip_central_dir_delete trying to see if there problem was there. I at least verified that after the call to zip_central_dir_move in the first while loop, the end result of the zip (pState->m_central_dir.m_p) didn't contain any references to the deleted file2 entry (via hexdump'ing that buffer). It looks like some other piece of memory is still referencing the deleted entry though.

@kuba--
Copy link
Owner

kuba-- commented Jan 13, 2024

Generally deleting is kind of experimental feature, so I would be very careful and for sure do not mix it with writing.

@prot0man
Copy link
Contributor Author

prot0man commented Jan 13, 2024

It's actually still not clear to me why in zip_entries_delete_mark, the following is done:

  while (i < entry_num) {
    while ((i < entry_num) && (entry_mark[i].type == MZ_KEEP)) {
      ...
    }
    while ((i < entry_num) && (entry_mark[i].type == MZ_DELETE)) {
      ...
    }
    while ((i < entry_num) && (entry_mark[i].type == MZ_MOVE)) {
      ..
    }
    ...
}

Instead of:

  while (i < entry_num) {
    if(entry_mark[i].type == MZ_KEEP) {
      ...
    } else if(entry_mark[i].type == MZ_DELETE) {
      ...
    } else if (entry_mark[i].type == MZ_MOVE) {
      ..
    }
    ...
}

Is it some sort of optimization because doing those actions in bulk via the while loop is more efficient?

After I resolved some of my technical debt with the zip file format, I realized the problem I'm running into is that the central_dir is not in sync with the file headers in some way (probably because the seeks that are done when we're dealing with a file stream need to be mirrored for zip on the heap). I suspect I will be forced to treat the mode 'w' as delete as well or else I'd have to do a disgusting work around that:

  • Closes the zip I'm actively writing to and gets the zip data
  • Opens zip with zip data from previous step as delete
  • Removes entries
  • Closes zip and gets the zip data
  • Re-opens zip with zip data from previous step as write
  • Continue adding to zip until we've reached our size limit

I understand that separating the operations for delete and write simplify the API because keeping everything up-to-date for both in-memory zips and file-backed zips can get complex, but are there any other compelling reasons for this design? I'll continue pushing forward in my branch and hopefully come up with a solution that doesn't require a lot of flaming hoops to jump through.

@prot0man
Copy link
Contributor Author

I wonder if the issue is that in zip_close, you make a call to mz_zip_writer_finalize_archive, but in zip_stream_close, there is no analogous call to mz_zip_writer_finalize_heap_archive.

@prot0man
Copy link
Contributor Author

prot0man commented Jan 13, 2024

Nevermind, mz_zip_writer_finalize_heap_archive is just basically a wrapper on mz_zip_writer_finalize_archive that additionally copies the data into the passed buffer.

@prot0man
Copy link
Contributor Author

prot0man commented Jan 13, 2024

Alright, finally got things working in #332

I plan on doing some more thorough testing of zip_entries_delete when I get time to see if it handles the edge cases I think of.

@kuba--
Copy link
Owner

kuba-- commented Jan 13, 2024

Thanks!
Already merged, so I'm closing the issue.

If you want to add more comments, do not hesitate to reopen it, or create a new one.

@kuba-- kuba-- closed this as completed Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants