[DataCap Refresh] <2nd> Review of <Public Open Dataset Pathway> #284

Open
dampud opened this issue Jan 31, 2025 · 4 comments

Labels
  • Awaiting RKH: Refresh request has been verified by Public, Watchdog, and Governance - now awaiting release of DC
  • DataCap - Doubled
  • Refresh: Applications received from existing Allocators for a refresh of DataCap allowance

Comments

@dampud

dampud commented Jan 31, 2025

  1. Type of allocator: manual
  2. Paste your JSON number: 996
  3. Allocator verification: yes
  4. Allocator Application
  5. Compliance Report
  6. Previous reviews

Current allocation distribution

Client name                            DC granted
CommonCrawl                            4.736 PiB
w3s                                    1.97 PiB
Han Tang Cloud (XSG)                   0.5957 PiB
Mianyang Anyi Data Service Co., Ltd.   0.04 PiB
OpendataLab                            5 PiB
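
As a quick cross-check of the distribution above, the granted amounts can be totalled and expressed as shares of the pathway's allocations. This is a minimal sketch; the dictionary simply copies the table, and the helper name `allocation_shares` is made up for illustration.

```python
# DataCap granted per client, in PiB, copied from the table above.
granted_pib = {
    "CommonCrawl": 4.736,
    "w3s": 1.97,
    "Han Tang Cloud (XSG)": 0.5957,
    "Mianyang Anyi Data Service Co., Ltd.": 0.04,
    "OpendataLab": 5.0,
}


def allocation_shares(granted):
    """Return (total, {client: percent of total}) for a mapping of grants."""
    total = sum(granted.values())
    shares = {name: 100.0 * amount / total for name, amount in granted.items()}
    return total, shares


total_pib, shares = allocation_shares(granted_pib)
print(f"Total granted: {total_pib:.4f} PiB")  # Total granted: 12.3417 PiB

# List clients from largest to smallest share of the total.
for name, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

The total only reflects the figures in this table; per-client sections below may quote slightly different or rounded amounts.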

CommonCrawl

  • DC requested: 5 PiB
  • DC granted so far: 4.736 PiB

II. Dataset Completion

-

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

SP list disclosed in comments

IV. How many replicas has the client declared vs how many been made so far:

9 declared vs 10

V. Please provide a list of SPs used for deals and their retrieval rates

SPID         % retrieval   Meets >75% retrieval?
f0122215     83.33%        YES
f03035686    31.61%        NO
f03231154    61.45%        NO
f03157905    29.53%        NO
f03136267    0.00%         NO
f03088052    78.52%        YES
f03144077    0.00%         NO
f03252730    82.45%        YES
f03162991    82.73%        YES
f03218576    0.00%         NO
f03333933    0.00%         NO
f03157910    48.87%        NO
f03215853    28.58%        NO
f03099987    92.11%        YES
f03202108    92.03%        YES
f02029742    86.25%        YES
f03291373    55.10%        NO
f03260592    78.27%        YES
f03220172    51.66%        NO
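
The YES/NO column above follows from comparing each retrieval rate with the 75% threshold. The sketch below reproduces that check for the CommonCrawl rows. The rates are copied from the table; the names `THRESHOLD` and `meets_threshold` are illustrative, and the strict "greater than" comparison is an assumption based on the ">75%" wording.

```python
THRESHOLD = 75.0  # percent; assumed strict ">" per the ">75%" column header

# (SP ID, retrieval success rate in percent), copied from the table above.
commoncrawl_sps = [
    ("f0122215", 83.33), ("f03035686", 31.61), ("f03231154", 61.45),
    ("f03157905", 29.53), ("f03136267", 0.00), ("f03088052", 78.52),
    ("f03144077", 0.00), ("f03252730", 82.45), ("f03162991", 82.73),
    ("f03218576", 0.00), ("f03333933", 0.00), ("f03157910", 48.87),
    ("f03215853", 28.58), ("f03099987", 92.11), ("f03202108", 92.03),
    ("f02029742", 86.25), ("f03291373", 55.10), ("f03260592", 78.27),
    ("f03220172", 51.66),
]


def meets_threshold(rate, threshold=THRESHOLD):
    """True when the retrieval rate is strictly above the threshold."""
    return rate > threshold


passing = [sp for sp, rate in commoncrawl_sps if meets_threshold(rate)]
failing = [sp for sp, rate in commoncrawl_sps if not meets_threshold(rate)]

print(f"{len(passing)} of {len(commoncrawl_sps)} SPs exceed {THRESHOLD:g}% retrieval")
# 8 of 19 SPs exceed 75% retrieval
```

The same check can be run on the other clients' tables by swapping in their rows.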

W3s

  • DC requested: 4 PiB
  • DC granted so far: 2.95 PiB

II. Dataset Completion

Unavailable, due to the specifics of the data stored by the user.

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

The SP list was unavailable because of the specifics of the system the user relies on to select SPs. The allocator raised concerns about this, and the user addressed them.

IV. How many replicas has the client declared vs how many been made so far:

7 vs 10; however, 88.51% of deals are for data replicated across fewer than 4 storage providers, which will be clarified with the client.

V. Please provide a list of SPs used for deals and their retrieval rates

SPID         % retrieval   Meets >75% retrieval?
f01240       0.00%         NO
f01392893    50.05%        NO
f01786387    0.00%         NO
f01896422    0.00%         NO
f020378      86.68%        YES
f02639429    86.26%        YES
f0717969     87.12%        YES
f08403       28.50%        NO
f097777      0.00%         NO

Han Tang Cloud (XSG)

  • DC requested: 3 PiB
  • DC granted so far: 0.6 PiB

II. Dataset Completion

The client committed to preparing .csv files with metadata to link the files to their sources

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

yes

IV. How many replicas has the client declared vs how many been made so far:

4 vs 6

V. Please provide a list of SPs used for deals and their retrieval rates

SPID         % retrieval   Meets >75% retrieval?
f03178077    0.00%         NO
f03179570    5.85%         NO
f03179572    6.49%         NO
f03214937    7.19%         NO

Mianyang Anyi Data Service Co., Ltd.

  • DC requested: 1 PiB
  • DC granted so far: 0.04 PiB

II. Dataset Completion

The client prepared a proper mapping file that allows the sealed data to be connected with the original files.

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

No. The client has received only a first, small test round. Performance should improve before the next tranche is granted.

IV. How many replicas has the client declared vs how many been made so far:

1 vs 10

V. Please provide a list of SPs used for deals and their retrieval rates

SPID         % retrieval   Meets >75% retrieval?
f03321072    61.91%        NO

OpendataLab

  • DC requested: 5 PiB
  • DC granted so far: 5 PiB

II. Dataset Completion

No test retrieval was made because a proper mapping file was not available.

III. Does the list of SPs provided and updated in the issue match the list of SPs used for deals?

updating in comments

IV. How many replicas has the client declared vs how many been made so far:

8 vs 7

V. Please provide a list of SPs used for deals and their retrieval rates

SPID         % retrieval   Meets >75% retrieval?
f01926635    64.93%        NO
f01999119    88.08%        YES
f02035256    82.88%        YES
f02828242    0.00%         NO
f02851143    91.28%        YES
f03078786    0.00%         NO
f03099987    93.24%        YES
f03157910    47.87%        NO
f03202108    92.80%        YES
f03215853    29.25%        NO
f03218576    0.00%         NO
f03220172    50.09%        NO
f03220176    42.71%        NO
f03251993    51.24%        NO
f03253497    59.57%        NO
f03253580    82.83%        YES
f09693       5.00%         NO

Allocation summary

Notes from the Allocator

CommonCrawl, w3s: Low retrieval on some SPs was explained by the client and was caused by problems on the side of those SPs.
Han Tang Cloud (XSG): Improvement is expected before the next allocation is granted.
Mianyang Anyi Data Service Co., Ltd.: Improvement is expected before the next allocation is granted.

Did the allocator report up to date any issues or discrepancies that occurred during the application processing?

Yes, discrepancies are clarified on an ongoing basis

What steps have been taken to minimize unfair or risky practices in the allocation process?

Constant monitoring of performance, and a strict approach to ensuring the uniqueness of data followed up by KYC procedures that are conducted before an application is approved.

How did these distributions add value to the Filecoin ecosystem?

Selected data sets present unique or above-average value to society and community. We make every effort to ensure that the datasets are as well described and cataloged as possible, and that potential end users have knowledge of how to use them. Technical issues on the client side and the complexity of certain processes unfortunately led to technical problems, which were reflected in the reports. However, the priority of data uniqueness prevailed in this case.

Please confirm that you have maintained the standards set forward in your application for each disbursement issued to clients and that you understand the Fil+ guidelines set forward in your application

Yes

Please confirm that you understand that by submitting this Github request, you will receive a diligence review that will require you to return to this issue to provide updates.

Yes

filecoin-watchdog added the labels "Refresh" (Applications received from existing Allocators for a refresh of DataCap allowance) and "Awaiting Community/Watchdog Comment" (DataCap Refresh requests awaiting a public verification of the metrics outlined in Allocator App.) on Jan 31, 2025
@filecoin-watchdog
Collaborator

@dampud

CommonCrawl

  • Proper KYC and verification.
  • The list of Storage Providers (SPs) is not always disclosed or updated in the original thread, making comparisons and due diligence difficult.
  • Data was stored multiple times on the network; however, this was extensively explained by the client.
  • Although 10 replicas were declared, most of the data is stored in fewer than six copies.
  • Almost 50% of the data is stored on just two SPs.
  • A crucial data preparation step is missing, which would allow community users to connect data with CAR files stored on the network (index file). This step was mentioned by the client but has not yet been delivered.

W3s

  • The list of SPs is not fully updated in the original thread. Keeping it updated would improve compliance and due diligence efforts.
  • Issues with data retrievability:
    • 44.44% of storage providers have a retrieval success rate of zero.
    • 55.56% of storage providers have a retrieval success rate of less than 75%.
  • The client declared a need for storing 10 replicas, yet most data is stored in fewer than three copies.
  • Approximately 40% of the data is sealed on a single SP (f0717969).
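
The zero-retrieval share quoted above can be reproduced from the W3s retrieval table in the allocator's report. This is a sketch only: the rates are copied from that table, the helper `share_of_sps` is a made-up name, and results for other cut-offs depend on which snapshot of retrieval data is used.

```python
# Retrieval success rate (percent) per SP, copied from the W3s table
# in the allocator's report earlier in this thread.
w3s_rates = {
    "f01240": 0.00,
    "f01392893": 50.05,
    "f01786387": 0.00,
    "f01896422": 0.00,
    "f020378": 86.68,
    "f02639429": 86.26,
    "f0717969": 87.12,
    "f08403": 28.50,
    "f097777": 0.00,
}


def share_of_sps(rates, predicate):
    """Percentage of SPs whose retrieval rate satisfies the predicate."""
    matching = sum(1 for rate in rates.values() if predicate(rate))
    return 100.0 * matching / len(rates)


zero_share = share_of_sps(w3s_rates, lambda rate: rate == 0.0)
print(f"{zero_share:.2f}% of SPs have a retrieval success rate of zero")
# 44.44% of SPs have a retrieval success rate of zero (4 of 9)
```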

Han Tang Cloud (XSG)

  • Significant retrievability issues need to be addressed in collaboration with the client.
  • A user had declared the intention to provide a manifest.csv file that would allow future connections between sealed files and the original dataset. However, there is currently no access to such a file, and this should be followed up.
  • Four replicas are stored, despite six being declared.
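
For illustration, a manifest of the kind mentioned above could be a plain CSV that maps each sealed piece to the original files it contains. The sketch below is hypothetical throughout: the column names (`piece_cid`, `car_file`, `source_path`, `source_url`), the placeholder identifiers, and the example URLs are assumptions, not the format the client committed to.

```python
import csv
import io

# Hypothetical manifest contents. Column names and values are placeholders
# chosen for illustration, not the client's actual manifest.csv format.
MANIFEST_CSV = """\
piece_cid,car_file,source_path,source_url
piece-cid-placeholder-1,batch-0001.car,docs/a.txt,https://example.org/docs/a.txt
piece-cid-placeholder-1,batch-0001.car,docs/b.txt,https://example.org/docs/b.txt
piece-cid-placeholder-2,batch-0002.car,docs/c.txt,https://example.org/docs/c.txt
"""


def load_manifest(text):
    """Index manifest rows by piece CID so a sealed piece maps to its sources."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        index.setdefault(row["piece_cid"], []).append(row)
    return index


def sources_for_piece(index, piece_cid):
    """Return the original source URLs recorded for a sealed piece."""
    return [row["source_url"] for row in index.get(piece_cid, [])]


index = load_manifest(MANIFEST_CSV)
print(sources_for_piece(index, "piece-cid-placeholder-1"))
```

With a file like this published alongside the deals, a community member could start from a piece stored on the network and find the original files it was built from, which is the link the review asks for.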

Mianyang Anyi Data Service Co., Ltd.

  • Initial distribution among data centers is small, with only one SP. Before the next allocation, the allocator has stated that key issues should be addressed (e.g., increasing the number of SPs, improving replication, and enhancing retrievability).
  • File preparation was properly described by the user, including a shared page that allows the community to connect sealed data with the original files.

OpendataLab

  • Mixed results in retrievability, though generally acceptable.
  • Not all SPs were disclosed in advance.
  • Seven replicas were declared, but data is stored unevenly—most of it exists in only one copy, while small amounts have up to eight replicas.
  • The unique data stored by the client amounts to 1.49 PiB, whereas 1 PiB was declared.
  • Proper diligence was demonstrated in discussions. The allocator has been actively working to address ongoing issues.

filecoin-watchdog added the label "Awaiting Response from Allocator" (If there is a question that was raised in the issue that requires comment before moving forward.) and removed the label "Awaiting Community/Watchdog Comment" (DataCap Refresh requests awaiting a public verification of the metrics outlined in Allocator App.) on Feb 5, 2025
@dampud
Author

dampud commented Feb 5, 2025

Thank you for your review.
CommonCrawl, w3s: We will request clarifications about the compliance issues identified. Our team will continue to detect and address any issues on an ongoing basis.
Han Tang Cloud (XSG): The client has stated they will discontinue sealing to problematic nodes and will provide replacements.
Mianyang Anyi Data Service: The user must clearly demonstrate improved performance before the next distribution.
OpendataLab: The application reached its target data cap. If the organization seeks additional allocations in the future, further measures will be taken to maintain compliance.

Kevin-FF-USA added the label "Diligence Audit in Process" (Governance team is reviewing the DataCap distributions and verifying the deals were within standards) and removed the label "Awaiting Response from Allocator" (If there is a question that was raised in the issue that requires comment before moving forward.) on Feb 6, 2025
@galen-mcandrew
Collaborator

Overall strong diligence and bookkeeping from this pathway. Echoing the findings above:

  • Maintain diligence around dataset preparation requirements; continue to drive better standards for community
  • Continued diligence on retrieval, SP distribution, dataset replications
  • Dataset calculations to drive efficiency, reduce padding or over-replication

We are requesting an additional 20PiB of DataCap for this pathway.

@dampud
Author

dampud commented Feb 12, 2025

Thank you for the assessment, @galen-mcandrew.

Kevin-FF-USA added the labels "Awaiting RKH" (Refresh request has been verified by Public, Watchdog, and Governance - now awaiting release of DC) and "DataCap - Doubled", and removed the label "Diligence Audit in Process" (Governance team is reviewing the DataCap distributions and verifying the deals were within standards) on Feb 13, 2025