Report stats comparing PAINT release versions #43

Open
dustine32 opened this issue Apr 3, 2020 · 4 comments
@dustine32
Collaborator

Add stats to the standard update pipeline reports that compare two versions of the PAINT release (i.e. the IBD file and the set of IBA GAFs). Ideally, the only parameters should be two dates corresponding to the before and after releases (e.g. 2020-01-31 and 2020-03-26).

We already have two reports that have not yet been committed to this repo:

  1. A simple SQL query to count IBDs created between the two parameter dates, split out by curator. Example result of comparing 2020-01-31 vs 2020-03-26:

     | name | count |
     | --- | --- |
     | Pascale Gaudet | 97 |
     | Huaiyu Mi | 978 |
     | Marc Feuermann | 2153 |
     | Michael Kesling | 884 |
     | Total | 4112 |

  2. A Python script that works only with the contents of our monthly releases posted on our FTP server. It compares sets of IBDs from the IBD.gaf files and cross-references to IBAs through the PANTHER:PTN in the IBA's with/from column.

Further description of the stats the Python script calculates:

  1. Added IBDs - Given two IBD/IBA sets, "before" and "after", find the IBDs in "after" that aren't in "before".
  2. Obsoleted IBDs - Find the IBDs in "before" that aren't in "after".
  3. Added IBAs - In the "after" set of IBA GAFs, count all IBAs that reference the PTN and term of an IBD in Added IBDs.
  4. Obsoleted IBAs - In the "before" set of IBA GAFs, count all IBAs that reference the PTN and term of an IBD in Obsoleted IBDs.
  5. Net IBA change = Added IBAs - Obsoleted IBAs
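
As a rough illustration of the five stats above, here is a minimal Python sketch. It is not the script described in this issue. It assumes standard tab-separated GAF columns (GO term in column 5, with/from in column 8), assumes that the IBD.gaf rows carry the PTN id in column 2, and identifies an IBD by its (PTN, term) pair. All function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

IbdKey = Tuple[str, str]  # (PTN id, GO term); assumed way to identify an IBD


def parse_gaf_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield column lists for GAF annotation lines, skipping '!' comments."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        row = line.rstrip("\r\n").split("\t")
        if len(row) >= 8:  # need at least up to the with/from column
            yield row


def ibd_keys(ibd_lines: Iterable[str]) -> Set[IbdKey]:
    """Collect (PTN, term) pairs; assumes column 2 = PTN id, column 5 = GO term."""
    return {(row[1], row[4]) for row in parse_gaf_rows(ibd_lines)}


def count_ibas(iba_lines: Iterable[str], ibds: Set[IbdKey]) -> int:
    """Count IBAs whose with/from PANTHER:PTN and GO term match an IBD in `ibds`."""
    count = 0
    for row in parse_gaf_rows(iba_lines):
        term = row[4]
        ptns = [
            entry.split(":", 1)[1]
            for entry in row[7].split("|")
            if entry.startswith("PANTHER:")
        ]
        if any((ptn, term) in ibds for ptn in ptns):
            count += 1
    return count


def compare_releases(
    before_ibd: Iterable[str],
    after_ibd: Iterable[str],
    before_iba: Iterable[str],
    after_iba: Iterable[str],
) -> Dict[str, int]:
    """Compute the five stats from the list above for one before/after pair."""
    before, after = ibd_keys(before_ibd), ibd_keys(after_ibd)
    added = after - before       # 1. Added IBDs
    obsoleted = before - after   # 2. Obsoleted IBDs
    added_ibas = count_ibas(after_iba, added)            # 3. Added IBAs
    obsoleted_ibas = count_ibas(before_iba, obsoleted)   # 4. Obsoleted IBAs
    return {
        "added_ibds": len(added),
        "obsoleted_ibds": len(obsoleted),
        "added_ibas": added_ibas,
        "obsoleted_ibas": obsoleted_ibas,
        "net_iba_change": added_ibas - obsoleted_ibas,   # 5. Net IBA change
    }
```

Each argument can be any iterable of lines, e.g. an open file handle for IBD.gaf or the concatenated lines of all IBA GAFs in a release.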

When running the script on "before" release 2020-01-31 and "after" release 2020-03-26 I get these numbers:

Added IBDs: 4062
Obsoleted IBDs: 1224
Added IBAs: 319,250
Obsoleted IBAs: 71,491
Net IBA change: 247,759

A third report displaying the % change by individual IBA GAF (e.g. paint_mgi, paint_human) as well as overall % change in IBA count will be added.
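
One possible shape for that third report, as a hedged Python sketch: it takes per-GAF IBA counts for the two releases and returns the % change for each file plus the overall change. The function names, the `overall` key, and the choice to report `None` when a file has no "before" count are all assumptions, not part of the existing pipeline.

```python
from typing import Dict, Optional, Tuple

Change = Tuple[int, int, Optional[float]]  # (before, after, % change)


def percent_change(before: int, after: int) -> Optional[float]:
    """Return the % change from before to after, or None if before is 0."""
    if before == 0:
        return None  # undefined; e.g. a GAF that is new in the "after" release
    return (after - before) * 100 / before


def iba_change_report(
    before_counts: Dict[str, int], after_counts: Dict[str, int]
) -> Dict[str, Change]:
    """Build per-GAF and overall IBA count changes between two releases."""
    report: Dict[str, Change] = {}
    for name in sorted(set(before_counts) | set(after_counts)):
        before = before_counts.get(name, 0)
        after = after_counts.get(name, 0)
        report[name] = (before, after, percent_change(before, after))
    total_before = sum(before_counts.values())
    total_after = sum(after_counts.values())
    # "overall" is a hypothetical key for the summed counts across all GAFs
    report["overall"] = (
        total_before,
        total_after,
        percent_change(total_before, total_after),
    )
    return report
```

A large negative % change for a single GAF (for example paint_mgi dropping sharply while the others stay flat) would be the kind of signal worth checking before a release goes out.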

These reports will help us quickly QA each update and catch potential data issues before they get out into the GO release data.

dustine32 self-assigned this Apr 3, 2020
@pgaudet
Collaborator

pgaudet commented Apr 6, 2020

Thanks @dustine32, this is great!

Will the stats be here? https://drive.google.com/drive/folders/1MrtIQVmtdfd6gJhVcEfofXrU0IIPnOW7

And I guess this will be a new file in the next release?

@dustine32
Collaborator Author

@pgaudet Right, it'll go in that folder, prefixed with the run date, e.g. 2020-03-26-[report_name]. Since these are sort of global update stats we can probably just call the new report 2020-03-26_update_stats? What do you think?

@pgaudet
Collaborator

pgaudet commented Apr 7, 2020

Sounds good.

dustine32 added a commit that referenced this issue Apr 7, 2020
@dustine32
Collaborator Author

Add "Net IBD change" count
