Add bgen-reader backend #38
One other thing I noticed that may be useful in the future is that bgen-reader supports variable ploidy and number of alleles, unlike PyBGEN, which is restricted to diploid, bi-allelic BGEN files.
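To make the ploidy/allele point concrete, here is a minimal sketch of how many probabilities one sample carries at one variant under the BGEN conventions for phased and unphased data. The helper name `n_probabilities` is hypothetical and not part of bgen-reader or PyBGEN; it only illustrates why a fixed-width array layout stops working once ploidy or allele count varies.

```python
from math import comb


def n_probabilities(ploidy: int, n_alleles: int, phased: bool) -> int:
    """Number of probabilities describing one sample at one variant.

    Hypothetical helper for illustration only.
    """
    if phased:
        # Phased: one probability per allele on each haplotype.
        return ploidy * n_alleles
    # Unphased: one probability per unordered genotype, i.e. the number of
    # multisets of size `ploidy` drawn from `n_alleles` alleles.
    return comb(n_alleles + ploidy - 1, ploidy)


# Diploid, bi-allelic, unphased: AA, AB, BB
print(n_probabilities(2, 2, phased=False))  # 3
# Diploid, bi-allelic, phased: two alleles on each of two haplotypes
print(n_probabilities(2, 2, phased=True))   # 4
# Triploid with three alleles, unphased
print(n_probabilities(3, 3, phased=False))  # 10
```

The diploid, bi-allelic case is the only one PyBGEN needs to handle, so the last dimension is always 3 there; with bgen-reader it could differ per variant.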
Very nice! Two options that come to mind for the representation are:
Thanks for setting out some options @eric-czech. I've tried approach 2 in this latest update. I added a If we go this route, we should probably just change Another thing I noticed is that
👍
Do we want to just do away with those wrappers now? My intention with them was mostly to organize code and attach attributes for later optimizations that aren't really necessary yet (or maybe ever), and certainly not to override internal Xarray methods. So perhaps this reader would be a good example of what reading in dosage data or hard calls would look like without them?
Yes, I think so. So in this case the
Yep. Some things that are probably worth thinking about though if the datasets are returned from readers as-is:
If we have them, then I think it's a good idea to add them. For the other points, I'm not sure what the best option is yet. I suggest we work through them as part of refactoring the API. I think removing the wrappers is big enough to warrant a separate issue and PR(s). This is basically an additive change, so we could merge it in the meantime.
👍, will do. I'll add an issue for it.
The code in #36 uses PyBGEN to read BGEN files. This (incomplete) PR adds a bgen-reader backend so we can compare the two libraries. A few differences/comments:
.sample side file.
The main problem with this code as it stands is that it doesn't create one of the genetic datasets defined in https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/data_structures.ipynb.
Each variant/sample has three probabilities (AA, AB, BB), but none of the data structures support that, unless I'm missing something.
GenotypeProbabilityDataset looks like the closest match, but it appears to require phased data, since it has four allele/ploidy combinations. Alistair's comment in https://discourse.smadstatgen.org/t/n-d-array-conventions-for-variation-data/16/8 seems relevant:
I wonder if we should add support for this data representation too.
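For reference, the three-probability layout described above could be sketched roughly as follows. This is plain NumPy with made-up values, not the output of bgen-reader and not one of the existing dataset classes; the dimension order (variant, sample, genotype) and the use of -1 for missing calls are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: 2 variants x 3 samples x 3 genotypes (AA, AB, BB).
# The last sample at the second variant is missing (all NaN).
probs = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.75, 0.25, 0.0], [0.25, 0.5, 0.25], [np.nan, np.nan, np.nan]],
])

# Expected dosage of the B allele: P(AB) + 2 * P(BB).
dosage = probs[..., 1] + 2 * probs[..., 2]

# Hard calls: index of the most likely genotype, with -1 marking missing data.
missing = np.isnan(probs).any(axis=-1)
calls = np.where(missing, -1, np.argmax(np.nan_to_num(probs), axis=-1))

print(dosage)
print(calls)
```

Both the dosage and the hard calls drop the genotype dimension, so either could be derived from an unphased probability array if we decide to support one.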
Thoughts @eric-czech?