Finding marker genes in each group #136
Comments
Hi, thank you for the reply. So, actually what I am looking for are
|
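As {splatter} is a simulation package it doesn't perform clustering but it can produce datasets with clusters which we call "groups". This information is stored in the column data of the produced SingleCellExperiment object (colData(sce)$Group). The number and size of these groups are controlled by various parameters as described in the main splatter vignette <https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatter.html> and the parameters vignette <https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splat_params.html>.
The issue I linked to previously describes how to find the simulated DE genes for each of these groups (some of these can be considered "markers" but probably not all, it depends how you define that term).
|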
I see. Thank you very much for the information. I will check it out.
Best regards,
Huda
…On Fri, 18 Feb 2022, 08:00 Luke Zappia, ***@***.***> wrote:
As *{splatter}* is a simulation package it doesn't perform clustering but
it can produce datasets with clusters which we call "groups". This
information is stored in the column data of the produced
*SingleCellExperiment* object (colData(sce)$Group). The number and size
of these groups are controlled by various parameters as described in the main
splatter vignette
<https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatter.html>
and the parameters vignette
<https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splat_params.html>
.
The issue I linked to previously describes how to find the simulated DE
genes for each of these groups (some of these can be considered "markers"
but probably not all, it depends how you define that term).
|
Dear Luke Zappia,
I hope you stay safe and healthy. I have been trying to understand how Splatter generates synthetic data. I know that it does not perform clustering. However, when we set the number of genes, the number of cells, and the number of groups (clusters?) for the simulation, how are cells assigned to groups when Splatter generates the synthetic data? Are cells randomly chosen and assigned to a group, or does the assignment depend on the SC3 package (letting SC3 do the grouping)?
Thank you very much for your help.
Best regards,
Huda
|
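During the simulation process cells are randomly assigned to groups (with numbers depending on the group.prob parameter). You can consider these "Group" labels as the ground truth identity for each cell which you could compare to the clusters from whatever clustering method such as SC3. There is no direct relationship between {splatter} and any downstream analysis method.
|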
Dear Luke Zappia,
Thank you for the clarification about the clusters (groups). I am trying to generate synthetic data (based on an scRNA-seq glioblastoma dataset) using Splatter, but the clusters obtained from the synthetic data are quite different from the clusters in the real data. Both sets of clusters were obtained with the Seurat clustering algorithm. Please see the UMAP plots for both datasets. I have also attached the file in which I produce the synthetic data. Could you please have a look, in case I made a mistake in one of the steps?
In general, the steps I followed to produce the synthetic data are:
1. Load the real data (scRNA-seq glioblastoma).
2. Create a Seurat object from the loaded data.
3. Do the quality control (such as keeping only genes detected in at least 3 cells and cells with at least 350 detected genes).
4. Cluster the cells using the Seurat algorithm.
5. Plot the UMAP for these clusters.
6. Extract the count matrix.
7. Use this matrix as the input for the splatSimulate() function to obtain the synthetic data.
8. Set the number of groups I want (10 groups), the size of each group (randomly assigning the probability for each group), and the DE probability to 0.3.
9. Repeat steps 2-5 to obtain new clusters for the synthetic data.
10. Compare the UMAP plots (here I found that the clusters of the real data and of the synthetic data are substantially different).
Could you please help me understand why this is the case? Thank you very much.
Best regards,
Huda
… On 4 Mar 2022, at 08:04, Luke Zappia ***@***.***> wrote:
During the simulation process cells are randomly assigned to groups (with numbers depending on the group.prob parameter). You can consider these "Group" labels as the ground truth identity for each cell which you could compare to the clusters from whatever clustering method such as SC3. There is no direct relationship between {splatter} and any downstream analysis method.
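To illustrate the random assignment described above, here is a minimal sketch in Python (standard library only). It is not {splatter}'s actual R implementation: the function name assign_groups, the seed argument, and the example probabilities are hypothetical, and only the idea that each cell's group is drawn at random according to group.prob comes from the reply.

```python
import random
from collections import Counter


def assign_groups(n_cells, group_prob, seed=0):
    """Draw one group label per cell with the given probabilities.

    Hypothetical helper for illustration only; it mimics the idea of
    group.prob, not the {splatter} source code.
    """
    if abs(sum(group_prob) - 1.0) > 1e-8:
        raise ValueError("group_prob must sum to 1")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    labels = [f"Group{i + 1}" for i in range(len(group_prob))]
    # Each cell is assigned independently, so group sizes vary around
    # n_cells * probability rather than matching it exactly.
    return rng.choices(labels, weights=group_prob, k=n_cells)


groups = assign_groups(n_cells=1000, group_prob=[0.6, 0.3, 0.1])
sizes = Counter(groups)  # realised number of cells per group
```

Because the labels are drawn before any expression values are simulated, they act as ground truth that clusters from a downstream method can be compared against.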
|
I can't see the uploaded files (I think you need to attach them on GitHub, not by email) but this is not unexpected. The parameters for creating groups are not estimated directly from a real dataset and it would take some adjusting to get them to match. For example, I think you have randomly set group.prob but this would need to match the proportions in the real dataset (it's a bit hard to tell without any code). You may be able to get something that looks very close to your example dataset but it's non-trivial.
I would consider why you want to have an exact replica of a real dataset? The strength of simulations is being able to test scenarios for which there is no real data. If you already have this data I would suggest using it directly rather than trying to create a simulated copy of it. If there is a reason that won't work think about what scenario you want to simulate rather than just recreating a real dataset.
|
Dear Luke,
I see. I misunderstood the parameter estimation process in this case.
Yes, the points you mentioned make sense. I was planning to recreate (mimic) the real dataset to check whether the classification methods I am developing work properly, judging by whether both results show high similarity. However, as you mentioned, I can work with both datasets separately and use each of them for analysis.
Thank you very much for the explanation.
Best regards,
Huda
From: Luke Zappia ***@***.***>
Sent: Monday, March 7, 2022 8:26:57 AM
To: Oshlack/splatter ***@***.***>
Cc: Moh Huda ***@***.***>; Mention ***@***.***>
Subject: Re: [Oshlack/splatter] Finding marker genes in eah group (Issue #136)
I can't see the uploaded files (I think you need to attach them on GitHub, not by email) but this is not unexpected. The parameters for creating groups are not estimated directly from a real dataset and it would take some adjusting to get them to match. For example, I think you have randomly set group.prob but this would need to match the proportions in the real dataset (it's a bit hard to tell without any code). You may be able to get something that looks very close to your example dataset but it's non-trivial.
I would consider why you want to have an exact replica of a real dataset? The strength of simulations is being able to test scenarios for which there is no real data. If you already have this data I would suggest using it directly rather than trying to create a simulated copy of it. If there is a reason that won't work think about what scenario you want to simulate rather than just recreating a real dataset.
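The point about group.prob needing to match the proportions in the real dataset could be sketched as follows. This is a hypothetical Python helper, assuming you already have one cluster label per cell from the real data: cluster_proportions is not part of {splatter} or Seurat, and the labels below are made up.

```python
from collections import Counter


def cluster_proportions(cluster_labels):
    """Return the fraction of cells in each cluster, keyed by label.

    Hypothetical helper for illustration; the fractions sum to 1 and
    are ordered by sorted cluster label.
    """
    if not cluster_labels:
        raise ValueError("cluster_labels must not be empty")
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Made-up example: 100 cells spread over three clusters.
real_labels = ["0"] * 50 + ["1"] * 30 + ["2"] * 20
props = cluster_proportions(real_labels)
group_prob = list(props.values())  # candidate values for group.prob
```

Matching the group sizes this way only addresses one parameter; as noted above, getting a simulation to look close to a real dataset would still take further adjustment.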
|
I have been using splatter to generate synthetic scRNA-seq data. However, when I loaded the generated data into a Seurat object, I found that some clusters do not have a single marker gene (or sometimes only a few). Is there any way to find marker genes for each group or cluster and then add them to the metadata?