Finding marker genes in each group #136

emkahuda · 2022-02-16T19:41:10Z

I have been using splatter to generate synthetic scRNA-seq data. However, when I loaded the simulated data into a Seurat object, I found that some clusters have no marker genes at all (or only a few). Is there a way to find the marker genes for each group or cluster and then add them to the metadata?

lazappi · 2022-02-17T08:12:10Z

Hi @emkahuda

Thanks for giving {splatter} a go! Please have a look at issue #57 for information on how to find the ground truth simulated DE genes. I will also point out that not all of the simulated DE genes may be detected by a marker gene test, depending on the fold changes etc.

emkahuda · 2022-02-17T11:20:16Z

Hi, thank you for the reply. What I am actually looking for is:

1. How can we know which cells belong to which cell type (which cluster or group)?
2. Then, once we know the list of cells in each group, can we find the list of marker genes for each group?

I cannot find a clustering algorithm in splatter. As I understand it, we can set what kind of synthetic data we want, such as the number of groups or clusters and the size of those groups, as well as the probability that a gene is a DE gene. Could you please help? I might have missed something.

lazappi · 2022-02-18T08:00:19Z

As {splatter} is a simulation package it doesn't perform clustering, but it can produce datasets with clusters, which we call "groups". This information is stored in the column data of the produced SingleCellExperiment object (colData(sce)$Group). The number and size of these groups are controlled by various parameters, as described in the main splatter vignette (https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatter.html) and the parameters vignette (https://bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splat_params.html).

The issue I linked to previously describes how to find the simulated DE genes for each of these groups (some of these can be considered "markers" but probably not all; it depends on how you define that term).
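To illustrate why "DE gene" and "marker" are not the same thing, here is a minimal Python sketch. It is not splatter code: it assumes each gene has a per-group multiplicative DE factor where 1 means "no change", and the dictionary layout, the function names and the `min_fold` threshold are all hypothetical choices made for this example.

```python
def de_genes_for_group(de_factors, group):
    """Genes whose DE factor in `group` differs from 1 (1 = no change)."""
    return [gene for gene, facs in de_factors.items() if facs[group] != 1.0]


def marker_genes_for_group(de_factors, group, min_fold=2.0):
    """One possible 'marker' definition (an assumption, not a standard):
    up by at least `min_fold` in `group` and not up-regulated elsewhere."""
    markers = []
    for gene, facs in de_factors.items():
        others = [f for g, f in facs.items() if g != group]
        if facs[group] >= min_fold and all(f <= 1.0 for f in others):
            markers.append(gene)
    return markers


# Made-up DE factors for four genes in two groups (illustrative values only).
de_factors = {
    "Gene1": {"Group1": 3.0, "Group2": 1.0},  # strongly up in Group1
    "Gene2": {"Group1": 1.2, "Group2": 1.0},  # DE, but only weakly
    "Gene3": {"Group1": 0.5, "Group2": 1.0},  # DE, but down-regulated
    "Gene4": {"Group1": 1.0, "Group2": 2.5},  # strongly up in Group2
}

print(de_genes_for_group(de_factors, "Group1"))
print(marker_genes_for_group(de_factors, "Group1"))
```

Under this particular definition three genes are DE in Group1 but only one of them would count as a marker; a different threshold or definition would give a different list.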

emkahuda · 2022-02-18T17:27:59Z

I see. Thank you very much for the information. I will check it out. Best regards, Huda

emkahuda · 2022-03-03T22:36:55Z

Dear Luke Zappia, I hope you stay safe and healthy. I have been trying to understand how Splatter generates synthetic data. I know it does not perform clustering to begin with. However, when we set the number of genes, the number of cells, and the number of groups (clusters?) for the simulation, how are cells assigned to groups when splatter generates the synthetic data? Are cells randomly chosen and assigned to a group? Or does it depend on the SC3 package (letting SC3 do the grouping)? Thank you very much for your help. Best regards, Huda

lazappi · 2022-03-04T08:04:41Z

During the simulation process cells are randomly assigned to groups (with numbers depending on the group.prob parameter). You can consider these "Group" labels as the ground truth identity for each cell, which you could compare to the clusters from any clustering method, such as SC3. There is no direct relationship between {splatter} and any downstream analysis method.
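As a rough illustration of the random assignment described above, the following Python sketch draws a group label for every cell according to a vector of probabilities and then cross-tabulates ground truth labels against cluster labels. It is a conceptual sketch rather than splatter's actual code: it assumes each cell is drawn independently, and the function names and label format are hypothetical.

```python
import random
from collections import Counter


def assign_groups(n_cells, group_prob, seed=None):
    """Draw a group label for each cell independently, weighted by group_prob."""
    if abs(sum(group_prob) - 1.0) > 1e-8:
        raise ValueError("group_prob must sum to 1")
    rng = random.Random(seed)
    labels = [f"Group{i + 1}" for i in range(len(group_prob))]
    return rng.choices(labels, weights=group_prob, k=n_cells)


def cross_tabulate(truth, clusters):
    """Count cells for every (ground truth group, cluster) pair."""
    return Counter(zip(truth, clusters))


# Group sizes are random, so they only approximately follow 60% / 30% / 10%.
groups = assign_groups(n_cells=1000, group_prob=[0.6, 0.3, 0.1], seed=1)
sizes = Counter(groups)
print(sizes)

# Toy comparison of ground truth labels with made-up cluster labels.
table = cross_tabulate(["Group1", "Group1", "Group2"], ["a", "b", "a"])
print(table)
```

Because the labels are sampled, the realised group sizes vary from run to run unless a seed is fixed; the cross-tabulation is the simplest way to see how well a clustering recovers the simulated groups.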

emkahuda · 2022-03-06T23:42:31Z

Dear Luke Zappia, Thank you for the clarification about the clusters (groups). I am trying to generate synthetic data (from an scRNA-seq Glioblastoma dataset) using Splatter, but the clusters obtained from the synthetic data are really different from the clusters in the real data. Both sets of clusters were obtained with the Seurat clustering algorithm. Please see the UMAP plots for both datasets. I have also attached the file in which I produce the synthetic data. Could you please have a look, in case I made a mistake in one of the steps? In general, this is what I have done to produce the synthetic data:

1. Load the real data (scRNA-seq Glioblastoma).
2. Create a Seurat object for the loaded data.
3. Do the quality control (using a minimum of 3 cells and a minimum of 350 genes as thresholds).
4. Cluster the cells using the Seurat algorithm.
5. Plot the UMAP for these clusters.
6. Get the count matrix.
7. Use this matrix as the input for the splatSimulate() function to obtain the synthetic data.
8. Set the number of groups I want (10 groups), the size of each group (randomly assigning the probability for each size), and the DE probability to 0.3.
9. Repeat steps 2-5 to obtain new clusters for the synthetic data.
10. Compare the UMAP plots (here I found that the clusters of the real data and of the synthetic data are very different).

Could you please help me understand why this is the case? Thank you very much. Best regards, Huda

lazappi · 2022-03-07T08:26:44Z

I can't see the uploaded files (I think you need to attach them on GitHub, not by email) but this is not unexpected. The parameters for creating groups are not estimated directly from a real dataset and it would take some adjusting to get them to match. For example, I think you have randomly set group.prob but this would need to match the proportions in the real dataset (it's a bit hard to tell without any code). You may be able to get something that looks very close to your example dataset but it's non-trivial.
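As a small illustration of matching group.prob to a real dataset, the Python sketch below turns a vector of cluster labels into proportions. It is a hypothetical helper, not part of splatter or Seurat, and the cluster labels are made up; in practice they would come from the clustering of the real data.

```python
from collections import Counter


def cluster_proportions(cluster_labels):
    """Fraction of cells in each cluster, ordered by cluster label."""
    if not cluster_labels:
        raise ValueError("cluster_labels must not be empty")
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return {cluster: counts[cluster] / total for cluster in sorted(counts)}


# Made-up cluster labels for ten cells (illustrative only).
real_clusters = ["0"] * 5 + ["1"] * 3 + ["2"] * 2
props = cluster_proportions(real_clusters)
print(props)

# These values, in a fixed cluster order, are what group.prob would need to match.
group_prob = list(props.values())
print(group_prob)
```

Matching the proportions only addresses group sizes; the DE-related parameters would still need adjusting before the simulated clusters resemble the real ones.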

I would consider why you want to have an exact replica of a real dataset. The strength of simulations is being able to test scenarios for which there is no real data. If you already have this data I would suggest using it directly rather than trying to create a simulated copy of it. If there is a reason that won't work, think about what scenario you want to simulate rather than just recreating a real dataset.

emkahuda · 2022-03-07T09:59:52Z

Dear Luke, I see. I misunderstood the parameter estimation process in this case. Yes, the points you mentioned make sense. I was planning to recreate (mimic) the real dataset to check whether the classification methods I am developing work properly, which I would conclude if both results showed high similarity. However, as you mentioned, I can actually use the two datasets separately for analysis. Thank you very much for the explanation. Best regards, Huda

lazappi added the question label Feb 17, 2022

lazappi closed this as completed Apr 8, 2022
