A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets. Issue 1 (30th December 2022)
- Record Type:
- Journal Article
- Title:
- A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets. Issue 1 (30th December 2022)
- Main Title:
- A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets
- Authors:
- Lee, Sanghun
Hahn, Georg
Hecker, Julian
Lutz, Sharon M
Mullin, Kristina
Hide, Winston
Bertram, Lars
DeMeo, Dawn L
Tanzi, Rudolph E
Lange, Christoph
Prokopenko, Dmitry
- Abstract:
- Genetic similarity matrices are commonly used to assess population substructure (PS) in genetic studies. Through simulation studies and by the application to whole-genome sequencing (WGS) data, we evaluate the performance of three genetic similarity matrices: the unweighted and weighted Jaccard similarity matrices and the genetic relationship matrix. We describe different scenarios that can create numerical pitfalls and lead to incorrect conclusions in some instances. We consider scenarios in which PS is assessed based on loci that are located across the genome ('globally') and based on loci from a specific genomic region ('locally'). We also compare scenarios in which PS is evaluated based on loci from different minor allele frequency bins: common (>5%), low-frequency (5–0.5%) and rare (<0.5%) single-nucleotide variations (SNVs). Overall, we observe that all approaches provide the best clustering performance when computed based on rare SNVs. The performance of the similarity matrices is very similar for common and low-frequency variants, but for rare variants, the unweighted Jaccard matrix provides preferable clustering features. Based on visual inspection and in terms of standard clustering metrics, its clusters are the densest and the best separated in the principal component analysis of variants with rare SNVs compared with the other methods and different allele frequency cutoffs. In an application, we assessed the role of rare variants on local and global PS, using WGS data from multiethnic Alzheimer's disease data sets and European or East Asian populations from the 1000 Genome Project. …
- Is Part Of:
- Briefings in bioinformatics. Volume 24:Issue 1(2023)
- Journal:
- Briefings in bioinformatics
- Issue:
- Volume 24:Issue 1(2023)
- Issue Display:
- Volume 24, Issue 1 (2023)
- Year:
- 2023
- Volume:
- 24
- Issue:
- 1
- Issue Sort Value:
- 2023-0024-0001-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-12-30
- Subjects:
- population stratification -- similarity matrix -- Jaccard matrix -- genetic relationship matrix -- rare variant -- principal component analysis
Genetics -- Data processing -- Periodicals
Molecular biology -- Data processing -- Periodicals
Genomes -- Data processing -- Periodicals
572.80285
- Journal URLs:
- http://bib.oxfordjournals.org
http://www.oxfordjournals.org/content?genre=journal&issn=1477-4054
http://ukcatalogue.oup.com/
http://firstsearch.oclc.org
- DOI:
- 10.1093/bib/bbac611
- Languages:
- English
- ISSNs:
- 1467-5463
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 2283.958363
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 25160.xml