A new human reference genome represents the most common sequences

The human reference genome is a DNA blueprint used as a standard for comparison in basic research and clinical settings. Despite years of improvements in its accuracy and completeness, it still has limitations that can lead to erroneous results.

In the current version of the reference, called GRCh38 or Build 38, 93 percent of the sequence comes from just 11 people and 70 percent from just one man, resulting in a lack of diversity and at least 300 million missing DNA letters. In addition, a small percentage of the genes in the reference genome are represented by alleles that are not the most common forms of those genes.

To address these issues, some scientists are developing a new kind of reference, the pangenome or graph genome, which combines a large collection of genomes to represent all of the DNA sequences observed at any given location. But representing these data (the 3 billion bases in one person's genome multiplied by the hundreds of thousands of people scientists want to include) is extremely complicated.

The problem with a pangenome is that integrating it with existing research practices and software would be a huge undertaking, because it requires a graph representation rather than a single linear genome. For example, the methods used in transcriptomics to determine which genes are active in a particular cell would need to be completely revised.

“Most of the methods that do transcript expression analysis work on a single sequence, such as a single reference genome, or expect one as input. They don’t expect a graph,” says Christina Boucher, a bioinformatician at the University of Florida. “That’s a big leap in input. So the methods that actually perform the transcript expression analysis would have to be redesigned to take a graph instead of a single reference. The algorithms in and of themselves would have to be redeveloped.”

Because of this, researchers such as Jesse Gillis, a computational biologist at Cold Spring Harbor Laboratory, came up with a different idea: the “consensus genome.” It is still a single linear genome, just like the current reference, but it represents the most common alleles among thousands of individuals rather than whatever alleles the few individuals used to build the current reference happen to carry. This allows for near-painless adoption in existing genome analysis software, says Gillis.
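The core idea of a consensus genome, picking the most common allele at each position, can be sketched in a few lines of Python. This is only a toy illustration, not the method used in the preprint: it assumes the input sequences are already aligned and of equal length, and the function name `consensus_sequence` and the example sequences are hypothetical.

```python
from collections import Counter


def consensus_sequence(haplotypes):
    """Return the most common base at each position.

    Assumes every sequence is already aligned and has the same length
    (a simplification; real genomes differ by insertions and deletions).
    """
    if not haplotypes:
        raise ValueError("need at least one sequence")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("sequences must all be the same length")

    consensus = []
    for column in zip(*haplotypes):
        # most_common(1) yields the allele with the highest count;
        # ties go to the allele seen first in the column.
        allele, _count = Counter(column).most_common(1)[0]
        consensus.append(allele)
    return "".join(consensus)


# Hypothetical toy data: four individuals, five positions.
haplotypes = ["ACGTA", "ACGTT", "ATGTT", "ACGCT"]
print(consensus_sequence(haplotypes))  # ACGTT
```

Note that the result need not match any single individual exactly; it is simply the majority call at each position, which is why it can stay a single linear sequence.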

In a preprint posted to bioRxiv on December 22, Gillis and colleagues, including Alexander Dobin of Cold Spring Harbor Laboratory, who developed the popular RNA sequencing analysis software STAR, compared their consensus genome with the current reference genome and with population-specific consensus genomes, which they built both for superpopulations, such as East Asian, and for subpopulations, such as Han Chinese in Beijing.

See “The Pangenome: Are Individual Reference Genomes Dead?”

They created the consensus genomes using the 1000 Genomes Project database, which contains more than 2,500 genomes from 26 subpopulations grouped into five superpopulations. They then tested how GRCh38 and each consensus genome performed in transcriptomic analysis with STAR, to see whether improving the input reference genome would improve gene expression analysis.

As with DNA analysis, the data obtained from RNA sequencing come in short pieces called reads. To determine where in the genome these pieces came from, researchers often match the reads to a reference genome, a process known as mapping or alignment. They can then count how much messenger RNA there is for each gene to quantify gene activity.
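The counting step described above can be illustrated with a small Python sketch. It assumes the reads have already been mapped, so each read is reduced to a chromosome and a position; the gene names, coordinates, and the function `count_reads_per_gene` are hypothetical and stand in for what a real aligner and annotation file would provide.

```python
# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "GENE_A": ("chr1", 100, 199),
    "GENE_B": ("chr1", 300, 499),
}


def count_reads_per_gene(alignments, genes=GENES):
    """Count mapped reads falling inside each gene.

    alignments: iterable of (chromosome, position) pairs, one per mapped read.
    Reads that fall outside every annotated gene are ignored.
    """
    counts = {name: 0 for name in genes}
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
                break  # assign each read to at most one gene
    return counts


# Five toy reads: two in GENE_A, one in GENE_B, two outside any gene.
alignments = [("chr1", 120), ("chr1", 150), ("chr1", 310), ("chr2", 120), ("chr1", 250)]
print(count_reads_per_gene(alignments))  # {'GENE_A': 2, 'GENE_B': 1}
```

Because the counts depend entirely on where each read lands, a read mapped to the wrong place changes the expression estimate for one or two genes, which is why the choice of reference matters.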

As a baseline, Gillis and his colleagues first aligned an individual's reads to that individual's own genome and measured gene expression. They then did the same using the reference and consensus genomes and compared the results with the baseline, quantifying the differences, or degree of error, between them.
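One simple way to express this kind of comparison in code is shown below. The article does not spell out the preprint's exact error metrics, so the definitions here are assumptions: a read counts as a mapping error if it lands somewhere different from where it landed against the personal genome, and a gene counts as affected if its read count changes. All names and data are hypothetical.

```python
def mapping_error_rate(baseline, candidate):
    """Fraction of baseline reads placed differently (or left unmapped)
    when aligned to the candidate genome.

    Both arguments map read ID -> (chromosome, position).
    """
    if not baseline:
        raise ValueError("baseline is empty")
    mismatches = sum(
        1 for read_id, locus in baseline.items() if candidate.get(read_id) != locus
    )
    return mismatches / len(baseline)


def genes_with_changed_counts(baseline_counts, candidate_counts):
    """Genes whose read count differs from the baseline count."""
    genes = set(baseline_counts) | set(candidate_counts)
    return sorted(
        g for g in genes if baseline_counts.get(g, 0) != candidate_counts.get(g, 0)
    )


# Toy data: placements against the individual's own genome (the baseline)
# versus placements against another genome.
personal = {"r1": ("chr1", 120), "r2": ("chr1", 150), "r3": ("chr1", 310), "r4": ("chr1", 320)}
other = {"r1": ("chr1", 120), "r2": ("chr1", 350), "r3": ("chr1", 310), "r4": ("chr1", 320)}

print(mapping_error_rate(personal, other))  # 0.25
print(genes_with_changed_counts({"GENE_A": 2, "GENE_B": 2}, {"GENE_A": 1, "GENE_B": 3}))
# ['GENE_A', 'GENE_B']
```

In this toy case a single misplaced read shifts the counts of two genes at once, one losing a read and the other gaining it.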

They found that although the inaccuracies the reference genome introduces during alignment and gene expression measurement are small, according to Gillis, the consensus genomes produced even fewer errors. Compared with the reference genome, the consensus genomes improved the mapping error rate from around 9 percent to around 4 percent. And because mapping errors lead to errors in counting messenger RNA, the reference also produced errors in measuring gene expression in almost six times as many genes as the consensus did.

“While an absolute two- to threefold reduction sounds like an impressive difference, the reality is that it goes from what I would call exceptionally good to slightly better than exceptionally good,” says Gillis. “And that should be a relief, because we’ve been doing science with the reference for a long time. If we found this to be a life-changing difference, it would be worrying.”

Gillis and his team also found that the population-specific genomes offered only a marginal reduction in error beyond the general consensus genome, with a maximum difference of about 1 percent. This suggests that RNA sequencing analysis may not require dedicated references for each population.

This is good news for Elizabeth Atkinson of Massachusetts General Hospital and the Broad Institute of MIT and Harvard, who studies admixed populations, whose recent ancestry comes from multiple sources. She says that not only would population-specific genomes make it difficult to compare individuals with multiple ancestries, but assigning people to these groups is also challenging.

“If you have someone of mixed race, which race do you choose for the population consensus genome?” says Atkinson. “The population is getting more and more mixed over time, so it makes sense to me that if the pan [consensus] option seems to work as effectively [as the population-specific consensus], this would bypass some of those wrinkles when comparing populations and deciding how to even map people to their correct population.”

Although Gillis believes other researchers could replicate these consensus genomes relatively quickly, he and his colleagues have developed software that others can use to build their own consensus genomes and run RNA sequencing analyses. The programs are free, open source, and available on GitHub.


