Introduction to genetics and genomics

Genome Structure

 

Genome

 

 

• The genome is all the DNA in a cell.
– All the DNA on all the chromosomes
– Includes genes, intergenic sequences, repeats
• More precisely, a genome is all the DNA in one organelle.
• Eukaryotes can have 2-3 genomes
– Nuclear genome
– Mitochondrial genome
– Plastid genome
• If not specified, “genome” usually refers to the nuclear genome.

 

Genomics

 

 

Genomics is the study of genomes, including large chromosomal segments containing many genes.

 

• The initial phase of genomics aims to map and sequence an initial set of entire genomes.
• Functional genomics aims to deduce information about the function of DNA sequences.
– This work should continue long after the initial genome sequences have been completed.

 

Human genome

 

 

 

• 22 autosome pairs + 2 sex chromosomes

 

• 3 billion base pairs in the haploid genome

 

 

• Where and what are the 30,000 to 40,000 genes?

From NCBI web site, photo from T. Ried, Natl Human Genome Research Institute, NIH

 

Components of the human Genome

 

 

• Human genome has 3.2 billion base pairs of DNA

 

• About 3% codes for proteins

 

• About 40-50% is repetitive, made by (retro)transposition

 

• What is the function of the remaining 50%?
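As a rough check on these proportions, the following Python sketch converts the quoted percentages into base pairs. The names are illustrative only, and the fractions are the approximate values from the slide, not measured data.

```python
# Approximate figures quoted on the slide (assumed values for illustration).
GENOME_BP = 3.2e9                      # base pairs in the human genome
CODING_FRACTION = 0.03                 # "about 3% codes for proteins"
REPEAT_FRACTION_RANGE = (0.40, 0.50)   # "about 40-50% is repetitive"


def bp_for_fraction(fraction, genome_bp=GENOME_BP):
    """Return the number of base pairs covered by a fraction of the genome."""
    return fraction * genome_bp


coding_bp = bp_for_fraction(CODING_FRACTION)
repeat_bp_low, repeat_bp_high = (bp_for_fraction(f) for f in REPEAT_FRACTION_RANGE)

# Whatever is neither coding nor repetitive is the "remaining" portion.
remaining_low = 1 - CODING_FRACTION - REPEAT_FRACTION_RANGE[1]
remaining_high = 1 - CODING_FRACTION - REPEAT_FRACTION_RANGE[0]

print(f"coding: about {coding_bp / 1e6:.0f} million bp")
print(f"repetitive: about {repeat_bp_low / 1e9:.2f} to {repeat_bp_high / 1e9:.2f} billion bp")
print(f"remaining: about {remaining_low:.0%} to {remaining_high:.0%} of the genome")
```

With these inputs the remaining portion comes out near half of the genome, which is the "remaining 50%" the slide asks about.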

 

The Genomics Revolution

 

 

 

• Know (close to) all the genes in a genome, and the sequence of the proteins they encode.

 

BIOLOGY HAS BECOME A FINITE SCIENCE

 

– Hypotheses have to conform to what is present, not what you could imagine could happen.

 

• No longer look at just individual genes
– Examine whole genomes or systems of genes

 

Finding the function of genes

 

Genome Structure

 

 

 

 

 

 

 

• Distinct components of genomes
• Abundance and complexity of mRNA
• Normalized cDNA libraries and ESTs
• Genome sequences: gene numbers
• Comparative genomics

 

Much DNA in large genomes is non-coding

 

 

 

• Complex genomes have roughly 10x to 30x more DNA than is required to encode all the RNAs or proteins in the organism.

 

• Contributors to the non-coding DNA include:
– Introns in genes
– Regulatory elements of genes
– Multiple copies of genes, including pseudogenes
– Intergenic sequences
– Interspersed repeats

 

Distinct components in complex genomes

 

 

 

• Highly repeated DNA
– R (repetition frequency) >100,000
– Almost no information, low complexity
• Moderately repeated DNA
– 10<R<10,000
– Little information, moderate complexity
• “Single copy” DNA
– R=1 or 2
– Much information, high complexity
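The thresholds above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and values of R that fall between the quoted ranges are reported as unassigned rather than guessed.

```python
def kinetic_component(r):
    """Map a repetition frequency R to the component named on the slide."""
    if r > 100_000:
        return "highly repeated"
    if 10 < r < 10_000:
        return "moderately repeated"
    if r in (1, 2):
        return "single copy"
    # The slide does not assign these ranges (e.g. 3-10 and 10,000-100,000).
    return "unassigned"


for r in (1, 500, 1_000_000):
    print(r, kinetic_component(r))
```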

 

 

Reassociation kinetics measure sequence complexity

 

Sequence complexity is not the same as length

 

Complexity is the number of base pairs of unique, i.e. nonrepeating, DNA.

 

• E.g. consider 1000 bp DNA.

 

• 500 bp is sequence a, present in a single copy.
• 500 bp is sequence b (100 bp) repeated 5X

      a      b  b  b  b  b
|___________|__|__|__|__|__|

 

L = length = 1000 bp = a + 5b

N = complexity = 600 bp = a + b
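The same arithmetic can be sketched in Python. The function name and the (unit length, copy number) representation are assumptions made for this illustration.

```python
def length_and_complexity(components):
    """Return (L, N) for a DNA made of repeated units.

    components: list of (unit_length_bp, copy_number) tuples.
    L counts every copy of every unit; N counts each distinct unit once.
    """
    total_length = sum(unit * copies for unit, copies in components)
    complexity = sum(unit for unit, _copies in components)
    return total_length, complexity


# Sequence a: 500 bp, single copy. Sequence b: 100 bp, repeated 5 times.
example = [(500, 1), (100, 5)]
L, N = length_and_complexity(example)
print(f"L = {L} bp, N = {N} bp")
```

Adding more copies of b would increase L but leave N unchanged, which is why complexity is not the same as length.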

 

Less complex DNA renatures faster

 

 

Let a, b, ... z represent a string of base pairs in DNA that can hybridize. For simplicity in arithmetic, we will use 10 bp per letter.

 

 

DNA 1 = ab. This is very low sequence complexity, 2 letters or 20 bp.

DNA 2 = cdefghijklmnopqrstuv. This is 10 times more complex (20 letters or 200 bp).

DNA 3 = izyajczkblqfreighttrainrunninsofastelizabethcottonqwftzxvbifyoudontbelieveimleavingyoujustcountthedaysimgonerxcvwpowentdowntothecrossroadstriedtocatchariderobertjohnsonpzvmwcomeonhomeintomykitchentrad.

This is 100 times more complex than DNA 1 (200 letters or 2000 bp).

 

Less complex DNA renatures faster, #2

For an equal mass/vol:
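As a hedged illustration of the equal-mass comparison, the Python sketch below uses the three example DNAs (10 bp per letter) and asks how many copies of each whole sequence are present per copy of the most complex one. It assumes, as the example does, that each DNA is treated as entirely unique sequence; the function names are made up for this sketch.

```python
BP_PER_LETTER = 10  # the slide's simplification: each letter stands for 10 bp

dna1 = "ab"
dna2 = "cdefghijklmnopqrstuv"
dna3 = (
    "izyajczkblqfreighttrainrunninsofastelizabethcottonqwftzxvbifyoud"
    "ontbelieveimleavingyoujustcountthedaysimgonerxcvwpowentdo"
    "wntothecrossroadstriedtocatchariderobertjohnsonpzvmwcomeon"
    "homeintomykitchentrad"
)


def complexity_bp(seq):
    """Complexity in bp, treating the whole string as unique sequence."""
    return len(seq) * BP_PER_LETTER


def relative_copies(seq, reference):
    """Copies of seq per copy of reference when both are at equal mass/volume."""
    return complexity_bp(reference) / complexity_bp(seq)


for name, seq in (("DNA 1", dna1), ("DNA 2", dna2), ("DNA 3", dna3)):
    print(name, complexity_bp(seq), "bp,", relative_copies(seq, dna3), "x copies")
```

At equal mass per volume, DNA 1 is present in 100 times as many copies as DNA 3, so each strand meets a complementary partner far more often. In the usual idealized second-order picture of reassociation, that higher concentration of each sequence is what makes less complex DNA renature faster.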

 

Types of DNA in each kinetic component

 

 

Human genomic DNA (Fig. 1.7.5)

 

Clustered repeated sequences

 

Human chromosomes, ideograms

G-bands

 

Tandem repeats on every chromosome: telomeres, centromeres

5 clusters of repeated rRNA genes: short arms of chromosomes 13, 14, 15, 21, 22

 

DNA Transposons

 

Almost all transposable elements in mammals fall into one of four classes

 

Short interspersed repetitive elements: SINEs

 

• Example: Alu repeats

 

– Most abundant repeated DNA in primates
– Short, about 300 bp
– About 1 million copies
– Likely derived from the gene for 7SL RNA
– Cause new mutations in humans
• They are retrotransposons
– DNA segments that move via an RNA intermediate.
• MIRs: Mammalian interspersed repeats
– SINEs found in all mammals

 

• Analogous short retrotransposons found in genomes of all vertebrates.

 

Long interspersed repetitive elements: LINEs

 

 

• Moderately abundant, long repeats
– LINE1 family: most abundant
– Up to 7000 bp long
– About 50,000 copies

• Retrotransposons

 

– Encode reverse transcriptase and other enzymes required for transposition

– No long terminal repeats (LTRs)

 

• Cause new mutations in humans

 

• Homologous repeats found in all mammals and many other animals
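To get a feel for how much DNA these families represent, the sketch below multiplies the quoted element lengths by the quoted copy numbers. All figures are the approximate slide values (Alu: about 300 bp, about 1 million copies; LINE1: up to 7000 bp, about 50,000 copies; genome: 3.2 billion bp). Because 7000 bp is a maximum length, the LINE1 result is an upper bound rather than an estimate.

```python
GENOME_BP = 3.2e9  # genome size quoted earlier in these slides


def family_bp(element_length_bp, copy_number):
    """Total base pairs contributed by a repeat family of uniform length."""
    return element_length_bp * copy_number


alu_bp = family_bp(300, 1_000_000)
line1_max_bp = family_bp(7000, 50_000)  # every copy assumed full length

print(f"Alu: {alu_bp / 1e6:.0f} Mb, about {alu_bp / GENOME_BP:.1%} of the genome")
print(f"LINE1: at most {line1_max_bp / 1e6:.0f} Mb, "
      f"at most {line1_max_bp / GENOME_BP:.1%} of the genome")
```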

 

Other common interspersed repeated sequences in humans

 

 

• LTR-containing retrotransposons

 

– MaLR: mammalian, LTR retrotransposons
– Endogenous retroviruses
– MER4 (MEdium Reiterated repeat, family 4)
• Repeats that resemble DNA transposons
– MER1 and MER2
– Mariner repeats
– Were active early in mammalian evolution but are now inactive

 

Finding repeats

 

 

• Compare a sequence to a database of known repeat sequences from the organism of interest

 

• RepeatMasker

 

• Arian Smit and P. Green, U. Wash.

 

• http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

 

• Try it on INS gene sequence
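The idea of comparing a sequence to a library of known repeats can be sketched with exact k-mer matching in Python. This is a toy illustration and not how RepeatMasker works internally; the library entry, the query, and all function names are invented for the example, and a real search would need to tolerate mismatches and gaps.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_repeat_index(repeat_library, k):
    """Collect all k-mers, from both strands, of the known repeat sequences."""
    index = set()
    for repeat in repeat_library.values():
        forward = repeat.upper()
        for strand in (forward, reverse_complement(forward)):
            index.update(kmers(strand, k))
    return index


def mask_repeats(query, repeat_index, k):
    """Lowercase every base covered by a k-mer that occurs in the index."""
    query = query.upper()
    covered = [False] * len(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in repeat_index:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(b.lower() if hit else b for b, hit in zip(query, covered))


# Hypothetical one-entry library and query, purely for demonstration.
toy_library = {"toy_repeat": "ACGTTGCAAG"}
index = build_repeat_index(toy_library, k=6)
masked = mask_repeats("TTTTACGTTGCAAGCCCC", index, k=6)
print(masked)
```

Soft-masking (lowercase) is used here so the original bases remain readable; replacing them with N would be an equally simple variant.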