Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions for the development and function of living things. All known cellular life and some viruses contain DNA. The main role of DNA in the cell is the long-term storage of information. It is often compared to a blueprint, since it contains the instructions to construct other components of the cell, such as proteins and RNA molecules. The DNA segments that carry genetic information are called genes, but other DNA sequences have structural purposes, or are involved in regulating the expression of genetic information.
In eukaryotes such as animals and plants, DNA is stored inside the cell nucleus, while in prokaryotes such as bacteria, the DNA is in the cell's cytoplasm. Unlike enzymes, DNA does not act directly on other molecules; rather, various enzymes act on DNA and copy its information either into more DNA, in DNA replication, or into RNA, in transcription. In chromosomes, chromatin proteins such as histones compact and organize DNA, as well as helping control its interactions with other proteins in the nucleus.
The structure of part of a DNA double helix.
DNA is a long polymer of simple units called nucleotides, which are held together by a backbone made of sugars and phosphate groups. This backbone carries four types of molecules called bases, and it is the sequence of these four bases that encodes information. The major function of DNA is to encode the sequence of amino acid residues in proteins, using the genetic code. To read the genetic code, cells make a copy of a stretch of DNA in the nucleic acid RNA. These RNA copies can then be used to direct protein synthesis, but they can also be used directly as parts of ribosomes or spliceosomes.
Physical and chemical properties
The two strands of DNA are held together by hydrogen bonds between bases. The sugars in the backbone are shown in light blue.
DNA is a long polymer made from repeating units called nucleotides. The DNA chain is 22 to 24 angstroms wide and one nucleotide unit is 3.3 angstroms long.[1] Although these repeating units are very small, DNA polymers can be enormous molecules containing millions of nucleotides. For instance, the largest human chromosome is 220 million base pairs long.[2]
In living organisms, DNA does not usually exist as a single molecule, but instead as a tightly-associated pair of molecules. These two long strands entwine like vines, in the shape of a double helix (see the illustration above). The nucleotide repeats contain both the backbone of the molecule, which holds the chain together, and a base, which interacts with the other DNA strand in the helix. In general, a base linked to a sugar is called a nucleoside and a base linked to a sugar and one or more phosphate groups is called a nucleotide. If multiple nucleotides are linked together, as in DNA, this polymer is referred to as a polynucleotide.
The backbone of the DNA strand is made from alternating phosphate and sugar residues. The sugar in DNA is the pentose (five-carbon) sugar 2-deoxyribose. The sugars are joined together by phosphate groups that form phosphodiester bonds between the third and fifth carbon atoms of adjacent sugar rings. These asymmetric bonds mean a strand of DNA has a direction. In a double helix the direction of the nucleotides in one strand is opposite to their direction in the other strand. This arrangement of DNA strands is called antiparallel. The asymmetric ends of a DNA strand are referred to as the 5' (five prime) and 3' (three prime) ends. One of the major differences between DNA and RNA is the sugar, with 2-deoxyribose being replaced by the alternative pentose sugar ribose in RNA.
The DNA double helix is held together by hydrogen bonds between the bases attached to the two strands.[3] The four bases found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T).[3] These four bases are shown below and are attached to the sugar/phosphate to form the complete nucleotide, as shown for adenosine monophosphate.
Structures of the four bases found in DNA (adenine, guanine, thymine and cytosine) and the nucleotide adenosine monophosphate.
These bases are classified into two types: adenine and guanine are fused five- and six-membered heterocyclic compounds called purines, while cytosine and thymine are six-membered rings called pyrimidines. A fifth pyrimidine base, called uracil (U), replaces thymine in RNA and differs from thymine by lacking a methyl group on its ring. Uracil is normally only found in DNA as a breakdown product of cytosine, but a very rare exception to this rule is a bacterial virus called PBS1 that contains uracil in its DNA.[4]
Structure of a section of DNA. The bases lie horizontally between the two spiraling strands. Created from PDB 1D65.
The double helix is a right-handed spiral. As the DNA strands wind around each other, they leave gaps between each set of phosphate backbones, revealing the sides of the bases inside. There are two of these grooves twisting around the surface of the double helix: one groove is 22 angstroms wide and the other 12 angstroms wide.[5] The larger groove is called the major groove, while the smaller, narrower groove is called the minor groove. The narrowness of the minor groove means that the edges of the bases are more accessible in the major groove. As a result, proteins like transcription factors that can bind to specific sequences in double-stranded DNA usually read the sequence by making contacts to the sides of the bases exposed in the major groove.[6]
A GC base pair with three hydrogen bonds (shown as dashed lines).
An AT base pair with two hydrogen bonds (shown as dashed lines).
Base pairing
Further information: Base pair
Each type of base on one strand forms a bond with just one type of base on the other strand. This is called complementary base pairing. Here, purines form hydrogen bonds to pyrimidines, with A bonding only to T, and C bonding only to G. This arrangement of two nucleotides joined together across the double helix is called a base pair. In a double helix, the two strands are also held together by forces generated by the hydrophobic effect and pi stacking, but these forces are not affected by the sequence of the DNA.[7] As hydrogen bonds are not covalent, they can be broken and rejoined relatively easily. The two strands of DNA in a double helix can therefore be pulled apart like a zipper, either by a mechanical force or high temperature. As a result of this complementarity, all the information in the double-stranded sequence of a DNA helix is duplicated on each strand, which is vital in DNA replication. Indeed, this reversible and specific interaction between complementary base pairs is critical for all the functions of DNA in living organisms.
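The pairing rule described above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the names `COMPLEMENT`, `complement` and `reverse_complement` are chosen here and do not come from the article, and sequences are assumed to be uppercase strings written 5' to 3'.

```python
# Illustrative sketch: Watson-Crick pairing rule from the text (A-T, C-G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base that pairs with each position of `strand`, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand written 5' to 3'.

    The partner runs antiparallel, so the complement is reversed.
    """
    return complement(strand)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```

Because each base determines its partner, applying `reverse_complement` twice returns the original sequence, which is the sense in which all the information is duplicated on each strand.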
The two types of base pairs form different numbers of hydrogen bonds, AT forming two hydrogen bonds, and GC forming three hydrogen bonds (see figures, left). The GC base-pair is therefore stronger than the AT base pair. As a result, it is both the percentage of GC base pairs and the overall length of a DNA double helix that determine the strength of the association between the two strands of DNA. Long DNA helices with a high GC content have strongly interacting strands, while short helices with high AT content have weakly interacting strands. Parts of the DNA double helix that need to separate easily, such as the TATAAT Pribnow box in bacterial promoters, tend to have sequences with a high AT content, making the strands easier to pull apart.[8] In the laboratory, the strength of this interaction can be measured by finding the temperature required to break the hydrogen bonds, their melting temperature (also called Tm value). When all the base pairs in a DNA double helix melt, the strands separate and exist in solution as two entirely independent molecules. These single-stranded DNA molecules have no single shape, but some conformations are more stable than others.[9]
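As a rough illustration of how GC content and length affect strand association, the sketch below counts the hydrogen bonds in a duplex using the figures given above (two per AT pair, three per GC pair). The function names are hypothetical. The `wallace_tm` estimate uses the Wallace rule of thumb for short oligonucleotides (2 °C per A or T, 4 °C per G or C), which is not taken from this article and is only a crude approximation.

```python
def gc_fraction(seq):
    """Fraction of positions in `seq` that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def hydrogen_bonds(seq):
    """Hydrogen bonds holding a fully paired duplex of this sequence together."""
    seq = seq.upper()
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return 2 * at_pairs + 3 * gc_pairs


def wallace_tm(seq):
    """Rule-of-thumb melting temperature in degrees Celsius (assumption: short oligos only)."""
    seq = seq.upper()
    at_pairs = seq.count("A") + seq.count("T")
    gc_pairs = seq.count("G") + seq.count("C")
    return 2 * at_pairs + 4 * gc_pairs


print(hydrogen_bonds("TATAAT"))  # 12
print(hydrogen_bonds("GCGCGC"))  # 18
```

The AT-only Pribnow box sequence has fewer hydrogen bonds than a GC-only sequence of the same length, consistent with its strands being easier to pull apart.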
Sense and antisense
Further information: Sense (molecular biology)
DNA is copied into RNA by RNA polymerase enzymes that only work in the 5' to 3' direction.[10] A DNA sequence is called "sense" if its sequence is the same as that of a messenger RNA copy that is translated into protein. The sequence on the opposite strand is complementary to the sense sequence and is therefore called the "antisense" sequence. Both sense and antisense sequences can exist on different parts of the same strand of DNA. In both prokaryotes and eukaryotes, antisense sequences are transcribed, but the functions of these RNAs are not entirely clear.[11] One proposal is that antisense RNAs are involved in regulating gene expression through RNA-RNA base pairing.[12]
A few DNA sequences in prokaryotes and eukaryotes, and more in plasmids and viruses, blur the distinction made above between sense and antisense strands by having overlapping genes.[13] In these cases, some DNA sequences do double duty, encoding one protein when read 5' to 3' along one strand, and a second protein when read in the opposite direction (still 5' to 3') along the other strand. In bacteria, this overlap may be involved in the regulation of gene transcription,[14] while in viruses, overlapping genes increase the amount of information that can be encoded within the small viral genome.[15] Another way of reducing genome size is seen in some viruses that contain linear or circular single-stranded DNA as their genetic material.[16][17]
Supercoiling
Further information: DNA supercoil
DNA can be twisted like a rope in a process called DNA supercoiling. Normally, with DNA in its "relaxed" state a strand circles the axis of the double helix once every 10.4 base pairs, but if the DNA is twisted the strands become more tightly or more loosely wound.[18] If the DNA is twisted in the direction of the helix this is positive supercoiling and the bases are held more tightly together. If they are twisted in the opposite direction this is negative supercoiling and the bases come apart more easily. In nature, most DNA has slight negative supercoiling that is introduced by enzymes called topoisomerases.[19] These enzymes are also needed to relieve the twisting stresses introduced into DNA strands during processes such as transcription and DNA replication.[20]
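The degree of supercoiling can be put into numbers with a small sketch. It uses the figure of 10.4 base pairs per turn given above, together with two standard quantities that the article itself does not define and that are introduced here as assumptions: the linking number Lk of a closed circular DNA (how many times one strand winds around the other) and the superhelical density (Lk - Lk0) / Lk0, where Lk0 is the relaxed value. The function names are illustrative.

```python
BP_PER_TURN = 10.4  # relaxed B-form DNA, figure taken from the text


def relaxed_linking_number(n_bp):
    """Helical turns in a relaxed closed circle of `n_bp` base pairs (Lk0)."""
    return n_bp / BP_PER_TURN


def superhelical_density(lk, n_bp):
    """Relative over- or under-winding: negative means negative supercoiling."""
    lk0 = relaxed_linking_number(n_bp)
    return (lk - lk0) / lk0


# A hypothetical 5200 bp plasmid is relaxed at about 500 turns.
# With only 470 turns it is underwound, i.e. negatively supercoiled.
sigma = superhelical_density(470, 5200)
print(round(sigma, 3))  # -0.06
```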
From left to right, the structures of A, B and Z DNA.
Alternative double-helical structures
Further information: Mechanical properties of DNA
DNA exists in several possible conformations. The conformations so far identified are: A-DNA, B-DNA, C-DNA, D-DNA, E-DNA,[21] and Z-DNA. However, only A-DNA, B-DNA, and Z-DNA are believed to be found in nature. Which conformation DNA adopts depends on the sequence of the DNA, the amount and direction of supercoiling,[22] chemical modifications of the bases and also solution conditions, such as the concentration of metal ions and polyamines.[23] Of these three conformations, the "B" form described above is most common under the conditions found in cells. The two alternative double-helical forms of DNA differ in their geometry and dimensions.
The A form is a wider right-handed spiral, with a shallow and wide minor groove and a narrower and deeper major groove. The A form occurs under non-physiological conditions in dehydrated samples of DNA, while in the cell it may be produced in hybrid pairings of DNA and RNA strands.[24] Segments of DNA where the bases have been methylated may undergo a larger change in conformation and adopt the Z form. Here, the strands turn about the helical axis in a left-handed spiral, the opposite of the more common B form.[25]
Structure of a DNA quadruplex formed by telomere repeats. Produced from NDB UD0017.
Quadruplex structures
At the ends of the linear chromosomes are specialized regions of DNA called telomeres. The main function of these regions is to allow the cell to replicate chromosome ends using the enzyme telomerase, as normal DNA polymerases working on the lagging strand cannot copy the extreme 3' ends of their DNA templates.[26] If a chromosome lacked telomeres it would become shorter each time it was replicated. These specialized chromosome caps also help protect the DNA ends from exonucleases and stop the DNA repair systems in the cell from treating them as damage to be corrected.[27] In human cells, telomeres are usually lengths of single-stranded DNA containing several thousand repeats of a simple TTAGGG sequence.[28]
These guanine-rich sequences may stabilise chromosome ends by forming very unusual quadruplex structures. Here, four guanine bases form a flat plate, through hydrogen bonding, and these flat four-base units then stack on top of each other, to form a stable quadruplex.[29] These structures are often stabilized by chelation of a metal ion in the center of each four-base unit. The structure shown to the left is of a quadruplex formed by a DNA sequence containing four consecutive human telomere repeats. The single DNA strand forms a loop, with the sets of four bases stacking in a central quadruplex three plates deep. In the space at the center of the stacked bases are three chelated potassium ions.[30] Other structures can also be formed and the central set of four bases can come from either one folded strand, or several different parallel strands.
In addition to these stacked structures, telomeres also form large loop structures called telomere loops, or T-loops. Here, the single-stranded DNA curls around in a circle stabilized by telomere-binding proteins.[31] At the very end of the T-loop, the single-stranded telomere DNA is held onto a region of double-stranded DNA by the telomere strand disrupting the double-helical DNA and base pairing to one of the two strands. This triple-stranded structure is called a displacement loop or D-loop.[29]
Chemical modifications
Regulatory base modifications
Further information: DNA methylation
The expression of genes is influenced by modifications of the bases in DNA. In humans, the most common base modification is cytosine methylation to produce 5-methylcytosine. This modification reduces gene expression and is important in X-chromosome inactivation.[32] The level of methylation varies between organisms, with Caenorhabditis elegans lacking cytosine methylation, while vertebrates show high levels, with up to 1% of their DNA being 5-methylcytosine.[33] Unfortunately, the spontaneous deamination of 5-methylcytosine produces thymine, and methylated cytosines are therefore mutation hotspots.[34] Other base modifications include adenine methylation in bacteria and the glycosylation of uracil to produce the "J-base" in kinetoplastids.[35][36]
DNA damage
Further information: Mutation
Benzopyrene, the major mutagen in tobacco smoke, in an adduct to DNA. Produced from PDB 1JDG.
DNA can be damaged by many different sorts of mutagens. These include oxidizing agents, alkylating agents and also high-energy electromagnetic radiation such as ultraviolet light and X-rays. The type of DNA damage produced depends on the type of mutagen. For example, UV light mostly damages DNA by producing pyrimidine dimers, which are cross-links between adjacent pyrimidine bases in a DNA strand.[37] On the other hand, oxidants such as free radicals or hydrogen peroxide produce multiple forms of damage, including base modifications, particularly of guanosine, as well as double-strand breaks.[38] It has been estimated that in each human cell, about 500 bases suffer oxidative damage per day.[39][40] Of these oxidative lesions, the most damaging are double-strand breaks, as they can produce point mutations, insertions and deletions in the DNA sequence, as well as chromosomal translocations.[41]
Many mutagens intercalate into the space between two adjacent base pairs. These molecules are mostly polycyclic, aromatic, and planar molecules and include ethidium, proflavin, daunomycin, doxorubicin and thalidomide. DNA intercalators are used in chemotherapy to inhibit DNA replication in rapidly-growing cancer cells.[42] In order for an intercalator to fit between base pairs, the bases must separate, distorting the DNA strand by unwinding of the double helix. These structural modifications inhibit transcription and replication processes, causing both toxicity and mutations. As a result, DNA intercalators are often carcinogens, with benzopyrene diol epoxide, acridines, aflatoxin and ethidium bromide being well-known examples.[43][44]
Overview of biological functions
DNA contains the genetic information that allows living things to function, grow and reproduce. This information is held in the sequence of pieces of DNA called genes. Genetic information in genes is transmitted through complementary base pairing. For example, when a cell uses the information in a gene, the DNA sequence is copied into a complementary RNA sequence in a process called transcription. Usually, this RNA copy is then used to make a matching protein sequence in a process called translation. Alternatively, a cell may simply copy its genetic information in a process called DNA replication. The details of these functions are covered in other articles; here we focus on the interactions that happen in these processes between DNA and other molecules.
Transcription and translation
T7 RNA polymerase producing an mRNA (green) from a DNA template (red and blue). The protein is shown as a purple ribbon. Image derived from PDB 1MSW.
Further information: Genetic code, Transcription (genetics), Protein biosynthesis
A gene is a sequence of DNA that contains genetic information and can influence the phenotype of an organism. Within a gene, the sequence of bases along a DNA strand defines a messenger RNA sequence which then defines a protein sequence. The relationship between the nucleotide sequences of genes and the amino-acid sequences of proteins is determined by the rules of translation, known collectively as the genetic code. The genetic code consists of three-letter 'words' (called codons) formed from a sequence of three nucleotides (e.g. ACT, CAG, TTT). In transcription, the codons of a gene are copied into messenger RNA by RNA polymerase. This RNA copy is then decoded by a ribosome that reads the RNA sequence by base-pairing the messenger RNA to transfer RNA, which carries amino acids. There are 64 possible codons (4 bases in each of 3 positions, 4^3 = 64) that encode 20 amino acids. Most amino acids, therefore, have more than one possible codon. There are also three 'stop' or 'nonsense' codons signifying the end of the coding region; these are the UAA, UGA and UAG codons.
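The codon arithmetic can be checked with a short sketch that enumerates every three-letter word over the four RNA bases and pairs it with the standard genetic code. The compact `AMINO_ACIDS` string is the conventional one-letter listing of the standard code in U, C, A, G order, with `*` marking a stop codon. The names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes an uppercase messenger RNA string read from its first base.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4 * 4 * 4 codons to its amino acid (or '*' for stop).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Read `mrna` codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # 64
print(translate("AUGGCCUAA"))  # MA
```

Sixty-one codons share twenty amino acids, which is why most amino acids have more than one codon.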
Replication
Further information: DNA replication
Cell division is essential for an organism to grow, but when a cell divides it must replicate the DNA in its genome so that the two daughter cells have the same genetic information as their parent. The double-stranded structure of DNA provides a simple mechanism for DNA replication. Here, the two strands are separated and then each strand's complementary DNA sequence is recreated by an enzyme called DNA polymerase. This enzyme makes the complementary strand by finding the correct base through complementary base pairing, and bonding it onto the original strand. All such DNA polymerases extend a DNA strand in a 5' to 3' direction.[45] In this way, the base on the old strand dictates which base appears on the new strand, and the cell ends up with a perfect copy of its DNA.
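A toy model of this mechanism, given only as a sketch: a duplex is represented as a pair of strings, each written 5' to 3', and `replicate` separates them and rebuilds the missing partner of each old strand by base pairing. The representation and the function names are assumptions made for illustration; real replication involves many enzymes and is not error-free.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(template):
    """New strand for `template`, written 5' to 3' (antiparallel, so reversed)."""
    return "".join(PAIRS[base] for base in reversed(template))


def replicate(duplex):
    """Return two daughter duplexes, each keeping one old strand and one new strand."""
    top, bottom = duplex
    daughter_one = (top, partner_strand(top))        # old top, new bottom
    daughter_two = (partner_strand(bottom), bottom)  # new top, old bottom
    return daughter_one, daughter_two


parent = ("ATGCCG", partner_strand("ATGCCG"))
first, second = replicate(parent)
print(first == parent and second == parent)  # True
```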
Genes and genomes
Further information: Cell nucleus, Gene, Non-coding DNA
DNA is located in the cell nucleus of eukaryotes, as well as small amounts in mitochondria and chloroplasts. In prokaryotes, the DNA is held within an irregularly shaped body in the cytoplasm called the nucleoid.[46] The DNA is usually in linear chromosomes in eukaryotes, and circular chromosomes in prokaryotes. In the human genome, there are approximately 3 billion base pairs of DNA arranged into 46 chromosomes.[47] The genetic information in a genome is held within genes. A gene is a unit of heredity and is a region of DNA that influences a particular characteristic in an organism. Genes contain an open reading frame that can be transcribed, as well as regulatory sequences such as promoters and enhancers, which control the expression of the open reading frame.
In many species, only a small fraction of the total sequence of the genome encodes protein. For example, only about 1.5% of the human genome consists of protein-coding exons, with over 50% of human DNA consisting of non-coding repetitive sequences.[48] The reasons for the presence of so much non-coding DNA in eukaryotic genomes and the extraordinary differences in genome size ("C-value") among species represent a long-standing puzzle known as the "C-value enigma".[49]
Some non-coding DNA sequences play structural roles in chromosomes. Telomeres and centromeres typically contain few (if any) genes, but are important for the function and stability of chromosomes.[27][50] An abundant form of non-coding DNA in humans are pseudogenes, which are copies of genes that have been disabled by mutation.[51] These sequences are usually just molecular fossils, although they can occasionally serve as raw genetic material for the creation of new genes through the process of gene duplication and divergence.[52]
Interactions with proteins
All the functions of DNA depend on interactions with proteins. These protein interactions can either be non-specific, or the protein can only bind to a particular DNA sequence. Enzymes can also bind to DNA and of these, the polymerases that copy the DNA base sequence in transcription and DNA replication are particularly important.
DNA-binding proteins
Interaction of DNA with histones (shown in white, top). These proteins' basic amino acids (below left, blue) bind to the acidic phosphate groups on DNA (below right, red).
Structural proteins that bind DNA are well-understood examples of non-specific DNA-protein interactions. Within chromosomes, DNA is held in complexes between DNA and structural proteins. These proteins organize the DNA into a compact structure called chromatin. In eukaryotes this structure involves DNA binding to a complex of small basic proteins called histones, while in prokaryotes multiple types of proteins are involved.[53] The histones form a disk-shaped complex called a nucleosome, which contains two complete turns of double-stranded DNA wrapped around its surface. These non-specific interactions are formed through basic residues in the histones making ionic bonds to the acidic sugar-phosphate backbone of the DNA, and are therefore largely independent of the base sequence.[54] Chemical modifications of these basic amino acid residues include methylation, phosphorylation and acetylation.[55] These chemical changes alter the strength of the interaction between the DNA and the histones, making the DNA more or less accessible to transcription factors and changing the rate of transcription.[56] Other non-specific DNA-binding proteins found in chromatin include the high-mobility group proteins, which bind preferentially to bent or distorted DNA.[57] These proteins are important in bending arrays of nucleosomes and arranging them into more complex chromatin structures.[58]
A distinct group of DNA-binding proteins are the single-stranded DNA-binding proteins, which specifically bind single-stranded DNA. In humans, replication protein A is the best-characterised member of this family and is essential for most processes where the double helix is separated, including DNA replication, recombination and DNA repair.[59] These binding proteins seem to stabilize single-stranded DNA and protect it from forming stem loops or being degraded by nucleases.
The lambda repressor helix-turn-helix transcription factor bound to its DNA target. Produced from PDB 1LMB.
In contrast, other proteins have evolved to specifically bind particular DNA sequences. The most intensively studied of these are the various classes of transcription factors. These proteins control gene transcription. Each one of these proteins binds to one particular set of DNA sequences and thereby activates or inhibits the transcription of genes with these sequences close to their promoters. The transcription factors do this in two ways. Firstly, they can bind the RNA polymerase responsible for transcription, either directly or through other mediator proteins; this locates the polymerase at the promoter and allows it to begin transcription.[60] Alternatively, transcription factors can bind enzymes that modify the histones at the promoter; this changes the accessibility of the DNA template to the polymerase.[61]
As these DNA targets can occur throughout an organism's genome, changes in the activity of one type of transcription factor can affect thousands of genes.[62] Consequently, these proteins are often the targets of the signal transduction processes that mediate responses to environmental changes or cellular differentiation and development. The specificity of these transcription factors' interactions with DNA comes from the proteins making multiple contacts to the edges of the DNA bases, allowing them to "read" the DNA sequence. Most of these base interactions are made in the major groove, where the bases are most accessible.[63]
The restriction enzyme EcoRV (green) in a complex with its substrate DNA. Created from PDB 1RVA.
DNA-modifying enzymes
Nucleases and ligases
Nucleases are enzymes that cut DNA strands by catalyzing the hydrolysis of the phosphodiester bonds. Nucleases that hydrolyse nucleotides from the ends of DNA strands are called exonucleases, while endonucleases cut within strands. The most frequently-used nucleases in molecular biology are the restriction endonucleases, which cut DNA at specific sequences. For instance, the EcoRV enzyme shown to the left recognizes the 6-base sequence 5'-GAT|ATC-3' and makes a cut at the vertical line. In nature, these enzymes protect bacteria against phage infection by digesting the phage DNA when it enters the bacterial cell, acting as part of the restriction modification system. In technology, these sequence-specific nucleases are used in molecular cloning and DNA fingerprinting.
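The EcoRV example can be mimicked with a simple string operation. The sketch below is an illustration only: `digest` is a hypothetical helper that looks for the recognition sequence GATATC on one strand and cuts after the third base, as indicated by the vertical line above. Because the site reads the same on both strands and the cut falls in its middle, modelling one strand is enough for this enzyme; that simplification would not hold for enzymes that leave staggered ends.

```python
def digest(seq, site="GATATC", cut_offset=3):
    """Split `seq` at every occurrence of `site`, cutting `cut_offset` bases into it."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])  # remainder after the last cut
    return fragments


print(digest("AAGATATCTT"))  # ['AAGAT', 'ATCTT']
```

A sequence without the site comes back as a single uncut fragment.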
Enzymes called DNA ligases can rejoin cut or broken DNA strands. Ligases are particularly important in lagging strand DNA replication, as they join together the short segments of DNA produced at the replication fork into a complete copy of the DNA template.
Topoisomerases and helicases
Topoisomerases are enzymes with both nuclease and ligase activity. These proteins change the amount of supercoiling in DNA. Some of these enzymes work by cutting the DNA helix and allowing one section to rotate, thereby reducing its level of supercoiling; the enzyme then seals the DNA break.[19] Other types of these enzymes are capable of cutting one DNA helix and then passing a second strand of DNA through this break, before rejoining the helix.[64] Topoisomerases are required for many processes involving DNA, such as DNA replication and transcription.[20]
Helicases are a type of molecular motor protein. They use the chemical energy in adenosine triphosphate to break the hydrogen bonds between bases and unwind a DNA double helix into single strands.[65] These enzymes are essential for most processes where enzymes need to access the bases, such as DNA replication and transcription.
Polymerases
Polymerases are enzymes that synthesise polynucleotide chains from nucleoside triphosphates. They function by adding nucleotides onto the 3' hydroxyl group of the previous nucleotide in the DNA strand. As a consequence, all polymerases work in a 5' to 3' direction.[10] In the active site of these enzymes, the nucleoside triphosphate substrate base-pairs to a single-stranded polynucleotide template: this allows polymerases to accurately synthesise the complementary strand of this template. Polymerases are classified according to the type of template they use.
In DNA replication, a DNA-dependent DNA polymerase makes a DNA copy of a DNA sequence. Accuracy is vital in this process, so many of these polymerases have a proofreading activity. Here, the polymerase recognizes the occasional mistakes in the synthesis reaction by the lack of base pairing between the mismatched nucleotides. If a mismatch is detected, a 3' to 5' exonuclease activity is activated and the incorrect base removed.[66] In most organisms DNA polymerases function in a large complex called the replisome that contains multiple accessory subunits, such as the DNA clamp or helicases.[67]
RNA-dependent DNA polymerases are a specialised class of polymerases that copy the sequence of an RNA strand into DNA. They include reverse transcriptase, which is a viral enzyme involved in the infection of cells by retroviruses, and telomerase, which is required for the replication of telomeres.[68][26] Telomerase is an unusual polymerase because it contains its own RNA template as part of its structure.[27]
Transcription is carried out by a DNA-dependent RNA polymerase that copies the sequence of a DNA strand into RNA. To begin transcribing a gene, the RNA polymerase binds to a sequence of DNA called a promoter and separates the DNA strands. It then copies the gene sequence into a messenger RNA transcript until it reaches a region of DNA called the terminator, where it halts and detaches from the DNA. As with human DNA-dependent DNA polymerases, RNA polymerase II, the enzyme that transcribes most of the genes in the human genome, operates as part of a large protein complex with multiple regulatory and accessory subunits.[69]
Genetic recombination
Structure of the Holliday junction intermediate in genetic recombination. The four separate DNA strands are coloured red, blue, green and yellow. Produced from PDB 1M6G.
Further information: Genetic recombination
Recombination involves the breakage and rejoining of two chromosomes (M and F) to produce two re-arranged chromosomes (C1 and C2).
A DNA helix does not usually interact with other segments of DNA, and in human cells the different chromosomes even occupy separate areas in the nucleus called "chromosome territories".[70] This physical separation of different chromosomes is important for the ability of DNA to function as a stable repository for information, as one of the few times chromosomes interact is when they recombine. Recombination is when two DNA helices break, swap a section and then rejoin. In eukaryotes this process usually occurs during meiosis, when homologous chromosomes are paired together in the center of the cell. Recombination allows chromosomes to exchange genetic information and produces new combinations of genes, which increases the efficiency of selection and can be important in the rapid evolution of new proteins.[71] Genetic recombination can also be involved in DNA repair, particularly in the cell's response to double-strand breaks.[72]
The most common form of recombination is homologous recombination, where the two chromosomes involved share very similar sequences. Non-homologous recombination can be damaging to cells, as it can produce chromosomal translocations and genetic abnormalities. The recombination reaction is catalyzed by enzymes known as recombinases, such as Cre recombinase.[73] In the first step, the recombinase creates a nick in one strand of a DNA double helix, allowing the nicked strand to pull apart from its complementary strand and anneal to one strand of the double helix on the opposite chromatid. A second nick allows the strand in the second chromatid to pull apart and anneal to the remaining strand in the first helix, forming a structure known as a cross-strand exchange or a Holliday junction. The Holliday junction is a tetrahedral junction structure which can be moved along the pair of chromosomes, swapping one strand for another. The recombination reaction is then halted by cleavage of the junction and re-ligation of the released DNA.[74]
Uses in technology
Forensics
Further information: Genetic fingerprinting
Forensic scientists can use DNA in blood, semen, skin, saliva or hair at a crime scene to identify a perpetrator. This process is called genetic fingerprinting or DNA profiling. In DNA profiling, the lengths of variable sections of repetitive DNA, such as short tandem repeats and minisatellites, are compared between people. DNA profiling was developed in 1984 by British geneticist Sir Alec Jeffreys.[75] It was first used in forensic science to convict Colin Pitchfork in 1988 in the Enderby murders case.[76] People convicted of certain types of crimes may be required to provide a sample of DNA for a database. This has helped investigators solve old cases where only a DNA sample was obtained from the scene. This method is usually an extremely reliable technique for identifying a criminal.[77] However, identification can be complicated if the scene is contaminated with DNA from several people.[78]
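As a rough illustration of comparing repeat lengths between people, the sketch below counts the longest uninterrupted run of a short tandem repeat motif in two sequences. The motif `GATA`, the flanking bases and the function name `longest_repeat_run` are invented for this example. Real profiling measures fragment lengths at standardized loci in the laboratory rather than scanning raw strings like this.

```python
def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Count consecutive copies of the motif beginning at this position.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


# Hypothetical sequences at one locus from two people.
person_a = "TTC" + "GATA" * 7 + "CCG"
person_b = "TTC" + "GATA" * 11 + "CCG"

runs = (longest_repeat_run(person_a, "GATA"), longest_repeat_run(person_b, "GATA"))
print(runs)  # (7, 11)
```

Two samples with different repeat counts at the same locus, as here, would give fragments of different lengths.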
Bioinformatics
Further information: Bioinformatics
Bioinformatics involves the manipulation, searching, and data mining of DNA sequence data. The development of techniques to store and search DNA sequences has led to widely applied advances in computer science, especially string searching algorithms and database theory. String searching or matching algorithms, which find an occurrence of a sequence of letters inside a larger sequence of letters, were developed to search for specific sequences of nucleotides.[79] In other applications such as text editors, even simple algorithms for this problem usually suffice, but DNA sequences cause these algorithms to exhibit near-worst-case behaviour due to their small number of distinct characters. The related problem of sequence alignment aims to identify homologous sequences and locate the specific mutations that make them distinct. These techniques, especially multiple sequence alignment, are used in studying phylogenetic relationships and protein function.
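One well-known way to avoid the worst-case behaviour of naive matching on a four-letter alphabet is the Knuth-Morris-Pratt algorithm, which never re-reads text characters. The sketch below is a minimal version of that algorithm. It is offered as an example of a linear-time search, not as the specific method the cited work refers to, and the names `build_failure_table` and `find_all` and the sample sequence are illustrative.

```python
def build_failure_table(pattern: str) -> list[int]:
    """For each prefix, the length of its longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) occurrence of pattern."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern instead of rewinding the text.
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits


genome = "GATTACAGATTACATTACA"
print(find_all(genome, "TTACA"))  # [2, 9, 14]
```

Because the text index only moves forward, the running time grows linearly with the combined length of the text and the pattern, however repetitive the sequence is.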
DNA and computation
Further information: DNA computing
DNA was first used in computing to solve the directed Hamiltonian path problem, an NP-complete problem.[80] DNA computing is advantageous over electronic computers in power use, space use, and efficiency, due to its ability to compute in a highly parallel fashion (see parallel computing). A number of other problems, including simulation of various abstract machines, the boolean satisfiability problem, and the bounded version of the travelling salesman problem, have since been analysed using DNA computing.[81] Due to its compactness, DNA also has a theoretical role in cryptography, where in particular it allows unbreakable one-time pads to be efficiently constructed and used.[82]
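To make the directed Hamiltonian path problem concrete, the sketch below solves a tiny instance in ordinary software by generating every candidate ordering of the vertices and keeping those that follow directed edges. This is a conventional brute-force program, not a description of any DNA protocol. The graph and the function name `hamiltonian_paths` are invented for illustration.

```python
from itertools import permutations


def hamiltonian_paths(vertices, edges, start, end):
    """Yield every path from start to end that visits each vertex exactly once."""
    edge_set = set(edges)
    middle = [v for v in vertices if v not in (start, end)]
    for order in permutations(middle):
        path = (start, *order, end)
        # Keep the candidate only if each consecutive pair is a directed edge.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            yield path


# A small hypothetical directed graph.
vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (2, 1), (1, 3), (3, 1)]

print(list(hamiltonian_paths(vertices, edges, start=0, end=4)))
# [(0, 1, 2, 3, 4), (0, 2, 1, 3, 4)]
```

The number of candidate orderings grows factorially with the number of vertices, which is why checking very many candidates at once is attractive for this kind of problem.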
History and anthropology
Further information: Phylogenetics
Because DNA collects mutations over time, which are then inherited, it contains historical information. By comparing DNA sequences, geneticists can infer the evolutionary history of organisms, their phylogeny.[83] This field of phylogenetics is a powerful tool in evolutionary biology. If DNA sequences within a species are compared, population geneticists can learn the history of particular populations. This can be used in studies ranging from ecological genetics to anthropology (for example, DNA evidence is being used to try to identify the Ten Lost Tribes of Israel).[84][85]
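A very simple way to compare sequences is to count the fraction of aligned positions at which they differ, so that sequences with fewer differences are taken to be more closely related. The sketch below assumes the sequences are already aligned to equal length. The taxon names and sequences are made up, and real phylogenetic methods use far more elaborate models than this raw proportion.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# Hypothetical aligned sequences from three taxa.
aligned = {
    "taxon_x": "ACGTACGTAC",
    "taxon_y": "ACGTACGTAT",
    "taxon_z": "ACCTAGGTTT",
}

distances = {
    (a, b): p_distance(aligned[a], aligned[b])
    for a, b in combinations(aligned, 2)
}
closest = min(distances, key=distances.get)
print(closest)  # ('taxon_x', 'taxon_y')
```

In this toy data set, taxon_x and taxon_y differ at one position in ten, so they would be grouped together before either is joined with taxon_z.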
DNA has also been used to look at modern family relationships, such as establishing family relationships between the descendants of Sally Hemings and Thomas Jefferson. This usage is closely related to the use of DNA in criminal investigations detailed above. Indeed, some criminal investigations have been solved when DNA from crime scenes has matched relatives of the guilty individual.[86]
Proteins are large organic compounds made of amino acids arranged in a linear chain and joined together between the carboxyl carbon of one amino acid and the amine nitrogen of another. This bond is called a peptide bond. The sequence of amino acids in a protein is defined by a gene and encoded in the genetic code. Although this genetic code specifies 20 "standard" amino acids, the residues in a protein are often chemically altered in post-translational modification: either before the protein can function in the cell, or as part of control mechanisms. Proteins can also work together to achieve a particular function, and they often associate to form stable complexes.
Like other biological macromolecules such as polysaccharides and nucleic acids, proteins are essential parts of all living organisms and participate in every process within cells. Many proteins are enzymes that catalyze biochemical reactions, and are vital to metabolism. Other proteins have structural or mechanical functions, such as the proteins in the cytoskeleton, which forms a system of scaffolding that maintains cell shape. Proteins are also important in cell signaling, immune responses, cell adhesion, and the cell cycle. Protein is also a necessary component in our diet, since animals cannot synthesise all the amino acids and must obtain essential amino acids from food. Through the process of digestion, animals break down ingested protein into free amino acids that can be used for protein synthesis.
The name protein comes from the Greek πρώτα ("prota"), meaning "of primary importance". Proteins were first described and named by Jöns Jakob Berzelius in 1838. However, their central role in living organisms was not fully appreciated until 1926, when James B. Sumner showed that the enzyme urease was a protein. Among the first proteins to be characterized in detail were insulin and myoglobin: the amino acid sequence of insulin was determined by Sir Frederick Sanger, who won a 1958 Nobel Prize for it, and the structure of myoglobin was solved by Max Perutz and Sir John Cowdery Kendrew in 1958.[1] Both proteins' three-dimensional structures were amongst the first determined by X-ray diffraction analysis; the myoglobin structure won the Nobel Prize in Chemistry for its discoverers.[2]
Biochemistry
Main articles: Amino acid and peptide bond
Resonance structures of the peptide bond that links individual amino acids to form a protein polymer.
Section of a protein structure showing serine and alanine residues linked together by peptide bonds. Carbons are shown in white and hydrogens are omitted for clarity.
Proteins are linear polymers built from 20 different L-alpha-amino acids. All amino acids share common structural features including an alpha carbon to which an amino group, a carboxyl group, and a variable side chain are bonded. Only proline differs from this basic structure, as it contains an unusual ring to the N-end amine group, which forces the CO-NH amide sequence into a fixed conformation.[3] The side chains of the standard amino acids, detailed in the list of standard amino acids, have varying chemical properties that produce proteins' three-dimensional structure and are therefore critical to protein function. The amino acids in a polypeptide chain are linked by peptide bonds formed in a dehydration reaction. Once linked in the protein chain, an individual amino acid is called a residue and the linked series of carbon, nitrogen, and oxygen atoms are known as the main chain or protein backbone. The peptide bond has two resonance forms that contribute some double bond character and inhibit rotation around its axis, so that the alpha carbons are roughly coplanar. The other two dihedral angles in the peptide bond determine the local shape assumed by the protein backbone.
Due to the chemical structure of the individual amino acids, the protein chain has directionality. The end of the protein with a free carboxyl group is known as the C-terminus or carboxy terminus, while the end with a free amino group is known as the N-terminus or amino terminus.
There is some ambiguity between the usage of the words protein, polypeptide, and peptide. Protein is generally used to refer to the complete biological molecule in a stable conformation, while peptide is generally reserved for short amino acid oligomers often lacking a stable 3-dimensional structure. However, the boundary between the two is ill-defined and usually lies near 20-30 residues.[4] Polypeptide can refer to any single linear chain of amino acids, usually regardless of length, but often implies an absence of a single defined conformation.
Synthesis
Main article: Protein biosynthesis
Proteins are assembled from amino acids using information encoded in genes. Each protein has its own unique amino acid sequence that is specified by the nucleotide sequence of the gene encoding this protein. The genetic code is a set of three-nucleotide sets called codons, and each three-nucleotide combination stands for an amino acid; for example, ATG stands for methionine. Because DNA contains four nucleotides, the total number of possible codons is 64 (4 × 4 × 4); hence, there is some redundancy in the genetic code and some amino acids are specified by more than one codon. Genes encoded in DNA are first transcribed into pre-messenger RNA (mRNA) by proteins such as RNA polymerase. Most organisms then process the pre-mRNA (also known as a primary transcript) using various forms of post-transcriptional modification to form the mature mRNA, which is then used as a template for protein synthesis by the ribosome. In prokaryotes the mRNA may either be used as soon as it is produced, or be bound by a ribosome after having moved away from the nucleoid. In contrast, eukaryotes make mRNA in the cell nucleus and then translocate it across the nuclear membrane into the cytoplasm, where protein synthesis then takes place. The rate of protein synthesis is higher in prokaryotes than eukaryotes and can reach up to 20 amino acids per second.[5]
The process of synthesizing a protein from an mRNA template is known as translation. The mRNA is loaded onto the ribosome and is read three nucleotides at a time by matching each codon to its base pairing anticodon located on a transfer RNA molecule, which carries the amino acid corresponding to the codon it recognizes. The enzyme aminoacyl tRNA synthetase "charges" the tRNA molecules with the correct amino acids. The growing polypeptide is often termed the nascent chain. Proteins are always biosynthesized from N-terminus to C-terminus.
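The codon-by-codon reading described above can be sketched in a few lines. The example builds the 64-entry table of the standard genetic code and reads a coding sequence three nucleotides at a time from N-terminus to C-terminus. As a simplification, it reads the DNA coding strand directly (using T rather than U) and skips the mRNA and tRNA steps. The function name `translate` and the input sequence are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Read codons from the first base until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        residue = CODON_TABLE[coding_dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(len(CODON_TABLE))              # 64
print(CODON_TABLE["ATG"])            # M
print(translate("ATGGCCAAGTGGTAA"))  # MAKW
```

The table maps 64 codons onto 20 amino acids plus a stop signal, which is the redundancy noted in the text.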
The size of a synthesized protein can be measured by the number of amino acids it contains and by its total molecular mass, which is normally reported in units of daltons (synonymous with atomic mass units), or the derivative unit kilodalton (kDa). Yeast proteins are on average 466 amino acids long and 53 kDa in mass.[4] The largest known proteins are the titins, a component of the muscle sarcomere, with a molecular mass of almost 3,000 kDa and a total length of almost 27,000 amino acids.[6]
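The two ways of expressing protein size can be related with simple arithmetic. Using only the approximate figures quoted above, the sketch below derives the average mass per residue and applies it to estimate the mass of a chain of a given length. The helper `estimate_mass_kda` and the 300-residue example are illustrative, and the estimate is only a rough guide because residue masses vary with composition.

```python
# Figures quoted in the text (approximate).
yeast_mass_da = 53_000       # average yeast protein, 53 kDa
yeast_length = 466           # residues
titin_mass_da = 3_000_000    # "almost 3,000 kDa"
titin_length = 27_000        # "almost 27,000" residues

yeast_per_residue = yeast_mass_da / yeast_length
titin_per_residue = titin_mass_da / titin_length

print(round(yeast_per_residue, 1))  # 113.7
print(round(titin_per_residue, 1))  # 111.1


def estimate_mass_kda(length: int, da_per_residue: float) -> float:
    """Rough molecular mass in kDa for a chain of the given length."""
    return length * da_per_residue / 1000


# Hypothetical 300-residue protein, using the yeast-derived average.
print(round(estimate_mass_kda(300, yeast_per_residue), 1))
```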
Chemical synthesis
Short proteins can also be synthesized chemically in the laboratory by a family of methods known as peptide synthesis, which rely on organic synthesis techniques such as chemical ligation to produce peptides in high yield. Chemical synthesis allows for the introduction of non-natural amino acids into polypeptide chains, such as attachment of fluorescent probes to amino acid side chains. These methods are useful in laboratory biochemistry and cell biology, though generally not for commercial applications. Chemical synthesis is inefficient for polypeptides longer than about 300 amino acids, and the synthesized proteins may not readily assume their native tertiary structure. Most chemical synthesis methods proceed from C-terminus to N-terminus, opposite the biological reaction.
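One plausible reason stepwise synthesis becomes inefficient for long chains is that losses compound at every coupling step. The sketch below assumes a constant per-step yield, which is a simplification. The 99% figure is hypothetical and not taken from the text, and actual yields depend on the residues and the method used.

```python
def overall_yield(chain_length: int, step_yield: float) -> float:
    """Fraction of chains that are full length after chain_length - 1 couplings."""
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    return step_yield ** (chain_length - 1)


# Hypothetical 99% yield at every coupling step.
for n in (10, 50, 300):
    print(n, f"{overall_yield(n, 0.99):.3f}")
```

Under this assumption the full-length fraction falls off geometrically with chain length, so even a small per-step loss leaves little full-length product for chains of several hundred residues.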
Structure of proteins
Main article: Protein structure
Three possible representations of the three-dimensional structure of the protein triose phosphate isomerase. Left: all-atom representation colored by atom type. Middle: "cartoon" representation illustrating the backbone conformation, colored by secondary structure. Right: Solvent-accessible surface representation colored by residue type (acidic residues red, basic residues blue, polar residues green, nonpolar residues white).
Most proteins fold into unique 3-dimensional structures. The shape into which a protein naturally folds is known as its native state. Although many proteins can fold unassisted simply through the structural propensities of their component amino acids, others require the aid of molecular chaperones to efficiently fold to their native states. Biochemists often refer to four distinct aspects of a protein's structure:
Primary structure: the amino acid sequence
Secondary structure: regularly repeating local structures stabilized by hydrogen bonds. The most common examples are the alpha helix and beta sheet.[7] Because secondary structures are local, many regions of different secondary structure can be present in the same protein molecule.
Tertiary structure: the overall shape of a single protein molecule; the spatial relationship of the secondary structures to one another. Tertiary structure is generally stabilized by nonlocal interactions, most commonly the formation of a hydrophobic core, but also through salt bridges, hydrogen bonds, disulfide bonds, and even post-translational modifications. The term "tertiary structure" is often used as synonymous with the term fold.
Quaternary structure: the shape or structure that results from the interaction of more than one protein molecule, usually called protein subunits in this context, which function as part of the larger assembly or protein complex.
In addition to these levels of structure, proteins may shift between several related structures in performing their biological function. In the context of these functional rearrangements, these tertiary or quaternary structures are usually referred to as "conformations," and transitions between them are called conformational changes. Such changes are often induced by the binding of a substrate molecule to an enzyme's active site, or the physical region of the protein that participates in chemical catalysis.
Molecular surface of several proteins showing their comparative sizes. From left to right are: Antibody (IgG), Hemoglobin, Insulin (a hormone), Adenylate kinase (an enzyme), and Glutamine synthetase (an enzyme).
Proteins can be informally divided into three main classes, which correlate with typical tertiary structures: globular proteins, fibrous proteins, and membrane proteins. Almost all globular proteins are soluble and many are enzymes. Fibrous proteins are often structural; membrane proteins often serve as receptors or provide channels for polar or charged molecules to pass through the cell membrane.
Intramolecular hydrogen bonds within proteins that are poorly shielded from water attack, and hence promote their own dehydration, form a special case and are called dehydrons.
Structure determination
Discovering the tertiary structure of a protein, or the quaternary structure of its complexes, can provide important clues about how the protein performs its function. Common experimental methods of structure determination include X-ray crystallography and NMR spectroscopy, both of which can produce information at atomic resolution. Cryoelectron microscopy is used to produce lower-resolution structural information about very large protein complexes, including assembled viruses;[7] a variant known as electron crystallography can also produce high-resolution information in some cases, especially for two-dimensional crystals of membrane proteins.[8] Solved structures are usually deposited in the Protein Data Bank (PDB), a freely available resource from which structural data about thousands of proteins can be obtained in the form of Cartesian coordinates for each atom in the protein.
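To show what per-atom Cartesian coordinates look like in practice, the sketch below parses a single fixed-width `ATOM` record of the kind used in the legacy PDB text format. The record is made up and is assembled from labelled pieces so the column layout is explicit. The `Atom` type and `parse_atom_record` are illustrative names, and a real parser should handle many more record types and edge cases.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Parse one fixed-width ATOM record (column numbers in comments are 1-based)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


# Made-up record, not taken from a real entry.
sample = (
    "ATOM  "    # columns 1-6: record name
    "    1"     # columns 7-11: atom serial number
    " "         # column 12: unused
    " N  "      # columns 13-16: atom name
    " "         # column 17: alternate location
    "MET"       # columns 18-20: residue name
    " "         # column 21: unused
    "A"         # column 22: chain identifier
    "   1"      # columns 23-26: residue number
    "    "      # columns 27-30: insertion code and padding
    "  27.340"  # columns 31-38: x in angstroms
    "  24.430"  # columns 39-46: y in angstroms
    "   2.614"  # columns 47-54: z in angstroms
    "  1.00"    # columns 55-60: occupancy
    "  9.67"    # columns 61-66: temperature factor
)

atom = parse_atom_record(sample)
print(atom)
```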
There are many more known gene sequences than there are solved protein structures. Further, the set of solved structures is biased toward those proteins that can be easily subjected to the experimental conditions required by one of the major structure determination methods. In particular, globular proteins are comparatively easy to crystallize in preparation for X-ray crystallography, which remains the oldest and most common structure determination technique. Membrane proteins, by contrast, are difficult to crystallize and are underrepresented in the PDB.[9] Structural genomics initiatives have attempted to remedy these deficiencies by systematically solving representative structures of major fold classes. Protein structure prediction methods attempt to provide a means of generating a plausible structure for proteins whose structures have not been experimentally determined.
Cellular functions
Proteins are the chief actors within the cell, said to be carrying out the duties specified by the information encoded in genes.[4] With the exception of certain types of RNA, most other biological molecules are relatively inert elements upon which proteins act. Proteins make up half the dry weight of an E. coli cell, while other macromolecules such as DNA and RNA make up only 3% and 20% respectively.[10] The total complement of proteins expressed in a particular cell or cell type at a given time point or experimental condition is known as its proteome.
The enzyme hexokinase is shown as a simple ball-and-stick molecular model. To scale in the top right-hand corner are its two substrates, ATP and glucose.
The chief characteristic of proteins that enables them to carry out their diverse cellular functions is their ability to bind other molecules specifically and tightly. The region of the protein responsible for binding another molecule is known as the binding site and is often a depression or "pocket" on the molecular surface. This binding ability is mediated by the tertiary structure of the protein, which defines the binding site pocket, and by the chemical properties of the surrounding amino acids' side chains. Protein binding can be extraordinarily tight and specific; for example, the ribonuclease inhibitor protein binds to human angiogenin with a sub-femtomolar dissociation constant (<10⁻¹⁵ M) but does not bind at all to its amphibian homolog onconase (>1 M). Extremely minor chemical changes such as the addition of a single methyl group to a binding partner can sometimes suffice to nearly eliminate binding; for example, the aminoacyl tRNA synthetase specific to the amino acid valine discriminates against the very similar side chain of the amino acid isoleucine.
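To give a feel for what such different dissociation constants mean, the sketch below uses the standard expression for simple one-to-one binding at equilibrium, in which the bound fraction is [L] / (Kd + [L]) for free ligand concentration [L]. The 1 nM ligand concentration is a hypothetical value chosen for illustration, and the two Kd values are the bounds quoted above rather than measured constants.

```python
def fraction_bound(ligand_molar: float, kd_molar: float) -> float:
    """Equilibrium fraction of protein bound, for simple 1:1 binding."""
    return ligand_molar / (kd_molar + ligand_molar)


ligand = 1e-9  # hypothetical 1 nM free ligand concentration

tight = fraction_bound(ligand, 1e-15)  # Kd at the sub-femtomolar bound
weak = fraction_bound(ligand, 1.0)     # Kd at the 1 M bound

print(f"{tight:.6f}")  # 0.999999
print(f"{weak:.1e}")   # 1.0e-09
```

At the same ligand concentration, the tight binder is essentially saturated while the weak one is essentially unbound.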
Proteins can bind to other proteins as well as to small-molecule substrates. When proteins bind specifically to other copies of the same molecule, they can oligomerize to form fibrils; this process occurs often in structural proteins that consist of globular monomers that self-associate to form rigid fibers. Protein-protein interactions also regulate enzymatic activity, control progression through the cell cycle, and allow the assembly of large protein complexes that carry out many closely related reactions with a common biological function. Proteins can also bind to, or even be integrated into, cell membranes. The ability of binding partners to induce conformational changes in proteins allows the construction of enormously complex signaling networks.
Enzymes
Main article: Enzyme
The best-known role of proteins in the cell is their duty as enzymes, which catalyze chemical reactions. Enzymes are usually highly specific catalysts that accelerate only one or a few chemical reactions. Enzymes effect most of the reactions involved in metabolism and catabolism as well as DNA replication, DNA repair, and RNA synthesis. Some enzymes act on other proteins to add or remove chemical groups in a process known as post-translational modification. About 4,000 reactions are known to be catalyzed by enzymes.[11] The rate acceleration conferred by enzymatic catalysis is often enormous - as much as 10¹⁷-fold increase in rate over the uncatalyzed reaction in the case of orotate decarboxylase.[12]
The molecules bound and acted upon by enzymes are known as substrates. Although enzymes can consist of hundreds of amino acids, it is usually only a small fraction of the residues that come in contact with the substrate and an even smaller fraction - 3-4 residues on average - that are directly involved in catalysis.[13] The region of the enzyme that binds the substrate and contains the catalytic residues is known as the active site.
Cell signalling and ligand transport
A mouse antibody against cholera that binds a carbohydrate antigen.
Many proteins are involved in the process of cell signaling and signal transduction. Some proteins, such as insulin, are extracellular proteins that transmit a signal from the cell in which they were synthesized to other cells in distant tissues. Others are membrane proteins that act as receptors whose main function is to bind a signaling molecule and induce a biochemical response in the cell. Many receptors have a binding site exposed on the cell surface and an effector domain within the cell, which may have enzymatic activity or may undergo a conformational change detected by other proteins within the cell.
Antibodies are protein components of the adaptive immune system whose main function is to bind antigens, or foreign substances in the body, and target them for destruction. Antibodies can be secreted into the extracellular environment or anchored in the membranes of specialized B cells known as plasma cells. While enzymes are limited in their binding affinity for their substrates by the necessity of conducting their reaction, antibodies have no such constraints. An antibody's binding affinity to its target is extraordinarily high.
Many ligand transport proteins bind particular small biomolecules and transport them to other locations in the body of a multicellular organism. These proteins must have a high binding affinity when their ligand is present in high concentrations but must also release the ligand when it is present at low concentrations in the target tissues. The canonical example of a ligand-binding protein is haemoglobin, which transports oxygen from the lungs to other organs and tissues in all vertebrates and has close homologs in every biological kingdom.
Transmembrane proteins can also serve as ligand transport proteins that alter the permeability of the cell's membrane to small molecules and ions. The membrane alone has a hydrophobic core through which polar or charged molecules cannot diffuse. Membrane proteins contain internal channels that allow such molecules to enter and exit the cell. Many ion channel proteins are specialized to select for only a particular ion; for example, potassium and sodium channels often discriminate for only one of the two ions.
Structural proteins
Structural proteins confer stiffness and rigidity to otherwise fluid biological components. Most structural proteins are fibrous proteins; for example, actin and tubulin are globular and soluble as monomers but polymerize to form long, stiff fibers that comprise the cytoskeleton, which allows the cell to maintain its shape and size. Collagen and elastin are critical components of connective tissue such as cartilage, and keratin is found in hard or filamentous structures such as hair, nails, feathers, hooves, and some animal shells.
Other proteins that serve structural functions are motor proteins such as myosin, kinesin, and dynein, which are capable of generating mechanical forces. These proteins are crucial for cellular motility of single-celled organisms and the sperm of many sexually reproducing multicellular organisms. They also generate the forces exerted by contracting muscles.
Methods of study
Main article: Protein methods
As some of the most commonly studied biological molecules, the activities and structures of proteins are examined both in vitro and in vivo. In vitro studies of purified proteins in controlled environments are useful for learning how a protein carries out its function: for example, enzyme kinetics studies explore the chemical mechanism of an enzyme's catalytic activity and its relative affinity for various possible substrate molecules. By contrast, in vivo experiments on proteins' activities within cells or even within whole organisms can provide complementary information about where a protein functions and how it is regulated.
Protein purification
Main article: Protein purification
In order to perform in vitro analyses, a protein must be purified away from other cellular components. This process usually begins with cell lysis, in which a cell's membrane is disrupted and its internal contents released into a solution known as a crude lysate. The resulting mixture can be purified using ultracentrifugation, which separates the various cellular components into fractions containing soluble proteins; membrane lipids and proteins; cellular organelles; and nucleic acids. Precipitation by a method known as salting out can concentrate the proteins from this lysate. Various types of chromatography are then used to isolate the protein or proteins of interest based on properties such as molecular weight, net charge and binding affinity. The level of purification can be monitored using gel electrophoresis if the desired protein's molecular weight is known, by spectroscopy if the protein has distinguishable spectroscopic features, or by enzyme assays if the protein has enzymatic activity.
For natural proteins, a series of purification steps may be necessary to obtain protein sufficiently pure for laboratory applications. To simplify this process, genetic engineering is often used to add chemical features to proteins that make them easier to purify without affecting their structure or activity. Here, a "tag" consisting of a specific amino acid sequence, often a series of histidine residues (a "His-tag"), is attached to one terminus of the protein. As a result, when the lysate is passed over a chromatography column containing nickel, the histidine residues ligate the nickel and attach to the column while the untagged components of the lysate pass unimpeded.
Cellular localization
Proteins in different cellular compartments and structures tagged with green fluorescent protein.
The study of proteins in vivo is often concerned with the synthesis and localization of the protein within the cell. Although many intracellular proteins are synthesized in the cytoplasm and membrane-bound or secreted proteins in the endoplasmic reticulum, the specifics of how proteins are targeted to specific organelles or cellular structures are often unclear. A useful technique for assessing cellular localization uses genetic engineering to express in a cell a fusion protein or chimera consisting of the natural protein of interest linked to a "reporter" such as green fluorescent protein (GFP). The fused protein's position within the cell can be cleanly and efficiently visualized using microscopy, as shown in the figure opposite.
Through another genetic engineering application known as site-directed mutagenesis, researchers can alter the protein sequence and hence its structure, cellular localization, and susceptibility to regulation, which can be followed in vivo by GFP tagging or in vitro by enzyme kinetics and binding studies.
Proteomics and bioinformatics
Main articles: Proteomics and Bioinformatics
The total complement of proteins present in a cell or cell type is known as its proteome, and the study of such large-scale data sets defines the field of proteomics, named by analogy to the related field of genomics. Key experimental techniques in proteomics include protein microarrays, which allow the detection of the relative levels of a large number of proteins present in a cell, and two-hybrid screening, which allows the systematic exploration of protein-protein interactions. The total complement of biologically possible such interactions is known as the interactome. A systematic attempt to determine the structures of proteins representing every possible fold is known as structural genomics.
The large amount of genomic and proteomic data available for a variety of organisms, including the human genome, allows researchers to efficiently identify homologous proteins in distantly related organisms by sequence alignment. Sequence profiling tools can perform more specific sequence manipulations such as restriction enzyme maps, open reading frame analyses for nucleotide sequences, and secondary structure prediction. From these data, phylogenetic trees can be constructed and evolutionary hypotheses regarding the ancestry of modern organisms and the genes they express can be developed, using special software like ClustalW. The field of bioinformatics seeks to assemble, annotate, and analyze genomic and proteomic data, applying computational techniques to biological problems such as gene finding and cladistics.
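As one example of the sequence manipulations mentioned above, an open reading frame analysis can be sketched as a scan for stretches that begin with a start codon and run, in the same frame, to a stop codon. The version below is deliberately minimal. It examines only the three frames of the given strand, ignores the reverse complement, and uses an arbitrary minimum length. The function name `find_orfs` and the test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of ATG...stop frames on the given strand.

    end is exclusive and includes the stop codon; min_codons counts the start
    codon but not the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGGCTTGAATGAAACCCTAGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
# 2 11 ATGGCTTGA
# 11 23 ATGAAACCCTAG
```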
Structure prediction and simulation
Complementary to the field of structural genomics, protein structure prediction seeks to develop efficient ways to provide plausible models for proteins whose structures have not yet been determined experimentally. The most successful type of structure prediction, known as homology modeling, relies on the existence of a "template" structure with sequence similarity to the protein being modeled; structural genomics' goal is to provide sufficient representation in solved structures to model most of those that remain. Although producing accurate models remains a challenge when only distantly related template structures are available, it has been suggested that sequence alignment is the bottleneck in this process, as quite accurate models can be produced if a "perfect" sequence alignment is known.[14] Many structure prediction methods have served to inform the emerging field of protein engineering, in which novel protein folds have already been designed.[15] A more complex computational problem is the prediction of intermolecular interactions, such as in molecular docking and protein-protein interaction prediction.
The processes of protein folding and binding can be simulated using techniques derived from molecular dynamics, which increasingly take advantage of distributed computing as in the Folding@Home project. The folding of small alpha-helical protein domains such as the villin headpiece[16] and the HIV accessory protein[17] has been successfully simulated in silico, and hybrid methods that combine standard molecular dynamics with quantum mechanics calculations have allowed exploration of the electronic states of rhodopsins.[18]
Nutrition
Further information: Protein in nutrition
Most microorganisms and plants can biosynthesize all 20 standard amino acids, while animals must obtain some of the amino acids from the diet.[10] Key enzymes in the biosynthetic pathways that synthesize certain amino acids - such as aspartokinase, which catalyzes the first step in the synthesis of lysine, methionine, and threonine from aspartate - are not present in animals. The amino acids that an organism cannot synthesize on its own are referred to as essential amino acids. (This designation is often used to specifically identify those essential to humans.) If amino acids are present in the environment, most microorganisms can conserve energy by taking up the amino acids from the environment and downregulating their own biosynthetic pathways. Bacteria are often engineered in the laboratory to lack the genes necessary for synthesizing a particular amino acid, providing a selectable marker for the success of transfection, or the introduction of foreign DNA.
In animals, amino acids are obtained through the consumption of foods containing protein. Ingested proteins are broken down through digestion, which typically involves denaturation of the protein through exposure to acid and degradation by the action of enzymes called proteases. Ingestion of essential amino acids is critical to the health of the organism, since the biosynthesis of proteins that include these amino acids is inhibited by their low concentration. Amino acids are also an important dietary source of nitrogen. Some ingested amino acids, especially those that are not essential, are not used directly for protein biosynthesis. Instead, they are converted to carbohydrates through gluconeogenesis, which is also used under starvation conditions to generate glucose from the body's own proteins, particularly those found in muscle.
History
Further information: History of molecular biology
Proteins were recognized as a distinct class of biological molecules in the eighteenth century by Antoine Fourcroy and others, distinguished by the molecules' ability to coagulate or flocculate under treatment with heat or acid. Noted examples at the time included albumen from egg whites, blood serum albumin, fibrin, and wheat gluten. Dutch chemist Gerhardus Johannes Mulder carried out elemental analysis of common proteins and found that nearly all proteins had the same empirical formula. The term "protein" to describe these molecules was proposed in 1838 by Mulder's associate Jöns Jakob Berzelius. Mulder went on to identify the products of protein degradation, such as the amino acid leucine, for which he found a (nearly correct) molecular weight of 131 Da.
Proteins were hard for early protein biochemists to study because they were difficult to purify in large quantities. Hence, early studies focused on proteins that could be purified in bulk, e.g., those of blood, egg white, various toxins, and digestive/metabolic enzymes obtained from slaughterhouses. In the late 1950s, the Armour Hot Dog Co. purified 1 kg (one million milligrams) of bovine pancreatic ribonuclease A and made it freely available to scientists around the world.
Linus Pauling is credited with the successful prediction of regular protein secondary structures based on hydrogen bonding, an idea first put forth by William Astbury in 1933. Later work by Walter Kauzmann on denaturation, based partly on previous studies by Kaj Linderstrøm-Lang, contributed an understanding of protein folding and structure mediated by hydrophobic interactions. In 1949 Fred Sanger correctly determined the amino acid sequence of insulin, thus conclusively demonstrating that proteins consisted of linear polymers of amino acids rather than branched chains, colloids, or cyclols. The first atomic-resolution structures of proteins were solved by X-ray crystallography in the 1960s and by NMR in the 1980s. As of 2006, the Protein Data Bank has nearly 40,000 atomic-resolution structures of proteins. In more recent times, cryo-electron microscopy of large macromolecular assemblies and computational protein structure prediction of small protein domains are two methods approaching atomic resolution.