Last week’s post started out as just an example to use in a post about DNA in genealogy. The example grew and grew until it became a post of its own, so this week I’ll backtrack and just discuss a bit about DNA. This post has in turn grown out of control and I’ll have to cut off the end so there will be at least one more.
I’m not going to dive into the depths. The amount of material is staggering and probably best studied on a “need to know” basis. Personally, I have an SNP that causes a null reading at marker DYS439—important to my genealogical problem, gibberish to the vast majority who will never need to understand that statement. I’ll try instead to write about some things to think about in general.
Definitions and Disclaimers
Different tests and methods are used depending on the goal of the DNA testing, so it is important to know exactly what is being discussed. What I mean when I talk about using DNA testing for genealogy is DNA testing that can help to understand your pedigree several generations back. I don’t mean tests that attempt to identify ethnicity. They may be fascinating but they are a different endeavor. I also don’t mean tests for adoption research or paternity. What is relevant for a few generations is clearly relevant for one generation, but when only one generation separates those involved, techniques are available that cannot be used when the ancestors in question are many generations back. The status of the paper trail is also going to be quite different between a paternity test and a hunt for a great-great-grandfather.
Disclaimer: I am not a geneticist, nor have I ever played one on TV. I do find it fascinating, so this will be, to some extent, me thinking “out loud” as I work on patching the holes in my knowledge of genetics. I hope that this information is accurate. If it isn’t, it shows some of the difficulties in finding accurate and up-to-date answers to questions. Anyone who detects the faintly rotten smell of old information, please let me know. DNA testing is a very rapidly changing field and there is much information out there in cyberspace without a freshness date.
One frustration I need to mention is that explanations of DNA testing seem to always skip the math. I like to see probability distributions and understand the variables being used. I like to see the different mutation rates for different markers and know what the errors on those rates are. I like to know what kinds of simulations are being used. Often one reads something that makes sense on the surface but with a little thought, it isn’t clear exactly how it should be interpreted.
Not all DNA is Created Equal
When it comes to how DNA is passed down from parents to children, it seems we can put the DNA into four different classes, each with its own inheritance behavior.
Nearly all of the DNA in our cells is stored in the cell’s nucleus. There is one exception and it gives us the first of the four classes.
Mitochondria are organelles (subunits of a cell that perform specific duties). Mitochondria are unusual in that they have their own DNA. Mitochondria are passed on only from mother to child (there may be extraordinarily rare cases when a father also passes on mitochondria). The mitochondrial DNA is one class.
The vast majority of human DNA is contained in the cell nucleus and is grouped into pairs of chromosomes. Humans have twenty-three such pairs. In twenty-two of those pairs the two chromosomes are of the same type and are nearly identical. One pair is special because it consists of either two nearly identical X chromosomes in females or an X and a Y chromosome in males. Because only males have a Y chromosome, it is only passed along the paternal line. That makes it special. It is the second class.
Unlike the Y chromosome, the twenty-two “normal” pairs of chromosomes, called autosomes, consist of one copy from the mother and one from the father. Nevertheless, if you examine any one of these chromosomes in one person, it will not be identical to any chromosome found in either parent. A crude analogy is that an autosome is like a deck of cards. A pair of autosomes is like two decks of cards. When it comes time to pass along an autosome, the decks are cut and the top of one deck is exchanged for the top of the other. One of those new decks is the autosome that is passed on by that parent. For each child the cut procedure is repeated. No two decks will be exactly alike. This is the third class.
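The deck-cutting analogy can be played out in a few lines of Python. This is only a sketch of the analogy, not of real biology: it assumes a single cut per chromosome (real recombination can involve more than one crossover), and the names `cut_and_swap` and `pass_on_autosome` and the ten-card decks are made up for illustration.

```python
import random

def cut_and_swap(deck_a, deck_b, rng):
    """Cut both decks at the same random point and exchange the tops."""
    cut = rng.randrange(1, len(deck_a))  # never cut at the very top or bottom
    new_a = deck_b[:cut] + deck_a[cut:]
    new_b = deck_a[:cut] + deck_b[cut:]
    return new_a, new_b

def pass_on_autosome(deck_a, deck_b, rng):
    """Return the one shuffled deck that a parent hands to a child."""
    return rng.choice(cut_and_swap(deck_a, deck_b, rng))

rng = random.Random(1)       # fixed seed so a rerun repeats the same cuts
from_mother = ["M"] * 10     # the parent's copy inherited from their mother
from_father = ["P"] * 10     # the parent's copy inherited from their father

# The cut is repeated for each child, so each child gets its own mixture.
for child in range(3):
    print("".join(pass_on_autosome(from_mother, from_father, rng)))
```

Each printed line is one child's deck, a run of cards from one grandparent followed by a run from the other. The position of the switch depends on where the cut happened to fall.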
The last source is the X chromosome. It behaves in a way that is somewhere in between the Y chromosome and the autosomes. A man cannot mix his X chromosomes because he has only one. All his daughters will receive his X chromosome just as his sons receive his Y chromosome. A woman has two X chromosomes and all her offspring receive a mixed X chromosome from her. The case of a man’s X chromosome being passed to his daughters is not quite analogous to the case of the Y. His daughters may be carrying exact copies of his X chromosome but that is only half of their total xDNA. Looking at a daughter’s xDNA doesn’t tell you what her father’s xDNA is in the same way that looking at a son’s yDNA tells you what his father’s yDNA is. In the next generation, his daughters will shuffle his X chromosome with the X chromosome they got from their mother and therefore, his X chromosome does not travel down the generations the way his Y chromosome does.
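The two rules above (a man's single X comes from his mother, a woman's two X chromosomes come one from each parent) can be run backwards up a pedigree to count how many ancestors in a given generation could have contributed any X DNA at all. The sketch below is only bookkeeping on those two rules; the function name `x_ancestors` is my own.

```python
def x_ancestors(sex: str, generations_back: int) -> int:
    """Count ancestors that many generations back who could have passed down X DNA."""
    # Start with the tested person: one male or one female.
    males, females = (1, 0) if sex == "male" else (0, 1)
    for _ in range(generations_back):
        # Each male's X came from his mother only.
        # Each female's X chromosomes came from her father and her mother.
        males, females = females, males + females
    return males + females

for n in range(1, 6):
    total = 2 ** n  # every ancestor in that generation
    print(n, x_ancestors("male", n), x_ancestors("female", n), "of", total)
```

For a man the counts run 1, 2, 3, 5, 8 over the first five generations, against 2, 4, 8, 16, 32 ancestors in total, so the X chromosome narrows the field of candidates but not nearly as sharply as the Y does.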
yDNA, Mutations and Patterns
Beyond being passed only from father to son, the Y chromosome is special because it never goes through recombination, the shuffling process that happens to the other chromosomes. In a simplified situation without mutations, a male’s yDNA would be exactly the same as his father’s. Because that is true generation after generation, testing a son’s yDNA would be the same as testing his father’s or his father’s father’s father’s or any number of fathers back in time.
Of course, mutations do occur and in genealogical testing they are used to advantage. They provide a “genetic clock” that can allow an estimate of how many generations separate a pair of males from their most recent common ancestor along their paternal lines.
For simplicity, if I pick the ridiculously high probability of mutation of 50% per generation for one genetic marker, we can do some calculations. I would have a 50% chance of having the DNA at that marker being different from the same marker in my father’s yDNA. My son would have a 50% chance of showing a difference in that marker from what I have. If we compared a large number of grandsons to their paternal grandfathers, we would find that 25% matched their grandfathers at that marker and 75% did not. We’d find that because half would have had no mutation occur between grandfather and father. If we look just at that half of the grandsons, half of them would also show no mutation between father and son. That gives one quarter of the full sample with no mutation after two generations. There would be three other possibilities that all would be a quarter of the sample and would all show a difference between grandson and grandfather. One quarter would have a mutation between son and father but none between father and grandfather, one quarter would have a mutation only between father and grandfather and one quarter would have a mutation at both steps. Those three quarters add up to 75% of the grandsons having that marker different from their paternal grandfathers.
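The four cases can be enumerated mechanically. The Python sketch below does that with the same made-up 50% rate and, like the paragraph above, counts a mutation at both steps as a difference (it ignores the chance that a second mutation undoes the first). The function names are mine.

```python
from itertools import product

def two_step_outcomes(p: float) -> dict:
    """Probability of each (grandfather->father, father->son) mutation pattern."""
    outcomes = {}
    for first, second in product((False, True), repeat=2):
        prob = (p if first else 1 - p) * (p if second else 1 - p)
        outcomes[(first, second)] = prob
    return outcomes

def no_mutation_probability(p: float, generations: int) -> float:
    """Chance that one marker passes unchanged through every generation."""
    return (1 - p) ** generations

outcomes = two_step_outcomes(0.5)
match = outcomes[(False, False)]                             # no mutation at either step
differ = sum(prob for steps, prob in outcomes.items() if any(steps))
print(match, differ)                                         # 0.25 0.75
print(no_mutation_probability(0.5, 3))                       # 0.125
```

The second function is the shortcut: the matching fraction is just the per-generation chance of no mutation multiplied by itself once per generation, so a third generation would leave only one eighth still matching.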
What one really wants to do with DNA testing is to go the other direction. That is, not to calculate what a test result might be but to take a result and calculate what it might mean. In reality, mutation rates are much lower and more than one marker is tested. Slower mutation rates mean that the probability of any one marker matching between father and son is extremely high, nothing like the 50% that I used to keep the numbers simple. Using multiple markers achieves two things. First, it guards against chance matches. Saying that two men match at one marker is not particularly meaningful. That can happen by random chance. Using too few markers leads to what are known as false positives: it would look like two men had a common male ancestor in recent times when in fact their ancestry was totally different. A whole pattern of markers that match is much more significant. Think of finding a scrap of cloth at a crime scene. If all the detective could say in court is that she saw a red scrap of cloth at the scene and the suspect owned a torn red shirt, it would not be a very believable match on its own. If, on the other hand, the scrap showed a complex plaid in unusual colors and the suspect had a torn shirt with exactly that plaid pattern, it would be much, much more indicative of possible guilt.
The other reason to use many markers is to increase the chances of finding a mutation. The greater the chance of finding mutations, the more accurate the genetic clock can be in estimating the number of generations. This is, as far as I understand it, a significant practical difference between yDNA testing and mtDNA testing for genealogical purposes.
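To get a feel for how much extra markers help, here is a rough Python sketch of the chance of seeing at least one mutation somewhere in a panel. It leans on assumptions that are mine and not from any testing company: every marker mutates independently at the same rate, and the rate of 0.003 per marker per generation is a placeholder picked only to make the arithmetic visible. Real markers have different rates.

```python
def prob_at_least_one_mutation(markers: int, generations: int, rate: float) -> float:
    """Chance that at least one marker changed somewhere along the line."""
    transmissions = markers * generations      # every marker is copied once per generation
    return 1.0 - (1.0 - rate) ** transmissions

ASSUMED_RATE = 0.003   # hypothetical per-marker, per-generation rate

for markers in (12, 37, 111):
    chance = prob_at_least_one_mutation(markers, generations=5, rate=ASSUMED_RATE)
    print(f"{markers:>3} markers, 5 generations: {chance:.0%} chance of at least one mutation")
```

Whatever rate is plugged in, the pattern is the same: with few markers most lines show no change at all over a handful of generations, so the clock has nothing to count, while a large panel is much more likely to have ticked at least once.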
mtDNA and Genetic Clocks—Shouldn’t You Have Mutated by Now?
Much of what is true for yDNA is true for mtDNA. Both are transmitted only along a path that involves a single gender. Neither is shuffled. That means that if it were not for mutations, a child’s mtDNA would match the mother perfectly just as a son’s yDNA would match the father perfectly.
I shouldn’t skip over one difference between the two tests that every beginner in this business needs to know. That difference is in who can be tested. Anyone can take an mtDNA test that will tell them about their mother’s mtDNA. Only males can take a yDNA test that will tell them about their father’s yDNA. Never before have the brothers of genealogists been so popular.
The difference I really want to think about is mutation rate. As far as I understand, mutations occur more frequently in yDNA than in mtDNA, so the genetic clocks carry different information. You can think of it like this. Imagine you have absolutely no sense of time and you are going to be in an experiment. There are two rooms with two very different clocks. You get to be in each room for a short amount of time, not more than two minutes, and you will be asked how long you were in the room once you leave. The catch is that you aren’t allowed to look at the room’s clock. How would you know how long you were in the room? You could count the ticks you hear. If you know the clock in the first room ticks once per second and you count ten ticks then you can say that you were in the room for between just over nine seconds and just under eleven seconds. On the two-minute scale of the experiment, that is pretty accurate even without looking. The clock in the other room only ticks once per minute. Now you have a problem. If the experimenters put you in the room for only ten seconds, you might hear one tick, you might not. If you hear a tick, what does it mean? You might have been in the room only for the fraction of a second that it took the clock to tick or you might have been in the room for almost the maximum two minutes. That is not very good accuracy. The scale at which the clock runs was not appropriate for the timescale that you were interested in. If you repeated the experiment but were put in the room for a few hours, you’d quickly lose track of the fast ticking clock but you could give a decent answer about how long you were in the room with the slow ticking clock. Different genetic clocks show this kind of different timescale.
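The two-room experiment comes down to a small piece of arithmetic, sketched below in Python. The function name `elapsed_bounds` is my own, and the sketch assumes a perfectly regular clock, which a genetic clock certainly is not.

```python
def elapsed_bounds(ticks_heard: int, seconds_per_tick: float) -> tuple:
    """Shortest and longest stay consistent with the number of ticks heard.

    The true time lies strictly between the two values returned.
    """
    shortest = max(ticks_heard - 1, 0) * seconds_per_tick
    longest = (ticks_heard + 1) * seconds_per_tick
    return shortest, longest

print(elapsed_bounds(10, 1))    # fast clock, ten ticks: (9, 11)
print(elapsed_bounds(1, 60))    # slow clock, one tick: (0, 120)
```

Ten ticks of the fast clock pin the stay down to a two-second window. One tick of the slow clock leaves the whole two minutes open, which is the whole problem with using a slow clock on a short timescale.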
That was an extreme illustration of the kind of difference that one encounters between the yDNA and the mtDNA clocks. The mutation rate of yDNA means that it can give some discrimination on the scale of the number of generations that we frequently run into in genealogy. In the case of mtDNA, the hypervariable regions change more slowly. That makes them a great clock for long stretches of time but for shorter stretches it means the clock has to be used differently. I’ve seen it written over and over that if two people have a match in the mtDNA they probably have a recent common ancestor in their maternal lines. If they don’t quite match, then it is hard to really say much about how many generations back the common ancestor might be. Not being very sure how to interpret that statement, I looked for some timescale related numbers.
Using some numbers from Family Tree DNA:
- A match at Hypervariable Region 1 means a 50% chance that two people have a most recent common maternal ancestor within about 1,300 years, so since 700 AD.
- A match in both hypervariable regions means a 50% chance of that most recent common maternal ancestor having lived since 1300 AD.
- A match on the full genomic sequence means a 50% chance of the common maternal ancestor falling within the last 5 generations. That is an impressive improvement and not bad for a timescale, but a probability of 50% is not very high. Nevertheless, it does mean that half the time the match will be in a reasonable number of generations. At a 90% confidence level the most recent common maternal ancestor is within 16 generations. Notice that these levels and generations don’t scale in a simple way. Raising the confidence level by 40 percentage points, from 50% to 90%, increases the number of included generations by 11. Also note that because this is the maternal line, if one really needs to extend the paper trail out that number of generations, it will be quite a challenge. On the other hand, the bigger the challenge the bigger the potential reward, but it is something to keep in mind: not all paper trails are created equal.
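One way to see why confidence and generations need not scale together is a toy model. To be clear, this is my own assumption and not Family Tree DNA’s method: suppose the chance that the common ancestor lies further back than g generations shrinks by the same factor with every added generation. The Python sketch below picks that factor (a made-up value) so that the 50% level lands at 5 generations, then asks where the 90% level falls.

```python
import math

def generations_at_confidence(confidence: float, per_generation_factor: float) -> float:
    """Generations g at which P(ancestor within g) = 1 - factor**g reaches `confidence`."""
    return math.log(1.0 - confidence) / math.log(per_generation_factor)

# Hypothetical factor, chosen only so that 50% confidence falls at 5 generations.
PER_GENERATION_FACTOR = 0.5 ** (1 / 5)

g50 = generations_at_confidence(0.50, PER_GENERATION_FACTOR)
g90 = generations_at_confidence(0.90, PER_GENERATION_FACTOR)
print(round(g50, 1), round(g90, 1))   # 5.0 16.6
```

In this toy model the 90% level always sits a little over three times further back than the 50% level, whatever factor is chosen, because the ratio depends only on the logarithms of the two leftover probabilities (10% and 50%). That is in the same neighborhood as the jump from 5 to 16 generations quoted above, but the real calculation behind those numbers may well be different.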
If I look at Family Tree DNA’s numbers for a very fine-grained test of yDNA with 111 markers, a perfect match implies 2 generations of separation at a 50% confidence level. From that starting point every marker that differs implies about one more generation of separation at a 50% confidence level. This test uses more markers than I had heard of being used until I started to prepare this post, and it helps make an important distinction. The accuracy of any test designed to find matches between things depends on two factors: the technology used for the test and the amount of information that is theoretically available. In the case of yDNA the number of available markers seems to be increasing all the time. That implies that the limiting factor is still the technology. Because it is now possible to test the full mtDNA sequence, my understanding is that for mtDNA the limiting factor is now the information itself.
One thing I have not run across in the yDNA numbers is the other end of a time interval. I would think that at some small number of differences it begins to be possible to say that the most recent common paternal ancestor of two people probably lived more than a certain number of generations back.
To be Continued
This is where I have to cut off my genes (so to speak). Next time—triangulation, autosomal DNA and taking ancestral attendance.