Recapitulation

We are currently in the middle of a series on counting principles. At the conclusion of the previous post, I had promised you that we would look at the idea of derangements. However, I realized that we need to lay some ground work before we get to derangements. When we count, how do we ensure that we are not double counting any element? In other words, “How do we ensure that we count every item without counting any item more than once?” Anyone who is involved in any kind of statistical study needs to pay attention to this so that each data point is given the same weightage. So let me introduce you to the inclusion-exclusion principle.

Basics with Two Categories

We begin with the most basic set-up with all items classified into two categories (say A and B). Since items can have multiple classifications, we can represent this using the Venn diagram below:

The Venn diagram has three regions labelled 1, 2, and 3. Set A includes regions 1 and 2, while set B includes regions 2 and 3. If we just add the number of items in sets A and B, we can see that the central region (2) will be counted twice. In other words, the items that belong in both sets will be double counted. To avoid this, we need to subtract the items that belong in the intersection of the two sets.

This can be written as

The last term on the right side of the equation recognizes that some items have been ‘included’ more than they need to be and explicitly ‘excludes’ them from the count. This adjustment by excluding what was included too many times is what gives the principle its name.

Counting with Three Categories

Of course, we know that demographics includes many more than two categories. So how would the inclusion-exclusion change if we had three categories? The Venn diagram below indicates the situation.

As shown in the diagram, there are now 7 distinct regions. Set A includes regions 1, 2, 4, and 5. Set B includes regions 2, 3, 4, and 6. And set C includes regions 4, 5, 6, and 7. If we add sets A, B, and C, we will count regions 2, 5, and 6 twice and region 4 thrice. We can proceed as before and exclude the intersection regions of two sets. But since region 4 belongs to A∩B, B∩C, and C∩A, we will have excluded region 4 thrice. This means that regions 1, 2, 3, 5, 6, and 7 are now counted once, but region 4 is not counted. Of course, it is a simple matter to add it back to include it. This is represented as

We have encountered the terms like the ones in red before in the case where we had only two categories, A and B, and had to compensate for overcounting the regions in the intersection of the two sets. The term in green is a new term, introduced because, by compensating for the overcounting of the regions in the intersections of pairs of sets, we ended up undercounting the intersection of all three sets.

Extension to Four Categories

What would happen if there were more than 3 categories? Unfortunately, a 2 dimensional representation will no longer suffice. So let us label the regions as follows: 1: only A; 2: only B; 3: only C; 4: only D; 5: only A and B; 6: only A and C; 7: only A and D; 8: only B and C; 9: only B and D; 10: only C and D; 11: only A, B, and C; 12: only A, B, and D; 13: only A, C, and D; 14: only B, C, and D; and 15: A, B, C, and D.

Now set A includes regions 1, 5, 6, 7, 11, 12, 13, and 15. Similarly set B includes regions 2, 5, 8, 9, 11, 12, 14, and 15. Set C includes regions 3, 6, 8, 10, 11, 13, 14, and 15, And D includes regions 4, 7, 9, 10, 12, 13, 14, and 15. Now, when we just add A, B, C, and D, the regions 5 to 10 will be counted twice, regions 11, 12, 13, and 14 will be counted thrice, and region 15 four times. The over counting for regions 5 to 10 would need an ‘exclusion’ by subtracting the intersection of pairs of sets. However, this would mean that the regions that involve the intersection of exactly three sets (i.e. 11, 12, 13, and 14) would have been ‘excluded’ three times over, meaning that they would not be ‘included’ anymore. It would also mean that the region 15 has been counted negative 2 times! This can be adjusted by now ‘including’ these four regions. However, this means that region 15 is now ‘included’ 2 times. Hence, we make a final adjustment and subtract the intersection of all four sets. This is represented as

Here the red terms represent the adjustments for regions 5 to 10, the green ones for regions 11 to 14 and the blue one for region 15.

Generalization

If we continue in this manner, adding more categories, we will be able to demonstrate that

This can be written in a more compact form as

I recommend you take a close look at how the last two equations are actually the same as we will need such ‘parsing’ of formulas in future posts!

What we have seen here is that we can recursively ‘exclude’ intersections with greater number of sets and in so doing avoid the twin traps of overcounting and undercounting. With this foundation, in the next post we will look at a way of counting ‘derangements’. Stay grounded till then.

Posted in

Leave a comment