Tuesday, August 22, 2017

Excel Finding and Removing Duplicate Values

Removing Duplicate Values in an Excel List

We've all encountered the situation. We're
scrubbing data, fully expecting a list of 43,137 clients, and we end up with 43,228 lines of data.

Using Conditional Formatting in Excel, we are able to highlight the duplicate values. This doesn't, however, accomplish our end goal of getting rid of garbage that clearly exists. Here's a quick step-by-step on how we can remove those duplicate rows of data painlessly - without any VBA programming.

The biggest caveat to bring up - be on board with what we are calling a duplicate. "Rob" is not "Robert" nor "Bob" or "Bobbie". "Robert " with an inconvenient space after the name is not the same as "Robert" with no space. In discussing duplicate data, here we are talking a complete duplicate, not what we'll call human 'close enough' data.

With that out of the way...

I do enjoy playing with data a bit. In some cases, we're not interested in getting rid of the duplicates, but just to know if we have duplicates.

How to know if we have duplicates

There's an extraordinarily easy way to find duplicates in Excel (that I've only just learned of recently myself).
1 - Select the range of interest
2 - In the Home tab of the Excel ribbon (Excel 2016), click Conditional Formatting -> Highlight Cells Rules -> Duplicate Values
3 - Pick your absolute favorite format in the list in the presented dropdown box
4 - Click OK
5 - VoilĂ 

Pretty straightforward, right? Easy stuff. What if, however, we want to know how many times a value occurs in a list? For this, conditional formatting just isn't enough. Lucky for us, Excel does have some nice functions in there. In this case, we'll use the COUNTIF function.

Counting Duplicate Values

1 - Load your list that (potentially) has duplicate values.
Figure 1
2 - In a convenient cell next to the first line of your data, enter, "=COUNTIF(A:A,A2)" (Figure 1). This presumes the list of data is in the first column, A, and we are in column B. If this is not precisely the case with your data, make the adjustments to the column reference (A:A) and the criteria reference (A2). Press enter.
3 - Copy this formula to the full length of the list.

In the cells we have just copied down, we'll have the count of each value, and how many times that value occurs in the list - thus the name of the function, COUNTIF.

All this is great information, but not what we're after. We want to remove duplicate values.

Finding and Removing Duplicate Values

Going back to our original list, we may have gone through the exercises above, and decided that, indeed, we have duplicates, and we really do simply want those duplicates removed.

1 - Sort the list ascending. This will cause all the duplicate values to be sorted together.
2 - In a convenient cell next to our list, enter, "=A2=A1". This presumes the actual top of the list is in A2 (because, perhaps we have a header line in A1, as is in Figure 2). What we are asking Excel to do, is see whether the value in A2 is equal to the value in A1. If it is equal, it will return TRUE. If it is not equal, it will return FALSE.
Figure 3
3 - Copy this formula the entire length of the list. You will now have a column filled with TRUE and FALSE. You will note that each FALSE is at the FIRST unique value before duplicates. (Figure 3). If there are no duplicates, that unique value is still reported as FALSE. Otherwise stated, each FALSE represents a unique value in the list. Each TRUE is a duplication.
4 - We want to make these TRUE and FALSE cells values. We will see why in a moment. Copy the range of cells that contain our formula.
5 - Paste Special right where you are - we're pasting right back to the cells we just copied. Select the Values option button. Click OK. We now have a bunch of TRUE and FALSE's that are not formulas, they are just TRUE or FALSE.


6 - Sort this entire range (both your original list and the TRUE and FALSE list) by this TRUE/FALSE column.
7 - All of the unique items sort to the top of the list (FALSE), and all the duplicate items fall to the bottom of the list (TRUE).
8 - Delete all the rows at the bottom of the list that are TRUE. You can now (also) remove the TRUE/FALSE column that we put in place for this purpose. Congratulations! You've killed all those duplicates.

No comments:

Post a Comment

Excel Finding and Removing Duplicate Values

Removing Duplicate Values in an Excel List We've all encountered the situation. We're scrubbing data, fully expecting a list of...