Tuesday, August 22, 2017

Excel Finding and Removing Duplicate Values

Removing Duplicate Values in an Excel List

We've all encountered the situation. We're
scrubbing data, fully expecting a list of 43,137 clients, and we end up with 43,228 lines of data.

Using Conditional Formatting in Excel, we are able to highlight the duplicate values. This doesn't, however, accomplish our end goal of getting rid of garbage that clearly exists. Here's a quick step-by-step on how we can remove those duplicate rows of data painlessly - without any VBA programming.

The biggest caveat to bring up - be on board with what we are calling a duplicate. "Rob" is not "Robert" nor "Bob" or "Bobbie". "Robert " with an inconvenient space after the name is not the same as "Robert" with no space. In discussing duplicate data, here we are talking a complete duplicate, not what we'll call human 'close enough' data.

With that out of the way...

I do enjoy playing with data a bit. In some cases, we're not interested in getting rid of the duplicates, but just to know if we have duplicates.

How to know if we have duplicates

There's an extraordinarily easy way to find duplicates in Excel (that I've only just learned of recently myself).
1 - Select the range of interest
2 - In the Home tab of the Excel ribbon (Excel 2016), click Conditional Formatting -> Highlight Cells Rules -> Duplicate Values
3 - Pick your absolute favorite format in the list in the presented dropdown box
4 - Click OK
5 - VoilĂ 

Pretty straightforward, right? Easy stuff. What if, however, we want to know how many times a value occurs in a list? For this, conditional formatting just isn't enough. Lucky for us, Excel does have some nice functions in there. In this case, we'll use the COUNTIF function.

Counting Duplicate Values

1 - Load your list that (potentially) has duplicate values.
Figure 1
2 - In a convenient cell next to the first line of your data, enter, "=COUNTIF(A:A,A2)" (Figure 1). This presumes the list of data is in the first column, A, and we are in column B. If this is not precisely the case with your data, make the adjustments to the column reference (A:A) and the criteria reference (A2). Press enter.
3 - Copy this formula to the full length of the list.

In the cells we have just copied down, we'll have the count of each value, and how many times that value occurs in the list - thus the name of the function, COUNTIF.

All this is great information, but not what we're after. We want to remove duplicate values.

Finding and Removing Duplicate Values

Going back to our original list, we may have gone through the exercises above, and decided that, indeed, we have duplicates, and we really do simply want those duplicates removed.

1 - Sort the list ascending. This will cause all the duplicate values to be sorted together.
2 - In a convenient cell next to our list, enter, "=A2=A1". This presumes the actual top of the list is in A2 (because, perhaps we have a header line in A1, as is in Figure 2). What we are asking Excel to do, is see whether the value in A2 is equal to the value in A1. If it is equal, it will return TRUE. If it is not equal, it will return FALSE.
Figure 3
3 - Copy this formula the entire length of the list. You will now have a column filled with TRUE and FALSE. You will note that each FALSE is at the FIRST unique value before duplicates. (Figure 3). If there are no duplicates, that unique value is still reported as FALSE. Otherwise stated, each FALSE represents a unique value in the list. Each TRUE is a duplication.
4 - We want to make these TRUE and FALSE cells values. We will see why in a moment. Copy the range of cells that contain our formula.
5 - Paste Special right where you are - we're pasting right back to the cells we just copied. Select the Values option button. Click OK. We now have a bunch of TRUE and FALSE's that are not formulas, they are just TRUE or FALSE.


6 - Sort this entire range (both your original list and the TRUE and FALSE list) by this TRUE/FALSE column.
7 - All of the unique items sort to the top of the list (FALSE), and all the duplicate items fall to the bottom of the list (TRUE).
8 - Delete all the rows at the bottom of the list that are TRUE. You can now (also) remove the TRUE/FALSE column that we put in place for this purpose. Congratulations! You've killed all those duplicates.

Monday, August 21, 2017

Excel VBA Function - Convert Number to Text

Converting Number to Text

I had run across a question in a user group not too long ago, asking about how to change an arbitrary number to text (such as, "101" to "one hundred one"). Up for the challenge, here below is a short function for Excel VBA that will do just this.

If you are interested in this post strictly for the code, you are welcome to skip to the code. If you are using this code, please include the credit lines.

Figure 1
In dissecting the problem, we can draw from our own experiences on how to handle things. Beginning from zero, we know each single digit (0,1,2,3,4,5,6,7,8,9) has a name, as follows (see Figure 1).

The programmer in us might suggest this is all we need per digit, and that may be true, if we were working in something other than our common decimal system. If we were working in binary, for example, we would suggest the binary number, "10" would be called, "one zero", not what we would say in common English with regard to decimal numbers, "ten". This said, we'll need to continue on with the matrix for our "teen" numbers, 10 through 19 (Figure 2):
Figure 2

 With this tackled, we need to consider the remaining 'tens' from 20-90. The ones place, we have handled in Figure 1, so we are only concerned with how we name our 'tens' (Figure 3):

So far so good. Using what we've just considered, we can write all numbers from 0 through 99.



Figure 3
Moving to the 'hundreds' place, we realize that this digit is simply what we would call the first digit, with 'hundred' after. i.e. "100" would be written as "one hundred"; "900" would be "nine hundred". "876" would be "eight hundred seventy six". With this knowledge, we know we can leverage what we've already considered for digits 1 through 9.

We can now write all numbers from 0 (zero) through 999 (nine hundred ninety nine).

Moving now into the thousands place, we have a different pattern to consider. It is very common to separate 'large' numbers into three digit groups separated by commas (in English). These sets of three numbers are called "periods". I looked it up.

A number such as "123" only has one 'period' - we are in the simplest case possible. A number such as "123,456" has two periods - the thousands period and the 'units' period. We do not name the unit period in common speech - it is assumed (we do not say "one hundred twenty three thousand four hundred fifty six unit" - we would say "one hundred twenty three thousand four hundred fifty six").
Armed with this, let's throw down some code. In a blank Excel sheet, press Alt-F11. This should open up the Macro screen. On the new sheet you just made, right click and Insert -> Module.



Public Function DBLTOTEXT(num As Variant, Optional ord As Integer = 0) As String
    Dim majscale() As String, tens() As String, ones() As String
    ones = Split(",one,two,three,four,five,six,seven,eight,nine,ten,eleven,twelve,thirteen,fourteen,fifteen,sixteen,seventeen,eighteen,nineteen",",")
    tens = Split(",ten,twenty,thirty,forty,fifty,sixty,seventy,eighty,ninety", ",")
    majscale = Split(",thousand,million,billion,trillion,quadrillion,quintillion,sextillion,septillion",",")
   
    Dim i As Variant, h As String, t As String, o As String
    Dim u, l As Variant, x As String, n As String

    i = num
   
    Select Case i
        Case 0 To 19
            o = ones(i)
        Case 20 To 99
            t = tens(Int(i / 10))
            o = DBLTOTEXT(i - Int(i / 10) * 10)
        Case 100 To 999
            h = ones(Int(i / 100)) & " hundred"
            t = DBLTOTEXT(i - Int(i / 100) * 100)
        Case Else
            u = Int(i / 1000)
            l = i - Int(i / 1000) * 1000
            x = DBLTOTEXT(u, ord + 1) & " " & DBLTOTEXT(l)
    End Select
    n = Trim(IIf(h = "", "", h & " ") & IIf(t = "", "", t & " ") & IIf(o = "", "", o & " "))
    If n <> "" Then n = n & " " & majscale(ord)
    DBLTOTEXT = Trim(IIf(x = "", "", x & " ") & n)
End Function


Placing this in a Macro in Excel will make it available to the spreadsheet. Close the Macro window and in a blank cell in Excel, type:

=DBLTOTEXT("1023")

If all is well, you should end up with the result of the function, "one thousand twenty three".

A bit of an aside, but ended up being the real question originally (leave it to me to solve a non-existent problem) was assistance in converting dollars and cents to verbiage. This additional function will do just that:



Public Function DOLLTOTEXT(dollars As Double)
    Dim i As Double, f As Double
    i = Int(dollars)
    f = Round(dollars - i, 2) * 100
    DOLLTOTEXT = DBLTOTEXT(i) & " dollars" & IIf(f > 0, " and " & DBLTOTEXT(f) & " cents", "")
End Function


In part 2, we'll begin dissecting the code here and figure out what's going on.


Excel Finding and Removing Duplicate Values

Removing Duplicate Values in an Excel List We've all encountered the situation. We're scrubbing data, fully expecting a list of...