Let's talk sentence-casing in Excel

Posted Fri, May 22 2009 16:28 by Nate Oliver

Good day,

So we probably agree that Excel and/or the VBA IDE aren't the ideal locations to do your word processing, however, it's done there, either way. Both Excel and VBA (the latter being based on VB6) provide us with some tools to help, in terms of lower-casing, upper-casing, and proper-casing (every first letter is capitalized). But what about sentence-casing?

There's no solid native way, that I know of, to do this in Excel, so, we're going to roll our own, naturally. I'm going to post two algorithms I have written in the past to look at this, and keep in mind the intent, to deal with multi-sentence capitalization issues, deal with the varying forms of "I", whether it's I'm, it's, etc... What I don't attempt to speak to is proper nouns, that's beyond what my scope was.

Here is my first take on what I call sCase():


Private Declare Function IsCharLower Lib "user32" Alias "IsCharLowerA" ( _
    ByVal cChar As Byte) As Long

Private Declare Function CharUpperBuffB Lib "user32" Alias "CharUpperBuffA" ( _
    lpsz As Byte, _
    ByVal cchLength As Long) As Long

Public Function sCase(ByRef strIn As String) As String
Dim bArr() As Byte, I As Long, i2 As Long
If strIn = vbNullString Then Exit Function
Let bArr = StrConv(strIn, vbFromUnicode)
CharUpperBuffB bArr(0), 1
For I = 1 To UBound(bArr)
    Select Case bArr(I)
        Case 105
            If Not I = UBound(bArr) Then
                Select Case bArr(I + 1)
                    Case 32, 33, 39, 44, 46, 58, 59, 63, 148, 160
                        If bArr(I - 1) = 32 Then _
                            CharUpperBuffB bArr(I), 1
                End Select
            ElseIf bArr(I - 1) = 32 Or bArr(I - 1) = 160 Then _
                CharUpperBuffB bArr(I), 1
            End If
        Case 33, 46, 58, 63
            For i2 = I + 1 To UBound(bArr)
                If IsCharLower(bArr(i2)) Then
                    CharUpperBuffB bArr(i2), 1
                    I = i2:   Exit For
                End If
                If bArr(i2) <> 32 And bArr(i2) <> 33 And bArr(i2) <> 46 _
                    And bArr(i2) <> 63 And bArr(i2) <> 160 Then
                    I = i2:   Exit For
                End If
            Next
    End Select
Next
sCase = StrConv(bArr, vbUnicode)
End Function


And here is my 2nd:


Public Function sCase(ByRef strIn As String) As String
Dim bArr() As Byte, i As Long, i2 As Long
If strIn = vbNullString Then Exit Function
Let bArr = strIn
Select Case bArr(0)
    Case 97 To 122
        bArr(0) = bArr(0) - 32
End Select
For i = 2 To UBound(bArr) Step 2
    Select Case bArr(i)
        Case 105
            If Not i = UBound(bArr) - 1 Then
                Select Case bArr(i + 2)
                    Case 32, 33, 39, 44, 46, 58, 59, 63, 148, 160
                        If bArr(i - 2) = 32 Then _
                            bArr(i) = bArr(i) - 32
                End Select
            ElseIf bArr(i - 2) = 32 Then _
                bArr(i) = bArr(i) - 32
            End If
        Case 33, 46, 58, 63
            For i2 = i + 2 To UBound(bArr) Step 2
                Select Case bArr(i2)
                    Case 97 To 122
                        bArr(i2) = bArr(i2) - 32
                        i = i2:   Exit For
                End Select
                Select Case bArr(i2)
                    Case 32, 33, 46, 63, 160
                    Case Else
                        i = i2:   Exit For
                End Select
            Next
    End Select
Next
sCase = bArr
End Function


Why would I have two versions? Good question. The 2nd version is a lot faster, perhaps not to the eye, but %-wise. But it has an issue, it's very insensitive to international languages, I suspect it's safe for most US-based text strings. The first UDF, which requires the API calls (2nd does not), attempts to be a little more sensitive to international languages.

Note that my understanding of both English and various international languages is limited in scope, e.g., I tested the first UDF with limited amounts of French, etc... But these might not be perfect - think of them as a start, or approach.

Right, so how do you call these? Here are examples:

Sub testTime()
Debug.Print sCase(LCase$("À LA CARTE")) 'Something
Debug.Print LCase$("À LA CARTE") 'Something
Debug.Print sCase("À LA CARTE") 'Nothing
End Sub


And:


Sub testTime()
Debug.Print sCase(LCase$("hello? erm, i'M only testing, eh. indeed, " & _
    "i am inquisitve."))
Debug.Print sCase(LCase$("how old?! 22 Years."))
Debug.Print sCase(LCase$("how old?! twenty-two Years."))
Debug.Print sCase(LCase$("hmmmm.... wOrking?!?! sam i am. yes-no? " & _
    "isn't i'm isn't."))
Debug.Print sCase(LCase$("THE DAY WAS SUNNY AND I WORE A HAT.PETER WAS THERE "))
Debug.Print sCase(LCase$("no WorRies, i'm ONLY testIng!         yes-no?"))
Debug.Print sCase(LCase$("mY fRiend & i"))
Debug.Print sCase(LCase$("iiiiiiiiiiiiii"))
Debug.Print sCase(LCase$("   T. toast %T i @"))
Debug.Print sCase(LCase$("re: sentences."))
Debug.Print sCase("hello? thought i'd test this for McDonald. NOTHING.")
End Sub


You can also call these in the Workbook, should they be available there, the way you'd call any normal UDF, e.g.,

=scase(A1)

Note my use of LCase(). I tried to bake in some flexibility, in terms of what your data might look like, e.g., It's generally intended for you to lower-case your String, before calling sCase(), but there might be scenarios where your proper nouns are in good shape, and you don't want to do that.

You might also note that I'm using a linear search against a Byte Array, and that this might appear to be slow. Give it a shot, time-wise, before fully coming to that conclusion. I've seen alternative approaches to this problem with Regular Expressions (RegExp) that can't quite reach the performance levels of these algorithms, nor are they as effective. Just my experience.

If you want to dig into Byte Arrays, I'd recommend starting with Bruce McKinney's thoughts:

http://vb.mvps.org/hardcore/html/whatisunicode.htm
http://brucem.mystarband.net/mckinney2b.htm#12

Some of my work, here, was assisted with these resources.

Filed under: , ,

Comments

# re: Let's talk sentence-casing in Excel

Tuesday, June 02, 2009 12:29 AM by fzz

If you want to speed optimize, use another Long variable, assign UBound(bArr) to it, then use that variable rather than UBound calls. Seems VBA doesn't provide constant expression optimizations, i.e., it calls UBound on every iteration even though bArr is invariant once initialized.

Special handling of the English pronoun I isn't technically part of sentence capitalization. You're doing it because you're assuming (as you stated) that you expect the string passed to sCase to have been passed through LCase. Me, I'd figure sentence capitalization should be part of a grammar checker rather than a standalone function. Then one could include logic to ignore lists, e.g.,

i. This is the first item in the list.

ii. This is the second item in the list.

where the "i." in the first line shouldn't be capitalized. On the other hand, the following prose is correct English if overdramatic.

Who was it holding the knife? I.

Ain't English wonderful?!

# re: Let's talk sentence-casing in Excel

Tuesday, June 02, 2009 11:48 AM by Nate Oliver

I'm curious about your comment on UBound() being called on each iteration, as it's not documented as such:

msdn.microsoft.com/.../zxkf5z4b(vs.71).aspx

"The iteration values start, end, and step are evaluated only once, before the loop begins. If your statement block changes end or step, these changes do not affect the iteration of the loop. In the preceding example, the UBound function is called only when the For statement is first executed. If the statement block had subsequently changed the length of the array A, the For loop would still use the original length as the end value, because it does not call UBound again."

# re: Let's talk sentence-casing in Excel

Tuesday, June 02, 2009 2:05 PM by Nate Oliver

Okay, let's see if we agree with this example:

Sub foo()

Dim lngArr() As Long, i As Long

Dim j As Long

ReDim lngArr(1 To 30)

For i = LBound(lngArr) To UBound(lngArr)

   Let j = j + 1

   ReDim Preserve lngArr(1 To 30 + j)

Next

Debug.Print j, UBound(lngArr)

End Sub

Note the counter is 30 iterations, while the upper boundary of lngArr has morphed to 60.

That would seem to indicate Ubound() isn't being called over and over again, on each iteration, or j should be greater than it is?

# re: Let's talk sentence-casing in Excel

Tuesday, June 02, 2009 5:02 PM by fzz

OK, I was wrong about the for loop bounds, but you are calling it within the outer For loop, so you are calling it repeatedly. Those calls aren't optimized.

For i = 2 To UBound(bArr) Step 2

 :

           If Not i = UBound(bArr) - 1 Then

             :

           For i2 = i + 2 To UBound(bArr) Step 2

             :

           Next

 :

Next

On a tangent, if you like added readability, why not include the variable name in the Next statement?

# re: Let's talk sentence-casing in Excel

Tuesday, June 02, 2009 6:30 PM by Nate Oliver

Okay, that is true - my inner-loop UBound() calls aren't optimized. That probably takes almost no time at all, but I agree, as your saying, a Variable would be faster.

I've never used Variable names in my For/Next Loops. I see what you're saying, does that make me a hypocrite?

I've always just used tabbing to determine which Loop I'm stuck in. You might be right, by the way - I probably am being arbitrary.

# re: Let's talk sentence-casing in Excel

Wednesday, June 03, 2009 2:23 PM by Peter Thornton

Yeah manipulating strings via a byte array is very efficient. But as you say there are pitfalls with non English characters and in particular with Unicode wide characters and a non English codepage system.

AFAIK there's never any ambiguity with ASCII 0-127. I'm fairly sure AINSII 128-255 is consitent in Latin languages. So with these caveats, simply checking if the 2nd byte is zero will exclude the majority of characters we need to avoid (also speeds up the loop considerably).

All but one of the characters 224-255 are lower case Latin characters that can be treated the same way as "a-z"

I haven't checked your logic, and not withstanding whether or not it's correct to cater for things like i/I, the following is your routine modified to cater for the things I mentioned. Oh yes, might as well assign UBound -1 to a variable!

Public Function sCase2(ByRef strIn As String) As String

Dim i As Long, i2 As Long, ub As Long

Dim bArr() As Byte

   If Len(strIn) = 0 Then Exit Function

   bArr = strIn

   ub = UBound(bArr) - 1    ' we're not looping the last byte

   If bArr(1) = 0 Then  ' < only if zero

       Select Case bArr(0)

       Case 97 To 122, 224 To 246, 248 To 255  ' some more lcase characters

           bArr(0) = bArr(0) - 32

       End Select

   End If

   For i = 2 To ub Step 2

       If bArr(i + 1) = 0 Then  ' only look at letters with 2nd byte = 0

           Select Case bArr(i)

           Case 105

               If Not i = ub Then

                   Select Case bArr(i + 2)

                   Case 32, 33, 39, 44, 46, 58, 59, 63, 148, 160

                       If bArr(i - 2) = 32 Then _

                          bArr(i) = bArr(i) - 32

                   End Select

               ElseIf bArr(i - 2) = 32 Then _

                      bArr(i) = bArr(i) - 32

               End If

           Case 33, 46, 58, 63

               For i2 = i + 2 To ub Step 2

                   Select Case bArr(i2)

                   Case 97 To 122, 193 To 246, 248 To 255

                       bArr(i2) = bArr(i2) - 32

                       i = i2: Exit For

                   End Select

                   Select Case bArr(i2)

                   Case 32, 33, 46, 63, 160

                   Case Else

                       i = i2: Exit For

                   End Select

               Next

           'Else

              ' it's chr(128-159) or a wide character

           End If

       End Select

   Next

   sCase2 = bArr

End Function

# re: Let's talk sentence-casing in Excel

Wednesday, June 03, 2009 2:38 PM by Nate Oliver

Hi Peter, thanks for joining in the fray! I'll be sure to test that out.

I'm pretty sure the I/i deal is pretty much limited to English. I struggle to recall throughout my French classes where I'd have an upper-case I, as such.

I don't know if it's appropriate, but it was part of the original question that I was attempting to answer... I seem to recall someone familiar showing up in the same thread, now that I think about it. <g>

# re: Let's talk sentence-casing in Excel

Wednesday, July 29, 2009 8:30 AM by James_B

Dude, I love your stuff--both on this blog and elsewhere.  Anyway, just to say, this SentenceCase function is turning out to be useful in the extreme: (when I first read this thread, I doubted this).

For my own purposes, I've had to modify (butcher?) the start and end of it so that it can handle arrays (that can be written back to a worksheet).  There's doubtless a more optimised way of doing such a thing, but this does the job (I think) as a quick workaround.  Cheers again, James.

P.S.  Shout  if I've made some stupid error in the below

Public Function SCase(ByRef sIn As Variant) As Variant

  Dim e As Long

  If IsEmpty(sIn) Then Exit Function

  If TypeName(sIn) = "String" Then sIn = Array(sIn)

  For e = LBound(sIn) To UBound(sIn)

     Dim i As Long, ii As Long, UB As Long

     Dim bArr() As Byte

     If LenB(sIn(e)) = 0 Then Exit For

     bArr = sIn(e)

     UB = UBound(bArr) - 1 'since we're not looping the last byte

     If bArr(1) = 0 Then '<-- only if zero

        Select Case bArr(0)

        Case 97 To 122, 224 To 246, 248 To 255 'some more lcase characters

           bArr(0) = bArr(0) - 32

        End Select

     End If

     For i = 2 To UB Step 2

        If bArr(i + 1) = 0 Then 'only look at letters with 2nd byte = 0

           Select Case bArr(i)

           Case 105

              If i <> UB Then

                 Select Case bArr(i + 2)

                 Case 32, 33, 39, 44, 46, 58, 59, 63, 148, 160

                    If bArr(i - 2) = 32 Then

                       bArr(i) = bArr(i) - 32

                    End If

                 End Select

              ElseIf bArr(i - 2) = 32 Then

                 bArr(i) = bArr(i) - 32

              End If

           Case 33, 46, 58, 63

              For ii = i + 2 To UB Step 2

                 Select Case bArr(ii)

                 Case 97 To 122, 193 To 246, 248 To 255

                    bArr(ii) = bArr(ii) - 32

                    i = ii: Exit For

                 End Select

                 Select Case bArr(ii)

                 Case 32, 33, 46, 63, 160

                    'Do nothing

                 Case Else

                    i = ii: Exit For

                 End Select

              Next ii

           End Select

        End If

     Next

     sIn(e) = CStr(bArr)

  Next e

  SCase = sIn

End Function

Leave a Comment

(required) 
(required) 
(optional)
(required)