Counting Words in a String Using Anonymous Types
Posted Mon, Aug 17 2009 11:10 by Deborah Kurata
This may be more of a homework assignment for a programming class than something you would do in your applications, but it is a good example of using anonymous types, which are new in .NET 3.5 in both VB and C#.
[To begin with an overview of anonymous types, start here.]
NOTE: Be sure to import the System.Text.RegularExpressions namespace. The code also uses System.Linq (for Distinct and Count) and System.Diagnostics (for Debug).
In C#:
// Sample string
string sampleText = @"That that is, is.
That that is not, it not.
Is that it? It is.";
// Convert to lower case and collapse each run of white space into a single space
sampleText = sampleText.ToLower();
sampleText = Regex.Replace(sampleText, @"\s+", " ");
string[] separators = new string[4] {" ", ".", ",", "?"};
string[] wordArray = sampleText.Split(separators,
    StringSplitOptions.RemoveEmptyEntries);
// Sort the result
Array.Sort(wordArray);
// Using an anonymous type
var query = from string w in wordArray.Distinct()
            select new {Word = w,
                        Count = wordArray.Count(wordToCount => wordToCount == w)};
foreach (var item in query)
    Debug.WriteLine(item.Word + ": " + item.Count);
In VB:
NOTE: For the VB code, be sure to also set a reference to System.Xml, System.Xml.Linq, and System.Core.
' Sample string
Dim sampleText As String = <string>That that is, is.
That that is not, it not.
Is that it? It is.</string>.Value
' Convert to lower case and collapse each run of white space into a single space
sampleText = sampleText.ToLower
sampleText = Regex.Replace(sampleText, "\s+", " ")
Dim separators() As String = {" ", ".", ",", "?"}
Dim wordArray() As String = sampleText.Split(separators, _
    StringSplitOptions.RemoveEmptyEntries)
' Sort the result
Array.Sort(wordArray)
' Using an anonymous type
Dim query = From w As String In wordArray.Distinct _
            Select New With {.Word = w, _
                .Count = wordArray.Count(Function(wordToCount) wordToCount = w)}
For Each item In query
    Debug.WriteLine(item.Word & ": " & item.Count)
Next
This code first builds a sample string. (Anyone recognize what movie this string came from?)
The C# code uses a verbatim string literal (@), which lets the string span multiple lines without escape sequences. In VB, the code uses the XML literals feature, new in .NET 3.5, to build the multi-line sample string.
The code converts the string to lower case so that the word count counts “That” and “that” as the same word. It then replaces each run of spaces, linefeeds, and other white-space characters with a single space.
It uses the string Split method to convert the string to an array of words and then sorts the words. If your string includes other punctuation marks, you will need to add them to the separators array.
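As an aside on the separators array: if listing every punctuation mark becomes tedious, one alternative is to split on any run of non-word characters with Regex.Split. The following is a minimal sketch, not part of the original code. The SplitSketch class and SplitIntoWords helper are hypothetical names, and Console.WriteLine stands in for Debug.WriteLine so the output is visible in any build.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitSketch
{
    // Hypothetical helper (not in the original code): converts the text to
    // lower case and splits it on any run of non-word characters.
    public static string[] SplitIntoWords(string text)
    {
        // \W+ matches one or more characters that are not word characters
        // (letters, digits, or underscores).
        string[] parts = Regex.Split(text.ToLower(), @"\W+");

        // Regex.Split can return empty strings at the start or end of the
        // array when the text begins or ends with a separator; drop them.
        return Array.FindAll(parts, p => p.Length > 0);
    }

    static void Main()
    {
        string[] words = SplitIntoWords("Is that it? It is; really!");
        Console.WriteLine(string.Join(" ", words));
    }
}
```

One trade-off of this approach: an apostrophe is also a non-word character, so a contraction such as "don't" would be split into two pieces. The explicit separators array gives you finer control over that.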
The code uses LINQ to find the unique set of words and their counts. The Distinct method reduces the list of words to its unique entries, so each word appears only once in the results.
The select new syntax defines an anonymous type composed of the word itself and its count. You can define any desired properties of an anonymous type by adding them within the { }, separated by commas. In this example, two properties are defined: Word and Count. The Word property is the unique word. The Count property is the number of times that word occurs in the list. The Count property uses a lambda expression to count the words.
Each item of the anonymous type is then displayed to the Debug window as follows:
is: 5
it: 3
not: 2
that: 5
This lists the word and the number of times it occurs in the string.
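To illustrate the point about adding properties within the { }, here is a minimal sketch that extends the same query shape with a third property. The Length property, the AnonymousTypeSketch class, and the Describe helper are illustrative additions, not part of the original example, and Console.WriteLine is used in place of Debug.WriteLine.

```csharp
using System;
using System.Linq;

static class AnonymousTypeSketch
{
    // Hypothetical helper: builds one display line per unique word.
    public static string[] Describe(string[] wordArray)
    {
        // Same query shape as the original, with a third property
        // (Length) added inside the braces of the anonymous type.
        var query = from w in wordArray.Distinct()
                    select new
                    {
                        Word = w,
                        Count = wordArray.Count(wordToCount => wordToCount == w),
                        Length = w.Length
                    };

        // The anonymous type is only usable inside this method, so the
        // items are converted to strings before being returned.
        return query
            .Select(item => item.Word + ": " + item.Count + " (" + item.Length + " letters)")
            .ToArray();
    }

    static void Main()
    {
        string[] wordArray = { "is", "is", "it", "not", "that", "that", "that" };
        foreach (string line in Describe(wordArray))
            Console.WriteLine(line);
    }
}
```

Because an anonymous type has no name you can write in a method signature, it works best for results that are consumed in the same method that creates them, as in the original example.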
Enjoy!
P.S. (Edited 8/19/09) Though it does not demonstrate anonymous types, Eric Smith provided a *very* concise technique for counting words in a string using regular expressions and lambda expressions (see Comments below). I updated the code slightly to include the OrderBy and I provided the VB version of the code:
In C#:
foreach (var g in Regex.Matches(sampleText.ToLower(), @"\w+")
                       .Cast<Match>()
                       .GroupBy(m => m.Value)
                       .OrderBy(m => m.Key))
    Debug.WriteLine(g.Key + ": " + g.Count());
In VB:
For Each g In Regex.Matches(sampleText.ToLower(), "\w+") _
              .Cast(Of Match)() _
              .GroupBy(Function(m) m.Value) _
              .OrderBy(Function(m) m.Key)
    Debug.WriteLine(g.Key & ": " & g.Count())
Next
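The grouping technique can also be combined with an anonymous type, if you prefer named Word and Count properties over g.Key and g.Count(). The following C# version is a sketch under that assumption. The GroupBySketch class and CountWords helper are hypothetical names, and Console.WriteLine replaces Debug.WriteLine.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class GroupBySketch
{
    // Hypothetical helper: returns one "word: count" line per unique word.
    public static string[] CountWords(string text)
    {
        var query = Regex.Matches(text.ToLower(), @"\w+")
                         .Cast<Match>()
                         .GroupBy(m => m.Value)   // one group per unique word
                         .OrderBy(g => g.Key)     // sort the groups by word
                         // Project each group into an anonymous type
                         .Select(g => new { Word = g.Key, Count = g.Count() });

        return query.Select(item => item.Word + ": " + item.Count).ToArray();
    }

    static void Main()
    {
        foreach (string line in CountWords("That that is, is. Is that it? It is."))
            Console.WriteLine(line);
    }
}
```

Grouping visits the list of matches once, whereas the Distinct and Count combination rescans the whole array for every unique word. For a short sample string the difference does not matter, but it may for longer text.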