zVarchar example
As mentioned in a previous entry, I decided to try to build a compressed text data type as an example of a UDT. So here it is. I built a datatype called zVarchar. To keep things as a reasonable demo, I didn't incorporate complex compression, just a simple run length encoding scheme and built it in VB so most should be able to read it. But it's interesting. If you want to try it, start a new SQL Server project in VS2005, add a User Defined Type and include the code below. After building and deploying it, you can do the following:
CREATE TABLE GregTest(
RecID int IDENTITY(1,1),
TextValue zVarchar
)
I then tried storing values in it. On small varchar values, the overhead of the UDT meant the data was actually larger than a varchar. But, I tried some examples with large varchar values (around 1000 characters per row and about 40% compressible) and got good results. My table using the zVarchar data type was around 50% of the size of a table using varchar, with 10,000 rows in the table. To keep it interesting, I added some additional functions to allow access to the compressed value, the length (both compressed and uncompressed), etc.
You can then execute queries like:
SELECT TextValue.CompressedLength() FROM GregTest
SELECT AVG(TextValue.CompressionPercentage()) FROM GregTest
Enjoy!
Imports System
Imports System.Data.Sql
Imports System.Data.SqlTypes
Imports System.Text
<Serializable()> _
<SqlUserDefinedType(Format.SerializedDataWithMetadata, MaxByteSize:=8000)> _
Public Class zVarchar
Implements INullable
Private ValueIsNull As Boolean = True
Private CurrrentValue As String = ""
Public Overrides Function ToString() As String
If Me.ValueIsNull Then
Return "NULL"
Else
Return Decompress(Me.CurrrentValue)
End If
End Function
Public ReadOnly Property IsNull() As Boolean Implements INullable.IsNull
Get
Return Me.ValueIsNull
End Get
End Property
Public Function CompressedLength() As Integer
If Me.ValueIsNull Then
Return 0
Else
Return Me.CurrrentValue.Length
End If
End Function
Public Function CompressedValue() As String
If Me.ValueIsNull Then
Return "NULL"
Else
Return Me.CurrrentValue
End If
End Function
Public Function DeCompressedLength() As Integer
If Me.ValueIsNull Then
Return 0
Else
Return zVarchar.Decompress(Me.CurrrentValue).Length
End If
End Function
Public Function CompressionPercentage() As Decimal
If Me.ValueIsNull Then
Return 0
Else
Dim compressedLength As Integer
Dim decompressedLength As Integer
compressedLength = Me.CurrrentValue.Length
decompressedLength = Me.DeCompressedLength
If decompressedLength = 0 Then
Return 0
Else
Return Decimal.Round(CType(compressedLength, Decimal) * 100 / CType(decompressedLength, Decimal), 2)
End If
End If
End Function
Public Shared ReadOnly Property Null() As zVarchar
Get
Dim newNullValue As zVarchar = New zVarchar
Return newNullValue
End Get
End Property
Public Const EscapeCharacter As Char = "\"c
Public Const MaximumRepeatValue As Integer = 250
Public Shared Function Parse(ByVal s As SqlString) As zVarchar
If s.IsNull Or s.Value.ToLower().Equals("null") Then
Dim newNullValue As New zVarchar
Return newNullValue
Else
Dim zValue As New StringBuilder
Dim objectToReturn As New zVarchar
objectToReturn.ValueIsNull = False
objectToReturn.CurrrentValue = zVarchar.Compress(s.ToString())
Return objectToReturn
End If
End Function
Public Shared Function Compress(ByVal ValueToCompress As String) As String
Dim compressedValue As New StringBuilder
Dim lastValue As Char
Dim valueCount As Integer = 0
Dim currentValue As Char
Dim characterCounter As Integer
For characterCounter = 0 To (ValueToCompress.Length - 1)
currentValue = CType(ValueToCompress.Substring(characterCounter, 1), Char)
If currentValue = zVarchar.EscapeCharacter Then
If valueCount > 0 Then ' we have something to output first
If valueCount < 3 Then
compressedValue.Append(lastValue, valueCount)
Else
compressedValue.Append(zVarchar.EscapeCharacter)
compressedValue.Append(lastValue)
compressedValue.Append(Chr(valueCount))
End If
valueCount = 0
End If
compressedValue.Append(zVarchar.EscapeCharacter, 2)
compressedValue.Append(Chr(1))
Else ' if we don't have an escape character
If valueCount = 0 Then
lastValue = currentValue
valueCount = 1
Else ' could be a repeated character
If lastValue = currentValue Then
If valueCount > zVarchar.MaximumRepeatValue Then ' needs to still fit in a byte
compressedValue.Append(zVarchar.EscapeCharacter)
compressedValue.Append(lastValue)
compressedValue.Append(Chr(valueCount))
valueCount = 0
End If
valueCount += 1
Else ' different character
' output the cached value first
If valueCount < 3 Then
compressedValue.Append(lastValue, valueCount)
Else
compressedValue.Append(zVarchar.EscapeCharacter)
compressedValue.Append(lastValue)
compressedValue.Append(Chr(valueCount))
End If
lastValue = currentValue
valueCount = 1
End If
End If
End If
Next characterCounter
If valueCount > 0 Then ' we have something cached still to output
If valueCount < 3 Then
compressedValue.Append(lastValue, valueCount)
Else
compressedValue.Append(zVarchar.EscapeCharacter)
compressedValue.Append(lastValue)
compressedValue.Append(Chr(valueCount))
End If
End If
Return compressedValue.ToString()
End Function
Public Shared Function Decompress(ByVal ValueToDecompress As String) As String
Dim decompressedValue As New StringBuilder
Dim characterCounter As Integer = 0
Do While characterCounter < ValueToDecompress.Length
If CType(ValueToDecompress.Substring(characterCounter, 1), Char) = zVarchar.EscapeCharacter Then
decompressedValue.Append(CType(ValueToDecompress.Substring(characterCounter + 1, 1), Char), Asc(ValueToDecompress.Substring(characterCounter + 2, 1)))
characterCounter += 2
Else
decompressedValue.Append(ValueToDecompress.Substring(characterCounter, 1))
End If
characterCounter += 1
Loop
Return decompressedValue.ToString()
End Function
End Class