Imagine that we have a table Cars with a PK Id (nvarchar(8)) based on the following regex pattern [1-9][0-9]{1,2}[A-Z]{2,3}[0-9]{2}. The last two numbers are a region code.
The increasement is done like this: 10AA[region], 11AA[region], …, 999AA[region], 10AB[region], …, 999ZZ[region], 10AAA[region], …, 999ZZZ[region]
This table can be huge.
In a first time, we will write a LINQ To Entities query to get regions with 3 letters:
var q1 = (from c in context.Cars
where c.Id.Contains("AAA")
select c.Id.Substring(c.Id.Length - 2)).Distinct();
The SQL generated query is the following:
SELECT
[Distinct1].[C1] AS [C1]
FROM ( SELECT DISTINCT
SUBSTRING([Extent1].[ Id ], ((LEN([Extent1].[ Id ])) - 2) + 1, (LEN([Extent1].[ Id ])) - ((LEN([Extent1].[ Id ])) - 2)) AS [C1]
FROM [dbo].[Cars] AS [Extent1]
WHERE [Extent1].[ Id ] LIKE N'%AAA%'
) AS [Distinct1] If I write it myself, I would probably write this:
SELECT DISTINCT
RIGHT(Id, 2) AS [C1]
FROM [Cars]
WHERE [ Id ] LIKE N'%AAA%'
Note that the execution plan is the same!
Imagine that the DBA wants that I use the RIGHT function. With EF4, we can do it.
The Right doesn’t exist in C# but it does in ESQL. So we will use CSDL Functions.
In the CSDL, we will add the following code:
<Function Name="GetRegion" ReturnType="String">
<Parameter Name="car" Type="Self.Car" />
<DefiningExpression>
Right(car.Id, 2)
</DefiningExpression>
</Function>
Then,we will add an extension method to be able to use it in our LINQ To Entities queries:
public static class CarExtension
{
[EdmFunction("CarsModel", "GetRegion")]
public static string GetRegion(this Car car)
{
throw new NotImplementedException("Only used by LINQ To Entities");
}
}
I now can write my query like this:
var q1 = (from c in context.Cars
where c.Id.Contains("AAA")
select c.GetRegion()).Distinct();
This LINQ query is translated as the following SQL query:
SELECT
[Distinct1].[C1] AS [C1]
FROM ( SELECT DISTINCT
RIGHT([Extent1].[ Id ], 2) AS [C1]
FROM [dbo].[Cars] AS [Extent1]
WHERE [Extent1].[ Id ] LIKE N'%AAA%'
) AS [Distinct1]
Cool!
Now I want to get the last Id per region. I don’t think that there is a clean way to do it with SQL. Indeed, I don’t think that we can use Regex in SQL Server and we need it to determine which is the last id.
So to do this, I will use a LINQ To Object query. The “normal” way can be to write this:
var qL2E = from c in context.Cars
orderby c.GetRegion()
select c.Id;
var qL2O = from id in qL2E.AsEnumerable()
group id by id.Substring(id.Length - 2) into g
select new
{
Region = g.Key,
Id = (from id in g
let letters = Regex.Match(id, "[A-Z]{2,3}")
orderby letters.Length descending, letters descending, Regex.Match(id, "^[0-9]{2,3}") descending
select id).FirstOrDefault()
}.ToDictionary(id => id.Region, id => id.Id); However, with this way, we will probably have an OutOfMemoryException. Indeed, with this query, all the Cars records are loaded in memory and we can have hundreds millions of rows.
In a first time, we will try a dichotomous approach.
var ids = (from c in context.Cars
select c.GetRegion()).Distinct().AsEnumerable().ToDictionary(region => region, region => GetLastId(region, "", (new[] { "" }.Union(Enumerable.Range('A', 26).Select(i => char.ConvertFromUtf32(i)))), 0));
private static string GetLastId(string region, string letters, IEnumerable<string> possibleChars, int index)
{
if (!possibleChars.Skip(1).Any()) // Count() == 1
{
letters += possibleChars.First();
if (++index == 3)
{
using (var context = new CarsContainer())
{
return
(from c in context.Cars
where c.Id.EndsWith(letters + region)
orderby c.Id.IndexOf(letters) descending, c.Id descending
select c.Id).FirstOrDefault();
}
}
return GetLastId(region, letters, Enumerable.Range('A', 26).Select(i => char.ConvertFromUtf32(i)), index);
}
else
{
var possibleCharsList = possibleChars.ToList();
int middle = possibleCharsList.Count / 2;
using (var context = new CarsContainer())
{
string idBeginning = string.Format("10{0}{1}{2}", letters, possibleCharsList[middle], "AA".Substring(index));
if ((from c in context.Cars
where c.Id.StartsWith(idBeginning) && c.Id.EndsWith(region)
select c.Id).Any())
return GetLastId(region, letters, possibleCharsList.Skip(middle), index);
return GetLastId(region, letters, possibleCharsList.Take(middle), index);
}
}
}
The execution of this query is very slow (26 minutes and 18 seconds in my test with 562 012 347 records).
So what can we do?
The CPU is used by the DB at 100% and the slowly is only because of the DB. So it useless to imagine to parallelize our code.
We can imagine to resize the Data Base server but it is not the goal of this post.
In our case, the DBA (I specify that I am not a DBA) will probably want to change the Cars table schema in order to split into 3 columns the Id. However, in a lot of cases, it can be interesting to keep the Id column, which keeps the PK. Indeed, imagine that we have some existing applications. The idea is to add these columns with no impact on them.
So we will have four columns (all not nullable) :
To set them, we will use CLR functions:
public class CarsFunction
{
[SqlFunction]
public static SqlInt16 GetNumber(SqlString id)
{
return new SqlInt16(short.Parse(Regex.Match(id.Value, "^[0-9]{2,3}").Value));
}
[SqlFunction]
public static SqlString GetLetters(SqlString id)
{
return new SqlString(Regex.Match(id.Value, "[A-Z]{2,3}").Value);
}
[SqlFunction]
public static SqlInt16 GetRegion(SqlString id)
{
return new SqlInt16(short.Parse(Regex.Match(id.Value, "[0-9]{2}$").Value));
}
}
Then, in the DB, we will register the assembly and we will create some SQL functions from previous ones:
CREATE ASSEMBLY CarsFunctionsAssembly
FROM 'D:\documents\visual studio 2010\Projects\CarsFunctions\CarsFunctions\bin\Debug\CarsFunctions.dll'
GO
CREATE FUNCTION GetNumber(@id AS nvarchar(8)) RETURNS smallint AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetNumber
GO
CREATE FUNCTION GetLetters(@id AS nvarchar(8)) RETURNS nvarchar(3) AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetLetters
GO
CREATE FUNCTION GetRegion(@id AS nvarchar(8)) RETURNS smallint AS EXTERNAL NAME CarsFunctionsAssembly.CarsFunction.GetRegion
Ok, we have the functions, so we can be back on the new columns creation. Instead of creating “normal” columns, we will use computed columns. We can do it with the SSMS designer by setting the Computed Column Specification Formula to dbo.GetNumber(Id) (resp dbo.GetLetters(Id), dbo.GetRegion(Id)). In our case, the idea isn’t to calculate them each time but to keep their values. So we will set the “Is Persisted” property to true.
But there is an issue: the function has to be deterministic. Our function is. However, because it is a CLR function, SQL Server can’t know it. So we will tell it explicitly.
To do this, we will change the CLR functions code like this:
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlInt16 GetNumber(SqlString id)
{
return new SqlInt16(short.Parse(Regex.Match(id.Value, "^[0-9]{2,3}").Value));
}
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString GetLetters(SqlString id)
{
return new SqlString(Regex.Match(id.Value, "[A-Z]{2,3}").Value);
}
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlInt16 GetRegion(SqlString id)
{
return new SqlInt16(short.Parse(Regex.Match(id.Value, "[0-9]{2}$").Value));
}
Now, we can persist the three columns.
Be careful, if we create an edmx from the DB, we will have a StoreGeneratedPattern="Computed" on the columns Number, Letters and Region. It implies that in each update, EF, we reload the properties Number, Letters et Region. In our case, the calculation is only done on the key. With Entity Framework, it is not possible to change the value of the key. So, it is useful to change the StoreGeneratedPattern from Computed to Identity.
Cool, we have fixed the INSERT issue.
Let’s now concentrate on the goal: our query.
We don’t need a dichotomous approach, a simple LINQ To Entities query is enough:
var ids = (from c in context.Cars
group c by c.Region into g
select new
{
Region = g.Key,
LastId = (from c in g
orderby c.Letters.Length descending, c.Letters descending, c.Number descending
select c.Id).FirstOrDefault()
}).AsEnumerable().ToDictionary(region => region.Region, region => region.LastId);
With this, our query runs in 1 minute and 42 seconds instead of 26 minutes and 18 seconds!
It’s better but it isn’t finished yet.
In the DB, we will set the Cars table cluster on the Region column (instead of the Id column).
In this case, the query execution falls to 12 seconds!
However, be careful! It’s really great in our sample but in the real life, don’t forget to study all the DB query on the table Cars before changing the cluster.
So the DBA role is important in the success of the project.
Now, imagine that we want to keep in cache the last id per region. To do it, we can use a static Dictionary. But there are two issues:
- Imagine a N-Tiers scenario. If our application is deployed on more than one frontal server, it’s a shame to have to initialize the cache for each of them. Moreover, we have to maintain each of them with the last values
- If another application adds or removes a car in the DB, how to maintain our cache with the last values?
To fix the first issue, we will Velocity. For the second, we will use SQLDependency.
Velocity is a distributed cache which allows a lot of very interesting things like load balancing. This distributed cache will be shared by all the frontal servers. It means that we won”t have to manage one cache per server.
What do we have to do to use Velocity? First point: download it and install it.
When the installation is done, we will add the following references in our project:
- CacheBaseLibrary.dll
- CASBase.dll
- CASMain.dll
- ClientLibrary.dll
- FabricCommon.dll
Then, we have to start the Velocity cluster. In this CTP, all administration commands are written with the console “Administration Tool - Microsoft Distributed Cache”. Then we write the following command line:
start-cachecluster
Then, in the config file, we wiil add these lines:
<configSections>
<section name="dataCacheClient" type="Microsoft.Data.Caching.DataCacheClientSection, CacheBaseLibrary" allowLocation="true" allowDefinition="Everywhere"/>
</configSections>
<dataCacheClient deployment="routing">
<localCache isEnabled="true" sync="TTLBased" ttlValue="60000"/>
<hosts>
<host name="MATTHIEU-PRO" cachePort="22233" cacheHostName="DistributedCacheService"/>
<host name="MATTHIEU-LAPTOP1" cachePort="22233" cacheHostName="DistributedCacheService"/>
<host name="MATTHIEU-LAPTOP2" cachePort="22233" cacheHostName="DistributedCacheService"/>
<host name="MATTHIEU-PC1" cachePort="22233" cacheHostName="DistributedCacheService"/>
<host name="MATTHIEU-PC2" cachePort="22233" cacheHostName="DistributedCacheService"/>
</hosts>
</dataCacheClient>
Now we can imagine to query the cache in our application (server tier).
return new DataCacheFactory().GetDefaultCache().GetObjectsInRegion("LastImmatPerRegion").ToDictionary(keyValuePair => short.Parse(keyValuePair.Key), keyValuePair => (string)keyValuePair.Value);
With the data in the cache, the query executes in 30 ms!
Now, we have to fix the last point: initializing the cache and maintaining it.
To initialize it, it’s very easy. We just have to use the above query:
var cache = new DataCacheFactory().GetDefaultCache();
try
{
cache.RemoveRegion(LAST_IMMAT_PER_REGION);
}
catch
{
} cache.CreateRegion(LAST_IMMAT_PER_REGION, false);
foreach (var car in from c in context.Cars
group c by c.Region into g
select new
{
Region = g.Key,
LastId = (from c in g
orderby c.Letters.Length descending, c.Letters descending, c.Number descending
select c.Id).FirstOrDefault()
})
cache.Put(car.Region.ToString(), car.LastId, LAST_IMMAT_PER_REGION); Now to maintain the cache with the last values, it’s harder. My idea is to use a SQLDependency. The issue is the fact that we can’t, a priori, get the modifications with SQLDependency and, of course, I don’t want to refresh all my cache. So my idea is to add a Trigger on the Cars table which will fill a new table (for updates).
To begin, we will create this new table: CarsModificationsTmp with six columns:
- TmpId (int Identity(1,1), PK),
- CarId (nvarchar(8), not nullable)
- Number (smallint, not nullable)
- Letters (nvarchar(3), not nullable)
- Region (smallint, not nullable),
- Deleted (bit, not nullable, default=0)
Then, we have to define triggers on Cars table:
CREATE TRIGGER CarsInserted
ON Cars
FOR INSERT
AS
BEGIN
DECLARE @Id AS nvarchar(8)
DECLARE @Number AS smallint
DECLARE @Letters AS nvarchar(3)
DECLARE @Region AS smallint
SELECT @Id = Id, @Number = Number, @Letters = Letters, @REGION = Region FROM Inserted
INSERT INTO CarsModificationsTmp(CarId, Number, Letters, Region) VALUES(@Id, @Number, @Letters, @Region)
END
GO
CREATE TRIGGER CarsDeleted
ON Cars
FOR DELETE
AS
BEGIN
DECLARE @Region AS smallint
DECLARE @Id AS nvarchar(8)
DECLARE @Number AS smallint
DECLARE @Letters