March 2009 - Posts

Memory Leak with delegates and workflow foundation

Recently after Load Testing my open source project Dropthings, I encountered a lot of memory leak. I found lots of Workflow Instances and Linq Entities were left in memory and never collected. After profiling the web application using .NET Memory Profiler, it showed the real picture:

image

It shows you that instances of the several types are being created but not being removed. You see the “New” column has positive value, but the “Remove” column has 0. That means new instances are being created, but not removed. Basically the way you do Memory Profiling is, you take two snapshots. Say you take one snapshot when you first visit your website. Then you do some action on the website that results in allocation of objects. Then you take another snapshot. When you compare both snapshots, you can see how many instances of classes were created between these two snapshots and how many were removed. If they are not equal, then you have leak. Generally in web application many objects are created on every page hit and the end of the request, all those objects are supposed to be released. If they are not released, then we have a problem. But that’s the scenario for desktop applications because in a desktop application, objects can remain in memory until app is closed. But you should know best from the code which objects were supposed to go out of scope and get released.

For beginners, leak means objects are being allocated but not being freed because someone is holding reference to the objects. When objects leak, they remain in memory forever, until the process (or app domain) is closed. So, if you have a leaky website, your website is continuously taking up memory until it runs out of memory on the web server and thus crash. So, memory leak is a bad – it prevents you from running your product for long duration and requires frequent restart of app pool.

So, the above screenshot shows Workflow and Linq related classes are not being removed, and thus leaking. This means somewhere workflow instances are not being released and thus all workflow related objects are remaining. You can see the number is same 48 for all workflow related objects. This is a good indication that, almost every instance of workflow is leaked because there were total 48 workflows created and ran. Moreover it indicates we have a leak from a top Workflow instance level, not in some specific Activity or somewhere deep in the code.

As the workflows use Linq stuff, they held reference to the Linq stuffs and thus the Linq stuffs leaked as well. Sometimes you might be looking for why A is leaking. But you actually end up finding that since B was holding reference to A and B was leaking and thus A was leaking as well. This is sometimes tricky to figure out and you spend a lot of time looking at the wrong direction.

Now let me show you the buggy code:

ManualWorkflowSchedulerService manualScheduler = 
workflowRuntime.GetService<ManualWorkflowSchedulerService>(); WorkflowInstance instance = workflowRuntime.CreateWorkflow(workflowType, properties); instance.Start(); EventHandler<WorkflowCompletedEventArgs> completedHandler = null; completedHandler = delegate(object o, WorkflowCompletedEventArgs e) { if (e.WorkflowInstance.InstanceId == instance.InstanceId) // 1. instance { workflowRuntime.WorkflowCompleted -= completedHandler; // 2. terminatedhandler // copy the output parameters in the specified properties dictionary Dictionary<string,object>.Enumerator enumerator =
e.OutputParameters.GetEnumerator(); while( enumerator.MoveNext() ) { KeyValuePair<string,object> pair = enumerator.Current; if( properties.ContainsKey(pair.Key) ) { properties[pair.Key] = pair.Value; } } } }; Exception x = null; EventHandler<WorkflowTerminatedEventArgs> terminatedHandler = null; terminatedHandler = delegate(object o, WorkflowTerminatedEventArgs e) { if (e.WorkflowInstance.InstanceId == instance.InstanceId) // 3. instance { workflowRuntime.WorkflowTerminated -= terminatedHandler; // 4. completeHandler Debug.WriteLine( e.Exception ); x = e.Exception; } }; workflowRuntime.WorkflowCompleted += completedHandler; workflowRuntime.WorkflowTerminated += terminatedHandler; manualScheduler.RunWorkflow(instance.InstanceId);

Can you spot the code where it leaked?

I have numbered the lines in comment where the leak is happening. Here the delegate is acting like a closure and those who are from Javascript background know closure is evil. They leak memory unless very carefully written. Here the delegate keeps a reference to the instance object. So, if somehow delegate is not released, the instance will remain in memory forever and thus leak. Now can you find a situation when the delegate will not be released?

Say the workflow completed. It will fire the completeHandler. But the completeHandler will not release the terminateHandler. Thus the terminateHandler remains in memory and it also holds reference to the instance. So, we have a leaky delegate leaking whatever it is holding onto outside it’s scope. Here the only thing outside the scope if the instance, which it is tried to access from the parent function.

Since the workflow instance is not released, all the properties the workflow and all the activities inside it are holding onto remains in memory. Most of the workflows and activities expose public properties which are Linq Entities. Thus the Linq Entities remain in memory. Now Linq Entities keep a reference to the DataContext from where it is produced. Thus we have DataContext remaining in memory. Moreover, DataContext keeps reference to many internal objects and metadata cacahe, so they remain in memory as well.

So, the correct code is:

ManualWorkflowSchedulerService manualScheduler = 
workflowRuntime.GetService<ManualWorkflowSchedulerService>(); WorkflowInstance instance = workflowRuntime.CreateWorkflow(workflowType, properties); instance.Start(); var instanceId = instance.InstanceId; EventHandler<WorkflowCompletedEventArgs> completedHandler = null; completedHandler = delegate(object o, WorkflowCompletedEventArgs e) { if (e.WorkflowInstance.InstanceId == instanceId) // 1. instanceId is a Guid { // copy the output parameters in the specified properties dictionary Dictionary<string,object>.Enumerator enumerator =
e.OutputParameters.GetEnumerator(); while( enumerator.MoveNext() ) { KeyValuePair<string,object> pair = enumerator.Current; if( properties.ContainsKey(pair.Key) ) { properties[pair.Key] = pair.Value; } } } }; Exception x = null; EventHandler<WorkflowTerminatedEventArgs> terminatedHandler = null; terminatedHandler = delegate(object o, WorkflowTerminatedEventArgs e) { if (e.WorkflowInstance.InstanceId == instanceId) // 2. instanceId is a Guid { x = e.Exception; Debug.WriteLine(e.Exception); } }; workflowRuntime.WorkflowCompleted += completedHandler; workflowRuntime.WorkflowTerminated += terminatedHandler; manualScheduler.RunWorkflow(instance.InstanceId); // 3. Both delegates are now released
workflowRuntime.WorkflowTerminated -= terminatedHandler; workflowRuntime.WorkflowCompleted -= completedHandler;

There are two changes – in both delegates, the instanceId variable is passed, instead of the instance. Since instanceId is a Guid, which is a struct type data type, not a class, there’s no issue of referencing. Structs are copied, not referenced. So, they don’t leak memory. Secondly, both delegates are released at the end of the workflow execution, thus releasing both references.

In Dropthings, I am using the famous CallWorkflow Activity by John Flanders, which is widely used to execute one Workflow from another synchronously. There’s a CallWorkflowService class which is responsible for synchronously executing another workflow and that has similar memory leak problem. The original code of the service is as following:

public class CallWorkflowService : WorkflowRuntimeService
{
    #region Methods

    public void StartWorkflow(Type workflowType,Dictionary<string,object> inparms, 
Guid caller,IComparable qn) { WorkflowRuntime wr = this.Runtime; WorkflowInstance wi = wr.CreateWorkflow(workflowType,inparms); wi.Start(); ManualWorkflowSchedulerService ss =
wr.GetService<ManualWorkflowSchedulerService>(); if (ss != null) ss.RunWorkflow(wi.InstanceId); EventHandler<WorkflowCompletedEventArgs> d = null; d = delegate(object o, WorkflowCompletedEventArgs e) { if (e.WorkflowInstance.InstanceId ==wi.InstanceId) { wr.WorkflowCompleted -= d; WorkflowInstance c = wr.GetWorkflow(caller); c.EnqueueItem(qn, e.OutputParameters, null, null); } }; EventHandler<WorkflowTerminatedEventArgs> te = null; te = delegate(object o, WorkflowTerminatedEventArgs e) { if (e.WorkflowInstance.InstanceId == wi.InstanceId) { wr.WorkflowTerminated -= te; WorkflowInstance c = wr.GetWorkflow(caller); c.EnqueueItem(qn, new Exception("Called Workflow Terminated",
e.Exception), null, null); } }; wr.WorkflowCompleted += d; wr.WorkflowTerminated += te; } #endregion Methods }

As you see, it has that same delegate holding reference to instance object problem. Moreover, there’s some queue stuff there, which requires the caller and qn parameter passed to the StartWorkflow function. So, not a straight forward fix.

I tried to rewrite the whole CallWorkflowService so that it does not require two delegates to be created per Workflow. Then I took the delegates out. Thus there’s no chance of closure holding reference to unwanted objects. The result looks like this:

public class CallWorkflowService : WorkflowRuntimeService
{
    #region Fields

    private EventHandler<WorkflowCompletedEventArgs> _CompletedHandler = null;
    private EventHandler<WorkflowTerminatedEventArgs> _TerminatedHandler = null;
    private Dictionary<Guid, WorkflowInfo> _WorkflowQueue = 
new Dictionary<Guid, WorkflowInfo>(); #endregion Fields #region Methods public void StartWorkflow(Type workflowType,Dictionary<string,object> inparms,
Guid caller,IComparable qn) { WorkflowRuntime wr = this.Runtime; WorkflowInstance wi = wr.CreateWorkflow(workflowType,inparms); wi.Start(); var instanceId = wi.InstanceId; _WorkflowQueue[instanceId] = new WorkflowInfo { Caller = caller, qn = qn }; ManualWorkflowSchedulerService ss =
wr.GetService<ManualWorkflowSchedulerService>(); if (ss != null) ss.RunWorkflow(wi.InstanceId); } protected override void OnStarted() { base.OnStarted(); if (null == _CompletedHandler) { _CompletedHandler = delegate(object o, WorkflowCompletedEventArgs e) { var instanceId = e.WorkflowInstance.InstanceId; if (_WorkflowQueue.ContainsKey(instanceId)) { WorkflowInfo wf = _WorkflowQueue[instanceId]; WorkflowInstance c = this.Runtime.GetWorkflow(wf.Caller); c.EnqueueItem(wf.qn, e.OutputParameters, null, null); _WorkflowQueue.Remove(instanceId); } }; this.Runtime.WorkflowCompleted += _CompletedHandler; } if (null == _TerminatedHandler) { _TerminatedHandler = delegate(object o, WorkflowTerminatedEventArgs e) { var instanceId = e.WorkflowInstance.InstanceId; if (_WorkflowQueue.ContainsKey(instanceId)) { WorkflowInfo wf = _WorkflowQueue[instanceId]; WorkflowInstance c = this.Runtime.GetWorkflow(wf.Caller); c.EnqueueItem(wf.qn,
new Exception("Called Workflow Terminated", e.Exception),
null, null); _WorkflowQueue.Remove(instanceId); } }; this.Runtime.WorkflowTerminated += _TerminatedHandler; } } protected override void OnStopped() { _WorkflowQueue.Clear(); base.OnStopped(); } #endregion Methods #region Nested Types private struct WorkflowInfo { #region Fields public Guid Caller; public IComparable qn; #endregion Fields } #endregion Nested Types }

After fixing the problem, another Memory Profile result showed the leak is gone:

image

As you see, the numbers vary, which means there’s no consistent leak. Moreover, looking at the types that remains in memory, they look more like metadata than instances of classes.  So, they are basically cached instances of metadata, not instances allocated during workflow execution which are supposed to be freed. So, we solved the memory leak!

Now you know how to write anonymous delegates without leaking memory and how to run workflow without leaking them. Basically, the principle theory is – if you are referencing some outside object from an anonymous delegate, make sure that object is not holding reference to the delegate in some way, may be directly or may be via some child objects of its own. Because then you have a circular reference. If possible, do not try to access objects e.g. instance inside an anonymous delegate that is declared outside the delegate. Try accessing instrinsic data types like int, string, DateTime, Guid etc which are not reference type variables. So, instead of referencing to an object, you should declare local variables e.g. instanceId that gets the value of properties (e.g. instance.InstanceId) from the object and then use those local variables inside the anonymous delegate.

DotNetKicks Image
Posted by omar with 1 comment(s)
Filed under: , , ,

Optimize ASP.NET Membership Stored Procedures for greater speed and scalability

Last year at Pageflakes, when we were getting millions of hits per day, we were having query timeout due to lock timeout and Transaction Deadlock errors. These locks were produced from aspnet_Users and aspnet_Membership tables. Since both of these tables are very high read (almost every request causes a read on these tables) and high write (every anonymous visit creates a row on aspnet_Users), there were just way too many locks created on these tables per second. SQL Counters showed thousands of locks per second being created. Moreover, we had queries that would select thousands of rows from these tables frequently and thus produced more locks for longer period, forcing other queries to timeout and thus throw errors on the website.

If you have read my last blog post, you know why such locks happen. Basically every table when it grows up to hold millions of records and becomes popular goes through this trouble. It’s just a part of scalability problem that is common to database. But we rarely take prevention about it in our early design.

The solution is simple, you should either have WITH (NOLOCK) or SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before SELECT queries. Either of this will do. They tell SQL Server not to hold any lock on the table while it is reading the table. If some row is locked while the read is happening, it will just ignore that row. When you are reading a table thousand times per second, without these options, you are issuing lock on many places around the table thousand times per second. It not only makes read from table slower, but also so many lock prevents insert, update, delete from happening timely and thus queries timeout. If you have queries like “show the currently online users from last one hour based on LastActivityDate field”, that is going to issue such a wide lock that even other harmless select queries will timeout. And did I tell you that there’s no index on LastActivityDate on aspnet_Users table?

Now don’t blame yourself for not putting either of these options on your every stored proc and every dynamically generated SQL from the very first day. ASP.NET developers made the same mistake. You won’t see either of these used in any of the stored procs used by ASP.NET Membership. For example, the following stored proc gets called whenever you access Profile object:

ALTER PROCEDURE [dbo].[aspnet_Profile_GetProperties]
@ApplicationName
nvarchar(256),
@UserName nvarchar(256),
@CurrentTimeUtc datetime
AS
BEGIN

DECLARE
@ApplicationId uniqueidentifier
SELECT
@ApplicationId = NULL
SELECT
@ApplicationId = ApplicationId FROM
dbo.aspnet_Applications WHERE LOWER(@ApplicationName) = LoweredApplicationName
IF (@ApplicationId IS NULL)
RETURN

DECLARE
@UserId uniqueidentifier
DECLARE
@LastActivityDate datetime
SELECT
@UserId = NULL

SELECT
@UserId = UserId, @LastActivityDate = LastActivityDate
FROM dbo.aspnet_Users
WHERE ApplicationId = @ApplicationId AND LoweredUserName = LOWER(@UserName)

IF (@UserId IS NULL)
RETURN
SELECT TOP
1 PropertyNames, PropertyValuesString, PropertyValuesBinary
FROM dbo.aspnet_Profile
WHERE UserId = @UserId

IF (@@ROWCOUNT > 0)
BEGIN
UPDATE
dbo.aspnet_Users
SET LastActivityDate=@CurrentTimeUtc
WHERE UserId = @UserId
END
END

There are two SELECT operations that hold lock on two very high read tables – aspnet_Users and aspnet_Profile. Moreover, there’s a nasty UPDATE statement. It tries to update the LastActivityDate of a user whenever you access Profile object for the first time within a http request.

This stored proc alone is enough to bring your site down. It did to us because we are using Profile Provider everywhere. This stored proc was called around 300 times/sec. We were having nightmarish slow performance on the website and many lock timeouts and transaction deadlocks. So, we added the transaction isolation level and we also modified the UPDATE statement to only perform an update when the LastActivityDate is over an hour. So, this means, the same user’s LastActivityDate won’t be updated if the user hits the site within the same hour.

So, after the modifications, the stored proc looked like this:

ALTER PROCEDURE [dbo].[aspnet_Profile_GetProperties]
@ApplicationName
nvarchar(256),
@UserName nvarchar(256),
@CurrentTimeUtc datetime
AS
BEGIN
-- 1. Please no more locks during reads
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @ApplicationId uniqueidentifier
--SELECT @ApplicationId = NULL
--SELECT @ApplicationId = ApplicationId FROM dbo.aspnet_Applications
WHERE LOWER(@ApplicationName) = LoweredApplicationName
--IF (@ApplicationId IS NULL)
-- RETURN

-- 2. No more call to Application table. We have only one app dude!
SET @ApplicationId = dbo.udfGetAppId()

DECLARE @UserId uniqueidentifier
DECLARE
@LastActivityDate datetime
SELECT
@UserId = NULL

SELECT
@UserId = UserId, @LastActivityDate = LastActivityDate
FROM dbo.aspnet_Users
WHERE ApplicationId = @ApplicationId AND LoweredUserName = LOWER(@UserName)

IF (@UserId IS NULL)
RETURN
SELECT TOP
1 PropertyNames, PropertyValuesString, PropertyValuesBinary
FROM dbo.aspnet_Profile
WHERE UserId = @UserId

IF (@@ROWCOUNT > 0)
BEGIN
-- 3. Do not update the same user within an hour
IF DateDiff(n, @LastActivityDate, @CurrentTimeUtc) > 60
BEGIN
-- 4. Use ROWLOCK to lock only a row since we know this query
-- is highly selective
UPDATE dbo.aspnet_Users WITH(ROWLOCK)
SET LastActivityDate=@CurrentTimeUtc
WHERE UserId = @UserId
END
END
END

The changes I made are numbered and commented. No need for further explanation. The only tricky thing here is, I have eliminate call to Application table just to get the ApplicationID from ApplicationName. Since there’s only one application in a database (ever heard of multiple applications storing their user separately on the same database and the same table?), we don’t need to look up the ApplicationID on every call to every Membership stored proc. We can just get the ID and hard code it in a function.

CREATE FUNCTION dbo.udfGetAppId()
RETURNS uniqueidentifier
WITH EXECUTE AS
CALLER
AS
BEGIN
RETURN CONVERT
(uniqueidentifier, 'fd639154-299a-4a9d-b273-69dc28eb6388')
END;

This UDF returns the ApplicationID that I have hardcoded copying from the Application table. Thus it eliminates the need for quering on the Application table.

Similarly you should do the changes in all other stored procedures that belong to Membership Provider. All the stroc procs are missing proper locking, issues aggressive lock during update and too frequent updates than practical need. Most of them also try to resolve ApplicationID from ApplicationName, which is unnecessary when you have only one web application per database. Make these changes and enjoy lock contention free super performance from Membership Provider!

kick it on DotNetKicks.com

Posted by omar with 8 comment(s)

Linq to SQL solve Transaction deadlock and Query timeout problem using uncommitted reads

When your database tables start accumulating thousands of rows and many users start working on the same table concurrently, SELECT queries on the tables start producing lock contentions and transaction deadlocks. This is a common problem in any high volume website. As soon as you start getting several concurrent users hitting your website that results in SELECT queries on some large table like aspnet_users table that are also being updated very frequently, you end up having one of these errors:

Transaction (Process ID ##) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Or,

Timeout Expired. The Timeout Period Elapsed Prior To Completion Of The Operation Or The Server Is Not Responding.

The solution to these problems are – use proper index on the table and use transaction isolation level Read Uncommitted or WITH (NOLOCK) in your SELECT queries. So, if you had a query like this:

SELECT * FORM aspnet_users 
where ApplicationID =’xxx’ AND LoweredUserName = 'someuser'

You should end up having any of the above errors under high load. There are two ways to solve this:

SET TRANSACTION LEVEL READ UNCOMMITTED;
SELECT * FROM aspnet_Users 
WHERE ApplicationID =’xxx’ AND LoweredUserName = 'someuser'

Or use the WITH (NOLOCK):

SELECT * FROM aspnet_Users WITH (NOLOCK) 
WHERE ApplicationID =’xxx’ AND LoweredUserName = 'someuser'

The reason for the errors are that since aspnet_users is a high read and high write table, during read, the table is partially locked and during write, it is also locked. So, when the locks overlap on each other from several queries and especially when there’s a query that’s trying to read a large number of rows and thus locking large number of rows, some of the queries either timeout or produce deadlocks.

Linq to Sql does not produce queries with the WITH (NOLOCK) option nor does it use READ UNCOMMITTED. So, if you are using Linq to SQL queries, you are going to end up with any of these problems on production pretty soon when your site becomes highly popular.

For example, here’s a very simple query:

using (var db = new DropthingsDataContext())
{
    var user = db.aspnet_Users.First();
    var pages = user.Pages.ToList();
}

DropthingsDataContext is a DataContext built from Dropthings database.

When you attach SQL Profiler, you get this:

image

You see none of the queries have READ UNCOMMITTED or WITH (NOLOCK).

The fix is to do this:

using (var db = new DropthingsDataContext2())
{
    db.Connection.Open();
    db.ExecuteCommand("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;");

    var user = db.aspnet_Users.First();
    var pages = user.Pages.ToList();
}

This will result in the following profiler output

image

As you see, both queries execute within the same connection and the isolation level is set before the queries execute. So, both queries enjoy the isolation level.

Now there’s a catch, the connection does not close. This seems to be a bug in the DataContext that when it is disposed, it does not dispose the connection it is holding onto.

In order to solve this, I have made a child class of the DropthingsDataContext named DropthingsDataContext2 which overrides the Dispose method and closes the connection.

   class DropthingsDataContext2 : DropthingsDataContext, IDisposable
    {
        public new void Dispose()
        {
            if (base.Connection != null)
                if (base.Connection.State != System.Data.ConnectionState.Closed)
                {
                    base.Connection.Close();
                    base.Connection.Dispose();
                }

            base.Dispose();            
        }
    }

This solved the connection problem.

There you have it, no more transaction deadlock or lock contention from Linq to SQL queries. But remember, this is only to eliminate such problems when your database already has the right indexes. If you do not have the proper index, then you will end up having lock contention and query timeouts anyway.

There’s one more catch, READ UNCOMMITTED will return rows from transactions that have not completed yet. So, you might be reading rows from transactions that will rollback. Since that’s generally an exceptional scenario, you are more or less safe with uncommitted read, but not for financial applications where transaction rollback is a common scenario. In such case, go for committed read or repeatable read.

There’s another way you can achieve the same, which seems to work, that is using .NET Transactions. Here’s the code snippet:

using (var transaction = new TransactionScope(
    TransactionScopeOption.RequiresNew,
    new TransactionOptions()
    {
        IsolationLevel = IsolationLevel.ReadUncommitted,
        Timeout = TimeSpan.FromSeconds(30)
    }))
{
    using (var db = new DropthingsDataContext())
    {
        var user = db.aspnet_Users.First();
        var pages = user.Pages.ToList();

        transaction.Complete();
    }
}

Profiler shows a transaction begins and ends:

image

The downside is it wraps your calls in a transaction. So, you are unnecessarily creating transactions even for SELECT operations. When you do this hundred times per second on a web application, it’s a significant over head.

Some really good examples of deadlocks are given in this article:

http://www.code-magazine.com/article.aspx?quickid=0309101&page=2

I highly recommend it.

kick it on DotNetKicks.com
Posted by omar with 11 comment(s)