Chapter 15. Processing with worker roles

This chapter covers

  • Scaling the backend
  • Processing messages
  • Using the service management APIs to control your application

In Azure there are two roles that run your code. The first, the web role, has already been discussed. It plays the role of the web server, communicating with the outside world. The second role is the worker role. Worker roles act as backend servers—you might use one to run asynchronous or long-running tasks. Worker roles are usually message based, typically receiving those messages by polling a queue or some other shared storage. Like web roles, you can have multiple deployments of code running in different worker roles. Each deployment can have as many instances running your code as you'd like (within your Azure subscription limits).

It’s important to remember that a worker role is a template for a server in your application. The service model for your application defines how many instances of that role need to be started in the cloud. The role definition is similar to a class definition, and the instances are like objects.

If your system has Windows services or batch jobs, they can easily be ported to a worker role. For example, many systems have a series of regularly scheduled backend tasks. These might process all the new orders each night at 11 p.m. Or perhaps you have a positive pay system for banking, and it needs to connect to your bank each day before 3 p.m., except for banking holidays.

The worker role is intended to be started up and left running to process messages. You’ll likely want to dynamically increase and decrease the number of instances of your worker role to meet the demand on your system, as it increases and decreases throughout your business cycle.

When you create worker roles, you’ll want to keep in mind that Windows Azure doesn’t have a job scheduler facility, so you might need to build your own. You could easily build a scheduling worker role that polls a table representing the schedule of work to do. As jobs need to be executed, it could create the appropriate worker instance, pass it a job’s instructions, and then shut down the instance when the work is completed. You could easily do this with the service management APIs, which are discussed in chapter 18.
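To make that concrete, here's a rough sketch of what such a scheduling worker's Run method could look like. The ScheduleTable and JobQueue helpers are placeholders for your own table and queue wrappers, not part of the Azure SDK:

public override void Run()
{
    while (true)
    {
        // Look for any jobs whose scheduled time has arrived
        foreach (ScheduledJob job in ScheduleTable.GetJobsDueBy(DateTime.UtcNow))
        {
            // Hand the job's instructions to the worker pool via a queue
            JobQueue.Enqueue(job.Instructions);
            ScheduleTable.MarkDispatched(job);

            // If the job needs dedicated capacity, this is where you'd call the
            // service management APIs (chapter 18) to start or stop instances.
        }

        Thread.Sleep(TimeSpan.FromMinutes(1));   // check the schedule once a minute
    }
}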

We’re going to start off by building a simple service using a worker role. Once we have done that we’ll change it several times, to show you the options you have in communicating with your worker role instances.

15.1. A simple worker role service

When it’s all said and done, working with worker roles is quite easy. The core of the code for the worker role is the normal business code that gets the work done. There isn’t anything special about this part of a worker role. It’s the wrapper or handler around the business code that’s interesting. There are also some key concepts you’ll want to pay attention to, in order to build a reliable and manageable worker role.

In this section, we’ll show you how to build a basic worker role service. You have to have some way to communicate with the worker role, so we’ll first send messages to the worker through a queue, showing you how to poll a queue. (We won’t go too deep into queues, because they’re covered thoroughly in chapter 16.) We’ll then upgrade the service so you can use inter-role communication to send messages to your service.

We’ll use the term service fairly loosely when we’re talking about worker roles. We see worker roles as providing a service to the rest of the application, hopefully in a decoupled way. We don’t necessarily mean to imply the use of WS-* and Web Service protocols, although that’s one way to communicate with the role.

Let’s roll up our sleeves and come up with a service that does something a little more than return a string saying “Hello World.” In the next few sections, we’ll build a new service from scratch.

15.1.1. No more Hello World

Because Hello World doesn’t really cut it as an example this late in any book, we’re going to build a service that reverses strings. This is an important service in any business application, and the string-reversal industry is highly competitive.

There will be two parts to this service. The first part will be the code that actually does the work of reversing strings—although it’s composed of some unique intellectual property, it isn’t very important in our example. This is the business end of the service. The other part of the service gets the messages (requests for work) into the service. This second part can take many shapes, and which design you use comes down to the architectural model you want to support. Some workers never receive messages; they just constantly poll a database, or filesystem, and process what they find.

To build this string-reversal service you need to open up Visual Studio 2010 and start a new cloud project. For this project, add one worker role, and give it the name Worker-Process String, as shown in figure 15.1.

Figure 15.1. To build the service, you’ll start with a worker role. It’ll do all of the work and make it easy to scale as your business grows, especially during the string-reversal peak season.

At the business end will be our proprietary and award-winning algorithm for reversing strings. We intend to take the string-reversal industry by storm and really upset some industry captains. The core method will be called ReverseString, and it will take a string as its only parameter. You can find the secret sauce in the following listing. Careful, don’t post it on a blog or anything.

Listing 15.1. The magical string-reversal method
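A minimal version of the method, matching the description that follows, looks something like this:

public string ReverseString(string originalString)
{
    // A buffer exactly as long as the original string
    char[] reversed = new char[originalString.Length];

    // Walk the buffer forward while walking the original string backward
    for (int i = 0; i < originalString.Length; i++)
    {
        reversed[i] = originalString[originalString.Length - 1 - i];
    }

    return new string(reversed);
}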

The code in the previous listing is fairly simple—it’s normal .NET code that you could write on any platform that supports .NET (mobile, desktop, on-premises servers, and so on), not just for the cloud. The method declares a character array to be a buffer that’s the same length as the original string (because our R&D department has discovered that every reversed string is exactly as long as the original string). It then loops over the string, taking characters off the end of the original string and putting them at the front of the buffer, moving along the string and the buffer in opposite directions. Finally, the string in the buffer is returned to the caller.

For this example, we’ll put this business logic right in the WorkerRole.cs class. Normally this code would be contained in its own class, and would be referenced into the application. You can do that later if you want, but we want to keep the example simple so you can focus on what’s important.

We’ve chosen to put this service in a worker in the cloud so that we can dynamically scale how many servers we have running the service, based on usage and demand. We don’t want to distract our fledgling company from writing the next generation of string-reversal software with the details and costs of running servers.

If you ran this project right now, you wouldn't see anything happen. The cloud simulator on your desktop would start up, and the worker role would be instantiated, but nothing would get done. By default, the worker role template comes with an infinite polling loop in the Run method, which is called once the role instance has been initialized and is ready to run. We like that they called it Run, but calling it DoIt would have been funnier.
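The generated Run method looks roughly like this (give or take the trace messages):

public override void Run()
{
    Trace.WriteLine("WorkerRole entry point called", "Information");

    while (true)
    {
        // Nothing useful happens yet; the template just sleeps and traces
        Thread.Sleep(10000);
        Trace.WriteLine("Working", "Information");
    }
}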

Now that you have your code in the worker role, how do you access it and use it? The next section will focus on the two primary ways you can send messages to a worker role instance in an active way.

15.2. Communicating with a worker role

Worker roles can receive the messages they need to process in either a push or a pull way. Pushing a message to the worker instance is an active approach, where you’re directly giving it work to do. The alternative is to have the role instances call out to some shared source to gather work to do, in a sense pulling in the messages they need. When pulling messages in, remember that there will possibly be several instances pulling in work. You’ll need a mechanism similar to what the Azure Queue service provides to avoid conflicts between the different worker role instances that are trying to process the same work.

Keep in mind the difference between roles and role instances, which we covered earlier. Although it’s sometimes convenient to think of workers as a single entity, they don’t run as a role when they’re running, but as one or more instances of that role. When you’re designing and developing your worker roles, keep this duality in mind. Think of the role as a unit of deployment and management, and the role instance as the unit of work assignment. This will help reduce the number of problems in your architecture.

One advantage that worker roles have over web roles is that they can have as many service endpoints as they like, using almost any transport protocol and port. Web roles are limited to HTTP/S and can have two endpoints at most. We’ll use the worker role’s flexibility to provide several ways to send it messages.

We’ll cover three approaches to sending messages to a worker role instance:

  • A pull model, where each worker role instance polls a queue for work to be completed
  • A push model, where a producer outside Azure sends messages to the worker role instance
  • A push model, where a producer inside the Azure application sends messages to the worker role instance

Let’s look first at the pull model.

15.2.1. Consuming messages from a queue

The most common way for a worker role to receive messages is through a queue. This will be covered in depth in chapter 16 (which is on messaging with the queue), but we’ll cover it briefly here.

The general model is to have a while loop that never quits. This approach is so common that the standard worker role template in Visual Studio provides one for you. The role instance tries to get a new message from the queue it’s polling on each iteration of the loop. If it gets a message, it’ll process the message. If it doesn’t, it’ll wait a period of time (perhaps 5 seconds) and then poll the queue again.

The core of the loop calls the business code. Once the loop has a message, it passes the message off to the code that does the work. Once that work is done, the message is deleted from the queue, and the loop polls the queue again.

while (true)
{
    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        // A message was waiting: do the work, then remove it from the queue
        DoWorkHere(msg);
        queue.DeleteMessage(msg);
    }
    else
    {
        // Nothing to do: wait 5 seconds before polling again
        Thread.Sleep(5000);
    }
}

You might jump to the conclusion that you could easily poll an Azure Table for work instead of polling a queue. Perhaps you have a property in your table called Status that defaults to new. The worker role could poll the table, looking for all entities whose Status property equals new. Once a list is returned, the worker could process each entity and set their Status to complete. At its base, this sounds like a simple approach.

Unfortunately, this approach is a red herring. It suffers from some severe drawbacks that you might not find until you’re in testing or production because they won’t show up until you have multiple instances of your role running.

The first problem is of concurrency. If you have multiple instances of your worker role polling a table, they could each retrieve the same entities in their queries. This would result in those entities being processed multiple times, possibly leading to status updates getting entangled. This is the exact concurrency problem the Azure Queue service was designed to avoid.

The other, more important, issue is one of recoverability and durability. You want your system to be able to recover if there’s a problem processing a particular entity. Perhaps you have each worker role set the status property to the name of the instance to track that the entity is being worked on by a particular instance. When the work is completed, the instance would then set the status property to done. On the surface, this approach seems to make sense. The flaw is that when an instance fails during processing (which will happen), the entity will never be recovered and processed. It’ll remain flagged with the instance name of the worker processing the item, so it’ll never be cleared and will never be picked up in the query of the table to be processed. It will, in effect, be “hung.” The system administrator would have to go in and manually reset the status property back to new. There isn’t a way for the entity to be recovered from a failure and be reassigned to another instance.

It would take a fair amount of code to overcome the issues of polling a table by multiple consumers, and in the end you’d end up having built the same thing as the Azure Queue service. The Queue service is designed to play this role, and it removes the need to write all of this dirty plumbing code. The Queue service provides a way for work to be distributed among multiple worker instances, and to easily recover that work if the instance fails. A key concept of cloud architecture is to design for failure recoverability in an application. It’s to be expected that nodes go down (for one reason or another) and will be restarted and recovered, possibly on a completely different server.

Queues are the easiest way to get messages into a worker role, and they'll be discussed in detail in the next chapter. Now, though, we'll look at how a worker role can receive messages from outside of Azure by exposing a public service endpoint.

15.2.2. Exposing a service to the outside world

Web roles are built to receive traffic from outside of Azure. Their whole point in life is to receive messages from the internet (usually from a browser) and respond with some message (usually HTML). The great thing is that when you have multiple web role instances, they’re automatically enrolled in a load balancer. This load balancer automatically distributes the load across the different instances you have running.

Worker roles can do much the same thing, but because you aren’t running in IIS (which isn’t available on a worker role), you have to host the service yourself. The only real option is to build the service as a WCF service.

Our goal is to convert our little string-reversal method into a WCF service, and then expose that externally so that customers can call the service. The first step is to remove the loop that polls the queue and put in some service plumbing. When you host a service in a worker role, regardless of whether it is for external or internal use, you need to declare an endpoint. How you configure this endpoint will determine whether it allows traffic from sources internal or external to the application. The two types of endpoints are shown in figure 15.2. If it’s configured to run externally, it will use the Azure load balancers and distribute service calls across all of the role instances running the server, much like how the web role does this. We’ll look at internal service endpoints in the next section.

Figure 15.2. Worker roles have two ways of exposing their services. The first is as an input service—these are published to the load balancer and are available externally (role 0). The second is as an internal service, which isn’t behind a load balancer and is only visible to your other role instances (role 1).

The next step in the process is to define the endpoint. You can do this the macho way in the configuration of the role, or you can do it in the Visual Studio Properties window. If you right-click on the Worker-Process String worker role in the Azure project and choose Properties, you’ll see the window in figure 15.3.

Figure 15.3. Adding an external service endpoint to a worker role. This service endpoint will be managed by Azure and be enrolled in the load balancer. This will make the service available outside of Azure.

Name the service endpoint StringReverserService and set it to be an input endpoint, using TCP on port 2202. There's no need to use any certificates or security at this time.

After you save these settings, you'll find the equivalent settings in the ServiceDefinition.csdef file:

<Endpoints>
  <InputEndpoint name="StringReverserService" protocol="tcp" port="2202" />
</Endpoints>

You might normally host your service in IIS or WAS, but those aren't available in a worker role. In the future, you might be able to use Windows Server AppFabric, but that isn't available yet, so you'll have to do this the old-fashioned way and host the WCF service yourself using ServiceHost. A ServiceHost is exactly what its name implies: a container that runs your service. It will contain the service, manage the endpoints and configuration, and handle the incoming service requests.

Next you need to add a method called StartStringReversalService. This method will wire up the service to the ServiceHost and the endpoint you defined. The contents of this method are shown in the following listing.

Listing 15.2. The StartStringReversalService method wires up the service
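In outline, the method looks something like this (the binding, address format, and error handling are simplified here):

private void StartStringReversalService()
{
    // Host the WCF service ourselves; IIS isn't available in a worker role
    ServiceHost serviceHost = new ServiceHost(typeof(ReverseStringTools));

    // Look up the endpoint Azure assigned to this instance at runtime
    RoleInstanceEndpoint externalEndpoint =
        RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["StringReverserService"];

    // A binary TCP binding with security turned off (demo purposes only!)
    NetTcpBinding binding = new NetTcpBinding(SecurityMode.None);

    // Address, binding, and contract: the ABCs of WCF
    serviceHost.AddServiceEndpoint(
        typeof(IReverseString),
        binding,
        String.Format("net.tcp://{0}/StringReverserService", externalEndpoint.IPEndpoint));

    serviceHost.Open();

    // Sleep forever so the method never returns and the host stays alive
    while (true)
    {
        Thread.Sleep(TimeSpan.FromMinutes(5));
    }
}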

Listing 15.2 is an abbreviated version of the real method, shortened so that it fits into the book better. We didn’t take out anything that’s super important. We took out a series of trace commands so we could watch the startup and status of the service. We also abbreviated some of the error handling, something you would definitely want to beef up in a production environment.

Most of this code is normal for setting up a ServiceHost. You first have to tell the service host the type of the service that's going to be hosted. In this case, it's the ReverseStringTools type.

When you go to add the service endpoint to the service host, you’re going to need three things, the ABCs of WCF: address, binding, and contract. The contract is provided by your code, IReverseString, and it’s a class file that you can reference to share service contract information (or use MEX like a normal web service). The binding is a normal TCP binary binding, with all security turned off. (We would only run with security off for debug and demo purposes!)
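The contract itself can be a one-operation interface; a sketch of it might look like this (the operation mirrors the ReverseString method from listing 15.1):

[ServiceContract]
public interface IReverseString
{
    [OperationContract]
    string ReverseString(string originalString);
}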

Then the address is needed. You can set up the address by referencing the service endpoint from the Azure project. You won't know the real IP address the service will be running under until runtime, so you'll have to build it on the fly by accessing the collection of endpoints from the RoleEnvironment.CurrentRoleInstance.InstanceEndpoints collection. The collection is a dictionary, so you can pull out the endpoint you want to reference with the name you used when setting it up—in this case, StringReverserService. Once you have a reference to the endpoint, you can access the IP address that you need to set up the service host.

After you have that wired up, you can start the service host. This will plug in all the components, fire them up, and start listening for incoming messages. This is done with the Open method.

Once the service is up, you'll want the main execution thread to sleep forever so that the host stays up and running. If you didn't include the sleep loop, execution would fall out of the end of the method and the service host would be lost along with it. At this point, the worker role instance is sitting there, sleeping, while the service host is running, listening for and responding to messages.

We wired up a simple WPF test client, as shown in figure 15.4, to see if our service is working. There are several ways you could write this test harness. If you’re using .NET 4, it’s very common to use unit tests to test your services instead of an interactive WPF client. Your other option would be to use WCFTestClient.exe, which comes with Visual Studio.
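However you build the harness, the client-side plumbing is the same. Here's a bare-bones version using ChannelFactory and the IReverseString contract; the cloudapp.net address is just a placeholder for wherever your service ends up:

NetTcpBinding binding = new NetTcpBinding(SecurityMode.None);
EndpointAddress address =
    new EndpointAddress("net.tcp://yourapp.cloudapp.net:2202/StringReverserService");

ChannelFactory<IReverseString> factory =
    new ChannelFactory<IReverseString>(binding, address);
IReverseString proxy = factory.CreateChannel();

Console.WriteLine(proxy.ReverseString("Hello from Azure"));   // "eruzA morf olleH"
factory.Close();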

Figure 15.4. A simple client that consumes our super string-reversing service. The service is running in a worker role, running in Azure, behind the load balancers. kltpzyxM! kltpzyxM! kltpzyxM!

Exposing public service endpoints is useful, but there are times when you’ll want to expose services for just your use, and you don’t want them made public. In this case, you’ll want to use inter-role communication, which we’ll look at next.

15.2.3. Inter-role communication

Exposing service input endpoints, as we just discussed, can be useful. But many times, you just need a way to communicate between your role instances. Usually you could use a queue, but at times there might be a need for direct communication, either for performance reasons or because the process is synchronous in nature.

You can enable communication directly from one role instance to another, but there are some issues you should be aware of first. The biggest issue is that you’ll have direct access to an individual role instance, which means there’s no separation that can deal with load balancing. Similarly, if you’re communicating with an instance and it goes down, your work is lost. You’ll have to write code to handle this possibility on the client side.

To set up inter-role communication, you need to add an internal endpoint in the same way you add an input endpoint, but in this case you’ll set the type to Internal (instead of Input), as shown in figure 15.5. The port will automatically be set to dynamic and will be managed for you under the covers by Azure.

Figure 15.5. You can set up an internal endpoint in the same way you set up an external endpoint. In this case, though, your service won’t be load balanced, and the client will have to know which service instance to talk to.
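In the ServiceDefinition.csdef file, an internal endpoint looks much like the input endpoint you defined earlier, except there's no fixed port (the endpoint name here is just an example):

<Endpoints>
  <InternalEndpoint name="MyServiceEndpointName" protocol="tcp" />
</Endpoints>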

Using an internal endpoint is a lot like using an external endpoint, from the point of view of your service. Either way, your service doesn’t know about any other instances running the service in parallel. The load balancing is handled outside of your code when you’re using an external endpoint, and internal endpoints don’t have any available load balancing. This places the choice of which service instance to consume on the shoulders of the service consumer itself.

Most of the work involved with internal endpoints is handled on the client side, your service consumer. Because there can be a varying number of instances of your service running at any time, you have to be prepared to decide which instance to talk to, if not all of them. You also have to be wily enough to not call yourself if calling the service from a sibling worker role instance.

You can access the set of instances running, and their exposed internal endpoints, with the RoleEnvironment static class:

foreach (var instance in RoleEnvironment.CurrentRoleInstance.Role.Instances)
{
    // Skip ourselves; send a message to every other instance of this role
    if (instance != RoleEnvironment.CurrentRoleInstance)
        SendMessage(instance.InstanceEndpoints["MyServiceEndpointName"]);
}

The preceding sample code loops through all of the available role instances of the current role. As it loops, it could access a collection of any type of role in the application, including itself. So, for each instance, the code checks to see if that instance is the instance the code is running in. If it isn’t, the code will send that instance a message. If it’s the same instance, the code won’t send it a message, because sending a message to oneself is usually not productive.

All three ways of communicating with a worker role have their advantages and disadvantages, and each has a role to play in your architecture:

  • Use a queue for complete separation of your instances from the service consumers.
  • Use input endpoints to expose your service publicly and leverage the Azure load balancer.
  • Use internal endpoints for direct and synchronous communication with a specific instance of your service.

Now that we’ve covered how you can communicate with a worker role, we should probably talk about what you’re likely to want to do with a worker role.

15.3. Common uses for worker roles

Worker roles are blank slates—you can do almost anything with them. In this section, we’re going to explore some common, and some maybe not-so-common, uses for worker roles.

The most common use is to offload work from the frontend. This is a common architecture in many applications, in the cloud or not. We’ll also look at how to use multithreading in roles, how to simulate a worker role, and how to break a large process into connected smaller pieces.

15.3.1. Offloading work from the frontend

We’re all familiar with the user experience of putting products into a shopping cart and then checking out with an online retailer. You might have even bought this book online. How retailers process your cart and your order is one of the key scenarios for how a worker role might be used in the cloud.

Many large online retailers split the checkout process into two pieces. The first piece is interactive and user-facing. You happily fill your shopping cart with lots of stuff and then check out. At that time, the application gathers your payment details, gives you an order number, and tells you that the order has been processed. Then it emails all of this so you can have it all for your records. This is the notification email shown in figure 15.6.

Figure 15.6. The typical online retailer will process a customer’s order in two stages. The first saves the cart for processing and immediately sends back a thank you email with an order number. Then the backend servers pick up the order and process it, resulting in a final email with all of the real details.

After the customer-facing work is done, the backend magic kicks in to complete the processing of the order. You see, when the retailer gave you an order number, they were sort of fibbing. All they did was submit the order to the backend processing system via a message queue and give you the order number that can be used to track it. One of the servers that are part of the backend processing group picks up the order and completes the processing. This probably involves charging the credit card, verifying inventory, and determining the ability to ship according to the customer’s wishes. Once this backend work is completed, a second email is sent to the customer with an update, usually including the package tracking number and any other final details. This is the final email shown in figure 15.6.

By breaking the system into two pieces, the online retailer gains a few advantages. The biggest is that the user’s experience of checking out is much faster, giving them a nice shopping experience. This also takes a lot of load off of the web servers, which should be simple HTML shovels. Because only a fraction of shoppers actually check out (e-tailers call this the conversion rate), it’s important to be able to scale the web servers very easily. Having them broken out makes it easy to scale them horizontally (by adding more servers), and makes it possible for each web server to require only simple hardware. The general strategy at the web server tier is to have an army of ants, or many low-end servers.

This two-piece system also makes it easier to plan for failure. You wouldn’t want a web server to crash while processing a customer’s order and lose the revenue, would you?

This leaves the backend all the time it needs to process the orders. Backend server farms tend to consist of fewer, larger servers, when compared to the web servers. Although you can scale the number of backend servers as well, you won’t have to do that as often, because you can just let the flood of orders back up in the queue. As long as your server capacity can process them in a few hours, that’s OK.

Azure provides a variety of server sizes for your instances to run on, and sometimes you’ll want more horsepower in one box for what you’re doing. In that case, you can use threading on the server to tap that entire horsepower.

15.3.2. Using threads in a worker role

There may be times when the work assigned to a particular worker role instance needs multithreading, or the ability to process work in parallel by using separate threads of execution. This is especially true when you’re migrating an existing application to the Azure platform. Developing and debugging multithreaded applications is very difficult, so deciding to use multithreading isn’t a decision you should make lightly.

The worker role does allow for the creation and management of threads for your use, but as with code running on a normal server, you don’t want to create too many threads. When the number of threads increases, so does the amount of memory in use. The context-switching cost of the CPU will also hinder efficient use of your resources. You should limit the number of threads you’re using to two to four per CPU core.

A common scenario is to spin up an extra thread in the background to process some asynchronous work. Doing this is OK, but if you plan on building a massive computational engine, you’re better off using a framework to do the heavy lifting for you. The Parallel Extensions to .NET is a framework Microsoft has developed to help you parallelize your software. The Parallel Extensions to .NET shipped as part of .NET 4.0 in April of 2010.
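As a sketch of the difference, here's a batch of messages fanned out across an instance's cores with Parallel.ForEach (from System.Threading.Tasks) instead of hand-rolled threads; DoWorkHere stands in for your own business logic, as in the earlier queue-polling loop:

public void ProcessBatch(IEnumerable<CloudQueueMessage> messages)
{
    Parallel.ForEach(messages, message =>
    {
        // Do the real work for a single message here
        DoWorkHere(message);
    });
}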

Although we always want to logically separate our code to make it easier to maintain, sometimes the work involved doesn’t need a lot of horsepower, so we may want to deploy both the web and the worker sides of the application to one single web role.

15.3.3. Simulating worker roles in a web role

Architecting your application into discrete pieces, some of which are frontend and some of which are backend, is a good thing. But there are times when you need the logical separation, but not the physical separation. This might be for speed reasons, or because you don’t want to pay for a whole worker role instance when you just need some lightweight background work done.

Maintaining Logical Separation

If you go down this path, you must architect your system so you can easily break it out into a real worker role later on as your needs change. This means making sure that while you’re breaking the physical separation, you’re at least keeping the logical separation. You should still use the normal methods of passing messages to that worker code. If it would use a queue to process messages in a real worker instance, it should use a queue in the simulated worker instance as well. Take a gander at figure 15.7 to see what we mean. At some point, you’ll need to break the code back out to a real worker role, and you won’t want to have to rewrite a whole bunch of code.

Figure 15.7. You can simulate a worker role in your web role if it’s very lightweight.

Be aware that the Fabric Controller will be ignorant of what you’re doing, and it won’t be able to manage your simulated worker role. If that worker role code goes out of control, it will take down the web instance it’s running in, which could cascade to a series of other problems. You’ve been warned.

If you’re going to do this, make sure to put the worker code into a separate library so that primary concerns of the web instance aren’t intermingled with the concerns of the faux worker instance. You can then reference that library and execute it in its own thread, passing messages to it however you would like. This will also make it much easier to split it out into its own real worker role later.

Utilizing Background Threads

The other issue is getting a background thread running so it can execute the faux worker code. An approach we’ve worked with is to launch the process on a separate thread during the Session_Start event of the global.asax. This will fire up the thread once when the web app is starting up, and leave it running.

Our first instinct was to use the Application_Start event, but this won’t work. The RoleManager isn’t available in the Application_Start event, so it’s too early to start the faux worker.

We want to run the following code:

Thread t = new Thread(new ThreadStart(FauxWorkerSample.Start));
t.Start();

Putting the thread start code in the Session_Start event has the effect of trying to start another faux worker every time a new ASP.NET session is started, which is whenever there’s a new visitor to the website. To protect against thousands of background faux workers being started, we use the Singleton pattern. This pattern will make sure that only one faux worker is started in that web instance.

When we’re about to create the thread, we check a flag in the application state to see if a worker has already been created:

object obj = Application["FauxWorkerStarted"];

if (obj == null)
{
    Application["FauxWorkerStarted"] = true;
    Thread t = new Thread(new ThreadStart(FauxWorkerSample.Start));
    t.Start();
}

If the worker hasn’t been created, the flag won’t exist in the application state property bag, so it will equal null in that case. If this is the first session, the thread will be created, pointed at the method we give it (FauxWorkerSample.Start in this case), and it will start processing in the background.

When you start it in this manner, you'll have access to the RoleManager with the ability to write to the log, manage system health, and act like a normal worker instance. You could adapt this strategy to work with the OnStart event handler in your webrole.cs file. This might be a cleaner place to put it, but we wanted to show you the dirty workaround here.

Our next approach is going to cover how best to handle a large and complex worker role.

15.3.4. State-directed workers

Sometimes the code that a worker role runs is large and complex, and this can lead to a long and risky processing time. In this section, we’ll look at a strategy you can use to break this large piece down into manageable pieces, and a way to gain flexibility in your processing.

As we’ve said time and time again, worker roles tend to be message-centric. The best way to scale them is by having a group of instances take turns consuming messages from a queue. As the load on the queue increases, you can easily add more instances of the worker role. As the queue cools off, you can destroy some instances.

In this section, we’ll look at why large worker roles can be problematic, how we can fix this problem, and what the inevitable drawbacks are. Let’s start by looking at the pitfalls of using a few, very large workers.

The Problem

Sometimes the work that’s needed on a message is large and complicated, which leads to a heavy, bloated worker. This heaviness also leads to a brittle codebase that’s difficult to work with and maintain because of the various code paths and routing logic.

A worker that takes a long time to process a single request is harder to scale and can’t process as many messages as a group of smaller workers. A long-running unit of work also exposes your system to more risk. The longer an item takes to be processed, the more likely it is that the work will fail and have to be started over. This is no big deal if the processing takes 3 seconds, but if it takes 20 minutes or 20 hours, you have a significant cost to failure.

This problem can be caused by one message being very complex to process, or by a batch of messages being processed as a group. In either case, the unit of work being performed is large, and this raises risk. This problem is often called the “pig in a python” problem (as shown in figure 15.8), because you end up with one large chunk of work moving through your systems.

Figure 15.8. The “pig in a python” problem can often be seen in technology and business. It’s when a unit of work takes a long time to complete, like when a python eats a pig. It can take months for the snake to digest the pig, and it can’t do much of anything else during that timeframe.

We need a way to digest this work a little more gracefully.

The Solution

The best way to digest this large pig is to break the large unit of work into a set of smaller processes. This will give you the most flexibility when it comes to scaling and managing your system. But you want to be careful that you don’t break the processes down to sizes that are too small. At this level, the latency of communicating with the queue and other storage mechanisms in very chatty ways may introduce more overhead than you were looking for.

When you analyze the stages of processing on the message, you’ll likely conceive of several stages to the work. You can figure this out by drawing a flow diagram of the current bloated worker code. For example, when processing an order from an e-commerce site, you might have the following stages:

  1. Validate the data in the order.
  2. Validate the pricing and discount codes.
  3. Enrich the order with all of the relevant customer data.
  4. Validate the shipping address.
  5. Validate the payment information.
  6. Charge the credit card.
  7. Verify that the products are in stock and able to be shipped.
  8. Enter the shipping orders into the logistics system for the distribution center.
  9. Record the transaction in the ERP system.
  10. Send a notification email to the customer.
  11. Sit back and profit.

You can think of each state the message goes through as a separate worker role, connected together with a queue for each state. Instead of one worker doing all of the work for a single order, it only processes one of the states for each order. The different queues represent the different states the message could have. Figure 15.9 compares a big worker that performs all of the work, to a series of smaller workers that break the work out (validating, shipping, and notifying workers).

Figure 15.9. A monolithic worker role compared to a state-driven worker role. The big worker completes all the work in one step, leading to the “pig in a python” problem of being harder to maintain and extend as needed. Instead, we can break the process into a series of queues and workers, each dedicated to servicing a specific state or stage of the work to be done.

There might also be some other processing states you want to plan for. Perhaps one for really bad orders that need to be looked at by a real human, or perhaps you have platinum-level customers who get their orders processed and shipped before normal run-of-the-mill customers. The platinum orders would go into a queue that’s processed by a dedicated pool of instances.

You could even have a bad order routed to an Azure table. A customer service representative could then access that data with a CRM application or a simple InfoPath form, fix the order, and resubmit it back into the proper queue to continue being processed. This process is called repair and resubmit, and it’s an important element to have in any enterprise processing engine.

You won’t be able to put the full order details into the queue message—there won’t be enough room. The message should contain a complete work ticket, representing where the order data can be found (perhaps via an order ID), as well as some state information, and any information that would be useful in routing the message through the state machine. This might include the service class of the customer, for example—platinum versus silver.

As the business changes over time, and it will, making changes to how the order is processed is much easier than trying to perform heart surgery on your older, super complicated, and bloated worker role code. They don't say spaghetti code for nothing. For example, you might need to add a new step between steps 8 and 9 in our previous list. You could simply create a new queue and a new worker role to process that queue. Then the worker role for the state right before the new one would need to be updated to point to the new queue. Hopefully the changes to the existing parts of the system can be limited to configuration changes.

Even Cooler—Make the State Worker Role its Own Azure Service

How you want to manage your application in the cloud should be a primary consideration in how you structure the Visual Studio solution. Each solution becomes a single management point. If you want to manage different pieces without affecting the whole system, those should be split out into separate solutions.

In this scenario, it would make sense to separate each state worker role to its own service in Azure, which would further decouple them from each other. This way, when you need to restart one worker role and its queue, you won’t affect the other roles.

In a more dynamic organization, you might need to route a message through these states based on some information that’s only available at runtime. The routing information could be stored in a table, with rules for how the flow works, or by simply storing the states and their relationships in the cloud service configuration file. Both of these approaches would let you update how orders were processed at runtime without having to change code. We’ve done this when orders needed different stages depending on what was in the order, or where it was going. In one case, if a controlled substance was in the order, the processing engine had to execute a series of additional steps to complete the order.

This approach is often called a poor man’s service bus because it uses a simple way of connecting the states together, and they’re fairly firm at runtime. If you require a greater degree of flexibility in the workflow, you would want to look at the Itinerary[1] pattern. This lets the system build up a schedule of processing stops based on the information present at runtime. These systems can get a little more complicated, but they result in a system that’s more easily maintained when there’s a complex business process.

1 For more information on the Itinerary pattern, see the Microsoft Application Architecture Guide from Patterns & Practices at Microsoft. It can be found at http://apparchguide.codeplex.com.

Oops, it’s Not Nirvana

As you build this out, you’ll discover a drawback. You now have many more running worker roles to manage. This can create more costs, and you still have to plan for when you eventually will swallow a pig. If your system is tuned for a slow work day, with one role instance per state, and you suddenly receive a flood of orders, the large amount of orders will move down the state diagram like a pig does when it’s eaten by a python. This forces you to scale up the number of worker instances at each state.

Although this flexibility is great, it can get expensive. With this model, you have several pools of instances instead of one general-purpose pool, which results in each pool having to increase and then decrease as the pig (the large flood of work) moves through the pipeline. In the case of a pig coming through, this can lead to a stall in the state machine as each state has to wait for more instances to be added to its pool to handle the pig (flood of work). This can be done easily using the service management APIs, but it takes time to spin up and spin down instances—perhaps 20 minutes.

The next step to take, to avoid the pig in a python problem, is to build your worker roles so that they’re generic processors, all able to process any state in the system. You would still keep the separate queues, which makes it easier to know how many messages are in each state.

You could also condense the queues down to one, with each message declaring what state the order is in as part of its data, but we don’t like this approach because it leads to favoritism for the most recent orders placed in the processors, and it requires you to restart all of your generic workers when you change the state graph. You can avoid this particular downfall by driving the routing logic with configuration and dependency injection. Then you would only need to update the configuration of the system and deploy a new assembly to change the behavior of the system.

The trick to gaining both flexibility and simplicity in your architecture is to encapsulate the logic for each state in the worker, separating it so it's easily maintainable, while pulling them all together so there's only one pool of workers. The worker, in essence, becomes a router. You can see how this might work in figure 15.10. Each message is routed, based on its state and other runtime data, to the necessary state processor. This functions much like a factory. Each state would have a class that knows how to process that state. Each state class would implement the same interface, perhaps IOrderProcessStage. This would make it easy for the worker to instantiate the correct class based on the state, and then process it. Most of these classes would then send the message back to the generic queue, with a new state, and the cycle would start again.

Figure 15.10. By moving to a consolidated state-directed worker, we’ll have one queue and one worker. The worker will act as a router, sending each inbound message to the appropriate module based on the message’s state and related itinerary. This allows us to have one large pool of workers, but makes it easier to manage and decompose our bulky process.
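A sketch of that router shape follows, reusing the work ticket class from earlier. The stage classes and state names are illustrative, not a prescribed set:

public interface IOrderProcessStage
{
    // Process the ticket and return the next state, or null when the order is done
    string Process(OrderWorkTicket ticket);
}

public class ValidateOrderStage : IOrderProcessStage
{
    public string Process(OrderWorkTicket ticket)
    {
        // Validate the order data, pricing, and so on
        return "Notify";   // hand the ticket to the next stage
    }
}

public class NotifyCustomerStage : IOrderProcessStage
{
    public string Process(OrderWorkTicket ticket)
    {
        // Send the customer their confirmation email
        return null;       // no further processing needed
    }
}

public class StateRouter
{
    private readonly Dictionary<string, IOrderProcessStage> stages =
        new Dictionary<string, IOrderProcessStage>
        {
            { "Validate", new ValidateOrderStage() },
            { "Notify",   new NotifyCustomerStage() }
        };

    // Look up the processor for the ticket's current state and run it
    public string Route(OrderWorkTicket ticket)
    {
        return stages[ticket.State].Process(ticket);
    }
}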

There are going to be times when you’re working with both web and worker roles and you’re either importing legacy code that needs access to a local drive, or what you’re doing requires it. That’s why we’ll discuss local storage next.

15.4. Working with local storage

There are times when the code you’re working with will need to read from and write to the local filesystem. Windows Azure allows for you to request and access a piece of the local disk on your role instance.

You can create this space by using the configuration of your role. You won’t have control over the path of the directory you’re given access to, so you should make sure that the file path your code needs to access is part of your configuration. A hardcoded path will never remain accurate in the cloud environment.

We recommend that you only use local storage when you absolutely have to, because of some limitations we’ll cover later in this section. You’ll likely need to use local storage the most when you’re migrating to the cloud existing frameworks or applications that require local disk access.

15.4.1. Setting up local storage

You can configure the local storage area you need as part of your role by adding a few simple lines of configuration to your role. The tag we’re going to work with is the LocalStorage tag. It will tell the Fabric Controller to allocate local file storage space on each server the role instance is running on.

In the configuration element, you need to name the storage space. This name will become the name of the folder that’s reserved for you. You’ll need to define how much filesystem space you’ll need. The current limit is 20 GB per role instance, with a minimum of 1 MB.

<LocalResources>
  <LocalStorage name="FilesUploaded" cleanOnRoleRecycle="false" sizeInMB="15" />
  <LocalStorage name="VirusScanPending" cleanOnRoleRecycle="true" sizeInMB="5" />
</LocalResources>

You can declare multiple local storage resources, as shown in the preceding code snippet. It’s important that the local file storage only be used for temporary, unimportant files. The local file store isn’t replicated or preserved in any way. If the instance fails and it’s moved by the Fabric Controller to a new server, the local file store isn’t preserved, which means any files that were present will be lost.

Tip

There is one time when the local file storage won't be lost, and that's when the role is recycled, either as part of a service management event on your part, or when the Fabric Controller is responding to a minor issue with your server. In these cases, if you've set the cleanOnRoleRecycle parameter to false, the current files will still be there when your instance comes back online.

Instances may only access their own local storage. An instance may not access another instance’s storage. You should use Azure BLOB storage if you need more than one instance to access the same storage area.

Now that you’ve defined your local storage, let’s look at how you can access it and work with it.

15.4.2. Working with local storage

Working with files in local storage is just like working with normal files. When your role instance is started, the agent creates a folder with the name you defined in the configuration in a special area on the C: drive on your server. Rules are put in place to make sure the folder doesn’t exceed its assigned quota for size. To start using it, you simply need to get a handle for it.

To get a handle to your local storage area, you need to use the GetLocalResource method. You’ll need to provide the name of the local resource you defined in the service definition file. This will return a LocalResource object:

public static LocalResource uploadFolder = RoleEnvironment.GetLocalResource("FilesUploaded");

After you have this reference to the local folder, you can start using it like a normal directory. To get the physical path, so you can check the directory contents or write files to it, you would use the uploadFolder reference from the preceding code.

string rootPathName = uploadFolder.RootPath;

In the sample code provided with this book, there’s a simple web role that uses local storage to store uploaded files. Please remember that this is just a sample, and that you wouldn’t normally persist important files to the local store, considering its transient nature. You can view the code we used to do this in listing 15.3. When calling the RootPath method in the local development fabric, Brian’s storage is located here:

C:\Users\brprince\AppData\Local\dftmp\s0\deployment(32)\res\deployment(32).AiA_15___Local_Storage_post_pdc.LocalStorage_WebRole.0\directory\FilesUploaded\

When we publish this little application to the cloud, it returns the following path:

C:\Resources\directory\0c28d4f68a444ea380288bf8160006ae.LocalStorage_WebRole.FilesUploaded\
Listing 15.3. Working with local file storage
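A stripped-down sketch of the page code-behind follows; the control and label names are ours, and a real application wouldn't echo files straight back to the browser. It assumes the usual System.IO and Microsoft.WindowsAzure.ServiceRuntime namespaces:

public partial class _Default : System.Web.UI.Page
{
    // Grab a handle to the local storage area defined in the service definition
    private static LocalResource uploadFolder =
        RoleEnvironment.GetLocalResource("FilesUploaded");

    protected void Page_Load(object sender, EventArgs e)
    {
        // Show where local storage physically lives on this instance
        lblLocalPath.Text = uploadFolder.RootPath;
    }

    protected void btnUpload_Click(object sender, EventArgs e)
    {
        // Save the uploaded file into the role's local storage folder
        string destination = Path.Combine(uploadFolder.RootPath, fileUpload.FileName);
        fileUpload.SaveAs(destination);
        lblSavedFile.Text = destination;

        // Write the (text) file back out to the browser with normal file APIs
        Response.Write(File.ReadAllText(destination));
    }
}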

Now that we know where the files will be stored, we can start working with them. In the sample application, we have a simple file-upload control. When the web page is loaded, we write out the local file path to the local storage folder that we've been assigned. Once the file is uploaded, we store it in the local storage and write out its filename and path. We then write the file back out to the browser using normal file APIs. Our example code was designed to work only with text files, to keep things simple.

The local storage option is great for volatile local file access, but it isn’t durable and may disappear on you. If you need durable storage, look at Azure storage or SQL Azure. If you need shared storage that’s super-fast, you should consider the Windows Server AppFabric distributed cache. This is a peer-to-peer caching layer that can run on your roles and provide a shared in-memory cache for your instances to work with.

15.5. Summary

In this chapter, we’ve looked at how you can process work in the background with the worker role in Azure. The worker role is an important tool for the cloud developer. It lets you do work when there isn’t a user present, whether because you’ve intentionally separated the background process from the user (in the case of a long-running checkout process) or because you’ve broken your work into a discrete service that will process messages from a queue.

Worker roles scale just like web roles, but they don’t have a built-in load balancer like web roles do. You’ll usually aggregate worker roles behind a queue, with each instance processing messages from the queue, thereby distributing the work across the group. This gives you the flexibility to increase or decrease the number of worker instances as the need arises.

It’s quite possible to have an Azure application consist of only worker roles. You could have some on-premises transaction systems report system activity (such as each time a sale is made) to a queue in the cloud. The worker role would be there to pick up the report and merge the data into the reporting system. This allows you to keep the bulk of your application on-premises, while moving the computing-intensive backend operations to the cloud. A more robust way of doing this would be to connect the on-premises system with the cloud system using the Windows Azure platform AppFabric Service Bus, which is discussed in chapter 17.

In this chapter we talked a lot about how to work with worker roles, and how to get messages to them. One of the key methods for doing that is to use an Azure queue. We’ll work closely with queues in the next chapter.
