Wednesday, April 2, 2008

Scaling webapps

DISCLAIMER: I am not an expert in web apps and simply have been doing some research on them. Given that, my claims and concerns may be completely unfounded and simply a result of my ignorance of web app design, if so, please let me know.

I have been doing a little bit of research on webapps and I'm not sure how I feel about the standard model. The standard model, as I understand it, works by every incoming http request results in some number of DB requests where the DB is the holder of all state information. The web app itself, whether it be in PHP, or Rails, etc is stateless. I certainly don't doubt that for a large majority of applications this works fine. Take something like the yellow pages which is a Rails application according to this. The standard model seems quite reasonable for this. You have a page that is fairly static, people are mostly doing requests for data which requires walking through some large database and the results are being displayed and not much is happening again until the user initiates another page request.

Web apps are changing though. Today, we expect our web apps to look and feel like native applications, which means snappy responses and updates of it even without the user explicitly reloading the page. We have things like AJAX and Comet-style serverpush. In the end, this means more requests are going to the http server even when the user is not doing anything. The Comet style is pretty neat but from what I can see, polling with AJAX is the most widely used way. My concern is how well the standard model is going to scale as web apps are required to act more and more like native applications. The reason for my concern is because this method is simply polling the DB for state changes on every request and in every other situation polling quite clearly has not scaled. This is the exact reason Comet exists.

For the sake of simplicity, let's imagine that the web app is AJAX with polling. Some portion of the page is going to be asking for updates every 5 seconds. The standard model would have us querying the DB every 5 seconds for that request to see if the state has changed. If you have a couple thousand people on this page that is a lot of work. If the component of the page is changing often for a majority of the people then it's not too big of a deal but if it's not changing often for a majority we are doing a lot of worthless DB calls. I'm sure we would design our database so that this call is hopefully very lightweight but how well does that really scale in the end?

Now, the reason we want our wep app (such as something in Rails) to be stateless is because for each request we can be pushed around to a separate VM. So if we hold onto session state information there is a good chance we might not even get a chance to use it on the next request. On top of that, if you are load balancing between several hosts, you may not even be in a good place to share state information between VMs.

I'm sure almost all of this is not news to anyone who has made it this far in the post. What can be done to make a webapp scale better? I'm not sure but here is my suggestion and hopefully someone with more experience than myself can come back and say if it's a horrible idea or not. Let's say I want to make a web app that will be handling something like instant messaging such as google talk or meebo (which are both using Comet).

For starters, I want this to be fairly real time, when I send a message to someone, I want it to get to them ASAP. Secondly, there will be a lot of interaction between users. Clearly, page-by-page viewing is not going to work here. People can't be refreshing their page every few seconds to see if there is an update. AJAX with polling or Comet are clearly your two choices currently. How should this look after the HTTP request comes in though?

Here is how I am suggesting it should:
Each request should have some sort of session ID. Each session ID will be associated with a process. By process here, it could be an Erlang process or some OS process written in some other language, whatever. The point here is, for a particular session ID, its request will get forwarded to the same process every time for the life time of its session. This way, the process can store all the state information. We don't have to do worry about sharing state information in something like memcache. The process would have some sort of timeout value so they die off eventually. At worst case, we are back to the standard model where if a user does multiple requests that happen to have a pause between them longer than the timeout we now have to initiate a new process and it has to do the DB calls to initiate itself. In the best case though, we are consistently getting sent to the same process which is holding onto our data. The upside to this method is the process can also do things specific for that user such as opening up other connections. For instance, in the instant messaging example, a user logs in, they get a session ID and a process created for them somewhere that will be mapped to this session ID. The process opens up a connection to an event server for that user so it can listen for IMs and push IM's out. We now have an application that is quite event based on all sides of it. We won't be hitting the DB too often. Given that we don't really care where this process lives, we can also scale it out to multiple machines and not have to worry about replicating the same data over many machines because we don't know where the users request will go to.

Downsides?
Certainly. Clearly if we have 2000 people using our IM application at once, we need to have 2000 processes a live. If we wrote this in Erlang where each process for a session ID maps to an Erlang process (sorry about the terminology here all)? That's childsplay. We could host this application on one machine! But not everyone wants to write things in Erlang, so what about them? If I were in this situation, I would probably have some machines each running some amount of applications that will be doing some I/O multiplexing to handle this. So you would have some amount of OS processes and each one can handle some amount of session ID's. Pretty standard for something that isn't Erlang but needs to handle a bunch of things at once. If you are into Python, Twisted comes to mind. I'm sure other languages have their own way of doing this.

Another edge case here, and I think this is probably not too hard to deal with, is what if an event comes in (such as an IM) and the process times out due to lack of activity? You could have each event have a timeout and if it is not ACK'd in that time it gets saved to a DB and the next process that is created for that user picks it up as part of its initiation.


To reiterate, I do not have much experience in webapps, I simply have been doing some research and this is the impression that I have gotten. Are my concerns valid? Is the system I described what people are already going to or is it broken? How would someone write Meebo or google talk? Let me know.