I have next to no experience with high-volume transactional websites and recently came across this interesting question. I am interested in knowing where the bottlenecks in a Java web application would occur under high load (thousands of requests per second). If someone could give me a high-level approach to thinking about the following question, that would be great!
The only thing I've come up with is to use memcached to cache the database look-ups, but I don't know how to estimate the amount of time each request will take, and therefore how many requests per second the system might be able to handle.
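Something along these lines is what I had in mind, as a minimal sketch only; it assumes the spymemcached client and a hypothetical fetchDefinitionFromDatabase helper standing in for the real database query:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class DefinitionCache {
    private static final int TTL_SECONDS = 3600; // keep cached entries for an hour

    private final MemcachedClient memcached;

    public DefinitionCache(String host, int port) throws Exception {
        this.memcached = new MemcachedClient(new InetSocketAddress(host, port));
    }

    public String lookup(String word) {
        // Cache-aside: try memcached first, fall back to the database on a miss.
        String definition = (String) memcached.get(word);
        if (definition == null) {
            definition = fetchDefinitionFromDatabase(word); // hypothetical DB call
            if (definition != null) {
                memcached.set(word, TTL_SECONDS, definition);
            }
        }
        return definition;
    }

    private String fetchDefinitionFromDatabase(String word) {
        // Placeholder for the real database query.
        return null;
    }
}
```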
Question: Internet-scale applications must be designed to process high volumes of transactions. Describe a design for a system that must process on average 30,000 HTTP requests per second. For each request, the system must perform a look-up into a dictionary of 50 million words, using a key word passed in via the URL query string. Each response will consist of a string containing the definition of the word (100 bytes or less).
Describe the major components of the system, and note which components should be custom-built and which components could leverage third-party applications. Include hardware estimates for each component. Please note that the design should include maximum performance at minimum hardware / software-licensing costs.
Document the rationale in coming up with the estimates.
Describe how the design would change if the definitions are 10 kilobytes each.
The first thing I would do is question the numbers. English has about 170,000 words in common use. Add all the other common languages and you would have no more than a couple of million. If this is not the case, you could cache the most common words in a fast cache and the less common words in a slower cache. Even at 30K requests per second, it would take almost half an hour of traffic before every one of the 50 million words had been requested even once (50,000,000 / 30,000 ≈ 1,700 seconds).
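To make the two-tier idea concrete, here is a minimal sketch: an in-process LRU cache for the hot words in front of a slower tier. The cache size and the slowLookup helper are assumptions for illustration, and the single lock is for simplicity, not something you would run at 30K requests per second as-is:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoTierDictionary {
    private static final int HOT_CACHE_SIZE = 200_000; // roughly the "common words"

    // LinkedHashMap in access order evicts the least recently used entry.
    private final Map<String, String> hotCache =
        new LinkedHashMap<>(HOT_CACHE_SIZE, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > HOT_CACHE_SIZE;
            }
        };

    public synchronized String define(String word) {
        String def = hotCache.get(word);
        if (def == null) {
            def = slowLookup(word); // hypothetical slower tier
            if (def != null) {
                hotCache.put(word, def);
            }
        }
        return def;
    }

    private String slowLookup(String word) {
        // Placeholder: in a real system this would hit memcached, a database,
        // or an on-disk index.
        return null;
    }
}
```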
Basically, there is no point designing a large system if the numbers are not real.
On a 64-bit JVM this fits easily in memory. 50 million entries * (100 bytes + overhead) is about 10 GB (the overhead is likely to be high because you need to store the keys and index the data). A server with 12 GB of memory costs about $2,500.
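As a rough sketch of what that looks like, you could hold the whole dictionary in one JVM heap and serve it with the JDK's built-in com.sun.net.httpserver; the loadDictionary step, the port, and the ?word= query parameter here are assumptions for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;

public class InMemoryDictionaryServer {
    // ~50M entries of ~100-byte definitions plus keys and map overhead is on
    // the order of 10 GB, so run with something like -Xmx12g.
    private static final Map<String, String> DICTIONARY = new ConcurrentHashMap<>(50_000_000);

    public static void main(String[] args) throws Exception {
        loadDictionary(); // placeholder: bulk-load from a file or database at startup

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/define", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. "word=example"
            String word = query != null && query.startsWith("word=")
                    ? query.substring("word=".length()) : "";
            String def = DICTIONARY.getOrDefault(word, "not found");
            byte[] body = def.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.setExecutor(Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors()));
        server.start();
    }

    private static void loadDictionary() {
        // Placeholder: populate DICTIONARY at startup.
    }
}
```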
The problem is likely to be the number of requests. You will need multiple machines, but as other posters have suggested, the numbers are highly unlikely to be real. I don't imagine this service would be as expensive as Facebook, but assuming each application server handles somewhere in the low thousands of requests per second, you are likely to need tens to hundreds of servers to support this many requests.