Nowadays, 5 simultaneous uploads of files in the 250 MB to 3 GB range may be considered a medium to low load, but it is also the most typical load for web services that are not run by big players like Twitter or Google. Let’s see what happened in one such real-world case.
What went right
It really helps to have a detailed and clear log of your application (including all its scripts), so you can reconstruct a problem scenario and troubleshoot it easily.
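As a minimal sketch of what such a log could look like, the snippet below writes one timestamped, key=value line per processing step, so that a failed upload can be reconstructed by filtering on its identifier. The names `build_logger`, `log_step`, the `upload=` and `step=` fields, and the logger name are hypothetical choices for illustration, not part of the original application.

```python
import logging


def build_logger(name, logfile=None):
    """Return a logger that timestamps every record.

    Writes to `logfile` if given, otherwise to the console (stderr).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_step(logger, upload_id, step, **details):
    """Log one processing step as a single greppable key=value line."""
    extra = " ".join(f"{key}={value}" for key, value in sorted(details.items()))
    line = f"upload={upload_id} step={step} {extra}".rstrip()
    logger.info(line)
    return line


log = build_logger("uploads")
log_step(log, "u1", "received", size_mb=250)
log_step(log, "u1", "conversion_started")
```

With every script using the same format, searching the logs for `upload=u1` gives the full timeline of one upload across the application and its helper scripts.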
Once the problem is diagnosed, document it either in the source code or in the user interface. Sometimes users can’t reach their goal because it is not clear enough how they can get there, or because the user interface is not easy enough to use.
Think of all processes as long processes. While on a lightly loaded server a compute-intensive process may take 15 seconds, under medium load it might take much longer. And users, sensibly, assume that if a page needs more than 5 seconds to load, something is wrong. So, at the very least, provide a progress bar for every process. This way users will be able to make more accurate assumptions about the state of the process. The task progress should also be available from the user’s landing page, so they can check it easily even if the status window vanishes.
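One way to make progress visible from more than one page is to keep it in a shared place, keyed by task, instead of inside the status window itself. The sketch below is only an illustration under that assumption: `ProgressRegistry` and its methods are hypothetical names, and a real deployment would probably persist the values (for example in the database) rather than in process memory.

```python
import threading


class ProgressRegistry:
    """Thread-safe store of task progress, readable from any page."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def start(self, task_id, total_steps):
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        with self._lock:
            self._tasks[task_id] = {"done": 0, "total": total_steps}

    def advance(self, task_id, steps=1):
        with self._lock:
            task = self._tasks[task_id]
            # Never report more than 100%, even if a worker over-counts.
            task["done"] = min(task["total"], task["done"] + steps)

    def percent(self, task_id):
        with self._lock:
            task = self._tasks[task_id]
            return 100 * task["done"] // task["total"]


registry = ProgressRegistry()
registry.start("upload-1", total_steps=4)
registry.advance("upload-1")
print(registry.percent("upload-1"))
```

Both the status window and the landing page can then call `percent` with the same task identifier, so closing one of them does not lose the information.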
Wrap the compute-intensive tasks in Python scripts. Scripts are easier to modify and deploy, especially in production.
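A minimal sketch of such a wrapper script, assuming the heavy task is an external command: it runs the command with a time limit and reports success, exit status and output in one small dictionary. The function name `run_task`, the shape of the returned dictionary and the example command are assumptions made for illustration.

```python
import subprocess
import sys


def run_task(command, timeout_seconds):
    """Run an external command and summarize how it ended."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode output as text
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return {"ok": False, "reason": "timeout", "output": ""}
    return {
        "ok": completed.returncode == 0,
        "reason": f"exit {completed.returncode}",
        "output": completed.stdout.strip(),
    }


# Stand-in for a real compute-intensive command.
result = run_task([sys.executable, "-c", "print('done')"], timeout_seconds=30)
print(result)
```

Because the time limit and the command line live in the script, they can be adjusted in production without rebuilding or redeploying the application itself.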
Know your application server, so you can fine-tune its settings (for example, memory usage, concurrency…).
Listen to your users and try to offer them as much help as they need.
A good monitoring system (like Nagios) with the appropriate probes (related to the specific needs of your application) will notify you of the failures as they happen, so usually you will be able to solve them before the users notice.
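As an illustration of an application-specific probe, the sketch below checks the free space of an upload directory and follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with a one-line status message. The function names, the thresholds and the idea of watching the upload directory are assumptions for the example, not something taken from the original setup.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_percent, warn_below=20.0, crit_below=10.0):
    """Map a free-space percentage to a (status code, message) pair."""
    if free_percent < crit_below:
        return CRITICAL, f"DISK CRITICAL - {free_percent:.1f}% free"
    if free_percent < warn_below:
        return WARNING, f"DISK WARNING - {free_percent:.1f}% free"
    return OK, f"DISK OK - {free_percent:.1f}% free"


def check_upload_dir(path, warn_below=20.0, crit_below=10.0):
    """Probe the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, f"DISK UNKNOWN - {error}"
    free_percent = 100.0 * usage.free / usage.total
    return classify_free_space(free_percent, warn_below, crit_below)


# A real plugin would print the message and pass the code to sys.exit().
code, message = check_upload_dir(".")
print(code, message)
```

A probe like this warns before large uploads fill the disk, which is the kind of failure a generic "is the server up" check would not catch.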
Automate all the tasks you can. Scripts are the best documentation of the task steps, and they ensure everything goes smoothly in all the environments (development, testing, production…). Three obvious examples are the build process (with Ant it is pretty easy to automate JSP compilation, JAR builds, deployment…), the database queries (those that are not part of the application itself) and backups (you have backups of everything, don’t you?).
(Obviously) Keep everything under version control. Your application will probably depend on more than its Java source code, and if any of these dependencies is missing or wrong the application may fail too. This includes the packages and scripts your application relies on (so you can revert a package update if needed).
The best documentation is clean code. But it also helps to have diagrams of key parts at hand, like the database E/R model.
What could go better
Sometimes there is not much time to integrate an automated web testing system that really matches your application, but keep in mind that it is the only real way to test it. «Manual tests» also take time and cover only so much.
One important part of the testing protocol is the load test. When the application goes to production, it might face far more demand than you initially expected. In web applications, it is also important to factor in the client’s connection speed. In your local environment everything might seem perfect, and then a user with a really slow connection (yes, there are still users who suffer 5 KB/s connections) runs into all kinds of timeouts (the web server, the application server, the application itself…).
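A quick back-of-the-envelope calculation shows why slow clients matter for uploads of this size. The sketch below assumes 1 MB = 1024 KB and a constant transfer rate; the 300-second timeout is a hypothetical value chosen for illustration, not a setting from the original servers.

```python
def upload_seconds(size_mb, rate_kb_per_s):
    """Time to upload `size_mb` megabytes at a constant rate in KB/s."""
    return size_mb * 1024 / rate_kb_per_s


def exceeds_timeout(size_mb, rate_kb_per_s, timeout_s):
    """True if the upload cannot finish before the timeout expires."""
    return upload_seconds(size_mb, rate_kb_per_s) > timeout_s


seconds = upload_seconds(250, 5)  # smallest file, slowest client
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
print(exceeds_timeout(250, 5, timeout_s=300))  # hypothetical 300 s timeout
```

Even the smallest file in the range takes around 14 hours at 5 KB/s, so any timeout along the chain that is measured in minutes will be hit long before such an upload completes.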
Even if you have a good server, eventually it will buckle under the pressure. The only real weapon is to distribute your application across many servers. Then, for bonus points, you may add a load balancer. Sadly, the distributed approach must be planned from the beginning of the development process. If you have already reached the maintenance phase, it will probably be quite hard to refactor the code from a single instance to multiple instances.
If possible, another good idea is to hand the processing off to separate computing servers. Usually it is not that hard to achieve this kind of load distribution. And the sooner you integrate this idea, the better tested it will be.
Ideally, there should not be any single points of failure: replicate the web server, the file server, the data center itself… It is far from easy to set up and maintain such an architecture, but as the old Chinese saying goes, «there’s always a tiny step» that you can take to get closer to the ideal.