One of the most important aspects of any architecture is performance. Performance tuning is an art unto itself, with entire volumes dedicated to the tools and methods of tuning a UNIX operating system. This column will not go into detail about the tools themselves (although they will be mentioned); instead it covers considerations, methods and practices: more of a why-and-what sort of approach.
All commercial software aside (for there is too much to catalogue here) the most common UNIX performance monitoring tools can be summed up in a list:
Most of the aforementioned tools can be found on nearly every system. Surprisingly, the man pages for them are very good from platform to platform (I have personally used the man pages for the tools that were on GNU/Linux, HP, SCO and Sun systems) and provide more than enough information to get you started [ 1 ].
Performance tuning most often comes in the form of a bottleneck creeping up when you are not looking. Despite all of our efforts to discover and eliminate performance issues early, most performance-related problems cannot be tracked down until they are readily apparent in some fashion [ 2 ]. In a way this is paradoxical, and it might lead one to conclude that preemptively striking against possible bottlenecks is a waste of time. Not true: there will be occasions when you do beat the bottleneck to the punch and, take my word for it, the reward is well worth the effort. Still, putting that aside, we are most often under duress when performance tuning. That is not the best situation for productive work, but it is the most likely scenario. As you have heard and seen so many times before, when a bottleneck pops up it is important to remain calm and not let the stress of others (users, managers, whoever) interfere with what you are doing.
Duress can lead you to make hasty decisions and to get caught up in finger pointing (most often defensively) or other nefarious activities that take away valuable time that should be spent combating the problem.
Contradiction in terms? I think not. Efficient bottlenecks pervade modern computing to the point of disgust. Before I spin off into a rant, allow me to explain. An efficient bottleneck is one which is hidden by hardware. While this may not seem to make sense at first, an example puts it in the correct light. Let us say, for the sake of argument, that you have just installed a new volume group and a logical volume. It is RAID level 5 and several disks in size. After a while this new volume starts to develop performance problems. You spend time dutifully hunting the problem and find two culprits. One is natural growth: the number of users has increased nearly fourfold over a very short period of time, which is a simple I/O problem. The other is a poorly written program (written internally, purchased, downloaded, whatever). For the first problem, there is simply nothing you can do about the growth itself; your solution (for argument's sake) is simply to load-share.
That is acceptable.
After the fix is in place, both bottlenecks disappear. You are happy; you can now move on to the next problem at hand, right?
Make no mistake about it, you still have a problem. Your bad program is still a bottleneck and chances are you will see it again sooner than the added user problem. Now, instead of having two serious bottlenecks, you have eliminated one and made the other bottleneck very efficient. Letting it lie is wrong. Attack it while it is not readily apparent.
Hardware tossing, an attitude which is also pervasive in modern technology, is related to efficient bottlenecks. The "chuck hardware at it" train of thought works something like so:
"Get rid of the performance problem initially by upgrading the hardware. If a problem recurs, we might hold the hardware vendor responsible."
While that may be an extreme case, it is not far from the truth. Thinking that way is inherently unscientific and merits being dragged out into the street and shot. Yes, there comes a time in all systems when the system is the bottleneck. Most often, however, a bottleneck can be alleviated with hard work and deep thought, not senseless expenditure. The real crux of the matter is proper planning and life cycle management. Knowing when a system will require an upgrade means bottlenecks should be anomalies not related to slow hardware.
Okay, an example: recently I finished transferring the last of a set of systems to newer architectures overall (hardware, software and even firmware). While it was somewhat overdue, the transfer was natural, in accordance with the growth patterns of the systems. I also know that three years from the transfer those very systems will require some minor upgrades, again in accordance with natural growth. None of the bottlenecks I have encountered was due to slow or outdated hardware; most often the cause was misuse or bad programming (discussed below).
Preventative bottleneck hunting comes in the form of scripting and cron. Simply put, schedule some sort of script that runs your preferred reporting tools and dumps them into a set of files. Examine those files periodically to see if there are any trends.
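As a rough illustration of the scheduled reporting script described above, here is a minimal sketch. The log directory, the function name collect_snapshot and the choice of tools (uptime, vmstat, df) are assumptions made for the example, not something this column prescribes; substitute your own preferred reporting tools.

```shell
#!/bin/sh
# Hypothetical collector: appends a timestamped snapshot from a few
# common reporting tools to one log file per day.
# LOG_DIR and the tool selection are assumptions for illustration.
LOG_DIR="${LOG_DIR:-/tmp/perflogs}"

collect_snapshot() {
    mkdir -p "$LOG_DIR" || return 1
    out="$LOG_DIR/perf-$(date +%Y%m%d).log"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        # Run only the tools that actually exist on this platform.
        if command -v uptime >/dev/null 2>&1; then uptime; fi
        if command -v vmstat >/dev/null 2>&1; then vmstat 1 2; fi
        if command -v df >/dev/null 2>&1; then df -k; fi
    } >> "$out" 2>&1
    # Print the log path so a caller can find the file.
    echo "$out"
}

collect_snapshot
```

Saved under a hypothetical path such as /usr/local/bin/perfsnap.sh, it could be scheduled with a crontab entry along the lines of "0,10,20,30,40,50 * * * * /usr/local/bin/perfsnap.sh" to take a snapshot every ten minutes. Because each snapshot starts with a "===" timestamp line, the daily files are easy to skim or grep when looking for trends.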
On the other side of the coin comes the aforementioned, more common "at the worst possible time" bottleneck (although according to most users, any time is the worst possible time). You will want to have a generalized procedure ready. Mine is pretty simple: I quickly type top to see if the problem is anything blatantly obvious, because in my particular environment it usually is. Next I check shared memory queues. After that I grep for a particular process in the process table. If none of those efforts finds it, I usually rethink the symptoms, then start up vmstat depending on my conclusions.
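The procedure above might be captured in a small script so that it is ready when the pressure is on. What follows is only a sketch under several assumptions: the function names, the default process pattern and the use of ipcs for the shared memory check are illustrative choices, and the top flags shown are the procps (GNU/Linux) batch-mode syntax, which differs on other platforms.

```shell
#!/bin/sh
# Hypothetical first-response triage, following the order in the text:
# top, then IPC/shared memory, then the process table, then vmstat.
PROC_PATTERN="${1:-sh}"   # process name to look for (an assumption)

# True if the named command is available on this system.
have() { command -v "$1" >/dev/null 2>&1; }

triage() {
    echo "== top (one batch iteration) =="
    # -b -n 1 is procps syntax; adjust for your platform's top.
    if have top; then top -b -n 1 2>/dev/null | head -n 15; fi

    echo "== IPC: shared memory, message queues, semaphores =="
    if have ipcs; then ipcs; fi

    echo "== process table: $PROC_PATTERN =="
    # Filter out the grep process itself from the listing.
    ps -ef | grep "$PROC_PATTERN" | grep -v grep

    echo "== vmstat =="
    if have vmstat; then vmstat 1 2; fi
}

triage
```

Running it with a process name as the first argument narrows the process table step to that process. The steps are deliberately cheap and ordered from the most obvious to the most general, so the script can be stopped as soon as the culprit shows up.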
Undoubtedly the most difficult part of performance tuning is not mastering the tools, handling the pressure or accepting the time it may take to find the problem; it is properly identifying the performance problem. Many times a system will react to a problem in a way that points to a completely unrelated area. For instance, a system that appears to be thrashing may simply be I/O bound in general, or the other way around. A swapfile may be stressing a filesystem, or the other way around. A network interface might be tying up the CPU with messages, or the CPU may be bound up performing calculations, thus slowing down the rest of the machine and the network interface. Take your pick. It can sometimes be impossible to properly identify a problem, but you will lose nothing by taking your best shot at it.
Taking the time to create monitoring scripts, read reports, track trends and so forth is all part of life cycle development. While this is different from performance tuning and monitoring, the two can easily be linked together. When bottlenecks do crop up unexpectedly, don't sweat it: with patience and a little critical thinking, you can always track down the offender.