On 7th September 2001 email@example.com (Andrew Hamm) wrote:-
I would imagine striping is a good, easy way to exploit a 4-8 GB disk from a few years ago, but as we charge head-long into the world of 18, 40 and 70 GB disks, it's starting to get ridiculous. In fact, until Informix gets around the 2 GB limit per chunk, we'll all start to suffer from relatively poor performance.
A typical SCSI disk of today can handle between 5 and 10 parallel threads for maximum throughput. Here's a recent analysis of a disk on an L-1000 series HP:
/dev/vg00/ronline00 :  1 concurrent read thread    400 KB/sec
/dev/vg00/ronline00 :  2 concurrent read threads   444 KB/sec
/dev/vg00/ronline00 :  3 concurrent read threads   600 KB/sec
/dev/vg00/ronline00 :  4 concurrent read threads   664 KB/sec
/dev/vg00/ronline00 :  5 concurrent read threads   765 KB/sec
/dev/vg00/ronline00 :  6 concurrent read threads   774 KB/sec
/dev/vg00/ronline00 :  7 concurrent read threads   851 KB/sec
/dev/vg00/ronline00 :  8 concurrent read threads   888 KB/sec
/dev/vg00/ronline00 :  9 concurrent read threads   875 KB/sec
/dev/vg00/ronline00 : 10 concurrent read threads   400 KB/sec
The write performance curve does the same thing. See how the speed of the disk slumps when you have too much activity, but TOTAL throughput is maximised between 7 and 9 threads? Other disks tested on other boxes are a little more extreme - for example, the 1-thread performance is approx 300, rising to 1200 and then collapsing to 300 again at about 10 threads.
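To make the reading of the curve concrete: plug the measurements quoted above into a table and pick the thread count with the highest total throughput. A trivial sketch (the figures are the ones from the listing; nothing else is assumed):

```python
# Measured total read throughput (KB/sec) for each concurrent-thread
# count, taken from the /dev/vg00/ronline00 listing above.
curve = {1: 400, 2: 444, 3: 600, 4: 664, 5: 765,
         6: 774, 7: 851, 8: 888, 9: 875, 10: 400}

# The thread count that maximises total throughput.
best_threads = max(curve, key=curve.get)
print(best_threads, curve[best_threads])  # prints: 8 888
```

On this particular disk the peak is at 8 threads; the point of the argument is that throughput collapses once you go well past that.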
So: let's consider a 4-way stripe built on four 18 GB disks. Since the maximum Informix chunk size is 2 GB, each disk contributes 0.5 GB to every chunk. That implies you can get 36 chunk slices out of each disk, which means each disk will be forced to handle 36 threads on a busy system.
So where in the performance curve do you think the disk will be then? I assure you, an old test up to 25 threads did NOT see a fresh rise in the curve.
Let's say you are on disks where this worst-case performance is 300 KB/sec and the best is 1200 KB/sec at 9 threads. By the magic of numbers, you'll get a 4x performance increase from your disks if you allocate nine 2 GB chunks on each disk without striping: each disk will give you a total throughput of 1200 instead of 300. That's four times faster in anyone's language.
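The arithmetic behind both claims (36 threads per striped disk, and the 4x gain from unstriped chunks) can be checked in a few lines. This is just a restatement of the figures above, not new data:

```python
DISK_GB = 18       # capacity of each disk in the stripe set
CHUNK_GB = 2       # Informix maximum chunk size at the time
STRIPE_WIDTH = 4   # number of disks in the stripe

# In a 4-way stripe, each 2 GB chunk puts 0.5 GB on every disk,
# so a full 18 GB disk ends up backing 36 chunk slices - and thus
# potentially 36 concurrent I/O threads on a busy system.
per_disk_slice_gb = CHUNK_GB / STRIPE_WIDTH           # 0.5 GB
slices_per_disk = int(DISK_GB / per_disk_slice_gb)    # 36

# Worst-case vs best-case total throughput from the measured curve:
worst_kb_sec = 300   # ~36 threads, far past the peak
best_kb_sec = 1200   # ~9 threads, at the peak
speedup = best_kb_sec / worst_kb_sec                  # 4.0
print(slices_per_disk, speedup)  # prints: 36 4.0
```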
As your disks get bigger you have to either choose not to use all the space, or try to set up a few chunks for large, lazy tables that rarely get accessed. With this kind of allocation you get some free chunks that help to consume space on a 24 GB or larger disk. But on our system there aren't more than approx 15% of tables which can be farmed off to lazy chunks. The rest will be busy.
If you deliberately reduce your AIO VPs and cleaners to roughly 9, then you'll reduce the traffic on each disk down to 9 threads (optimum performance), but guess what? You'll have to wait to access or flush the other 27 chunk stripes. There's no free lunch...
A stripe set might give you more throughput when you are the only writer, but in a real engine you won't have only one thread. I can see a 4-way stripe exploiting disks of up to 5 GB effectively without having to think too much about spreading your tables intelligently. But until we escape the 2 GB chunk limit, striping is going to be less effective unless it is performed by hot hardware capable of hiding the engine threads behind masses of cache memory and intelligent algorithms. I've recently seen that DG CLARiiON arrays have a practically flat performance curve, which was extremely interesting. But that kind of hardware costs more than a small business can afford. If you can afford it, great.
All of this theory hinges on effective distribution of your tables so that traffic is spread evenly across the chunks. If you're sitting there monitoring your own system, then you've got the time to watch onstat -g iof and sysmaster:sysptprof and shift the tables around. I've had to go in cold to sites and reload in a day, so it's taken me several attempts to get happy with my algorithm.
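One simple way to turn those per-table I/O figures into a placement is a greedy "least-loaded disk first" assignment. This is only a sketch of the general idea, not the algorithm the post refers to; the table names and I/O counts are made up for illustration, and in practice the counts would come from onstat -g iof or sysmaster:sysptprof:

```python
# Greedy spread: assign the busiest tables first, each to whichever
# disk currently carries the least I/O load.
def spread_tables(table_io, n_disks):
    disks = [{"load": 0, "tables": []} for _ in range(n_disks)]
    for name, io in sorted(table_io.items(), key=lambda kv: -kv[1]):
        target = min(disks, key=lambda d: d["load"])
        target["tables"].append(name)
        target["load"] += io
    return disks

# Hypothetical per-table I/O operation counts.
example = {"orders": 900, "stock": 700, "invoices": 650,
           "customers": 300, "audit_log": 50}

for d in spread_tables(example, 2):
    print(d["load"], d["tables"])
```

The result balances total I/O reasonably well (1250 vs 1350 here) without any clever optimisation; a real placement would also weight table size and growth.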
With this allocation strategy on multiple disks with the optimum number of chunks, I've had great success squeezing plenty more performance from systems. Sites with checkpoints hitting 15 seconds or worse, or sites forced to use ridiculously low LRU percentages to combat checkpoint times, have ended up with checkpoints of just 2-3 seconds max, and no foreground or LRU writes. You gotta be happy with that.
Don't forget, LRU writes steal bandwidth from queries, just as foreground writes steal bandwidth from queries AND updates. So if you can get away from LRU writes, then surely you'll have better query performance, on top of the overall 4x or greater improvement this practice can give you.
Of course, this theory against striping needs backing up with hard tests. There may be a ghost in the machine which affects the results. Next time an HP with multiple disks passes beneath my fingers I'll do some damage. But until then, as John Cleese once said: A-hemm - my theory of the dinosaur - accc-hem...
Any refutation backed up with hard figures will be very welcome.
On 11th September 2013 firstname.lastname@example.org (JACQUES RENAUT) wrote:-
First off, stuff in the onstat -g lmx output isn't inherently bad; it would only suggest a performance problem if you saw a large list of waiters on a particular mutex, or large wait times. Second, the uselock mutex has to do with shared-memory communication, so it's normal to see that mutex locked while the sqlexec threads for those sessions aren't doing anything. Well, what they are doing is waiting for the next request from the client application (waiting for work).
On 21st September 2013 email@example.com (Marco Greco) wrote:-
po_lock is a spin lock used for low-level memory management.
On 23rd September 2013 firstname.lastname@example.org (John Miller iii) wrote:-
This is the lock on the global memory pool, which often deals with config, GLS, network, and general overhead items which are not owned by a user.
On 21st September 2013 email@example.com (Art Kagel) wrote:-
IB that is the memory pool latch. You can probably reduce the activity on it by enabling VP_MEMORY_CACHE_KB so that each CPU VP holds on to some of the memory from the central pool and so has to interact with the pool latch less frequently.
On 23rd September 2013 firstname.lastname@example.org (John Miller iii) wrote:-
The two mutexes which are directly affected are sh_stamp and po_lock, but VP_MEMORY_CACHE_KB speeds up malloc()/free() functionality, which can have an indirect impact on many things.
Setting this to a value of 1024 (i.e. 1 MB) can have a great impact. If you are not worried about a few extra MB of memory, then 4096 would be my starting point.
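For reference, the parameter lives in the instance's onconfig file. A minimal fragment using the starting point suggested above (the value is John's suggestion, not a universal recommendation):

```
# Per-CPU-VP private memory cache, in KB. Reduces contention on the
# global memory pool latch (sh_stamp / po_lock) by letting each CPU VP
# satisfy malloc()/free() from its own cache.
VP_MEMORY_CACHE_KB 4096
```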
Two additional notes,
1) The functionality of this has been improved so please look at the setting in new releases.
2) This feature is only available in the Enterprise Edition.
On 27th September 2013 email@example.com (Madison Pruet) wrote:-
> What might be the possibilities that will generate "RUN,BLOCK " in the State column of the output?
Usually that would mean that you have a very large transaction that the target is trying to apply, and because it is a large transaction it's blocked the network from sending anything until that one transaction is applied.
On 24th November 2013 firstname.lastname@example.org (Art Kagel) wrote:-
v12.10.xC2 no longer has to keep the buffer pool in the resident segment; it has been moved into a virtual segment of its own and can grow beyond that single segment, so the requirement for the buffer pool to be contiguous in memory is gone.
On 28th August 2014 email@example.com (Art Kagel) wrote:-
Onmonitor is gone altogether, BTW, and should not be used anyway in any version after 7.31. It can trash your ONCONFIG file.