6. Common Questions about the Engine

6.86 Should I use striping?

On 7th September 2001 ahamm@sanderson.net.au (Andrew Hamm) wrote:-

I would imagine striping is a good, easy way to exploit a disk of 4-8GB from a few years ago, but as we charge headlong into the world of 18, 40 and 70GB disks, it's starting to get ridiculous. In fact, until Informix gets around the 2GB limit per chunk, we'll all start to suffer from relatively poor performance.

A typical SCSI disk of today can handle between 5 and 10 parallel threads for maximum throughput. Here's a recent analysis of a disk on an L-1000 series HP:

/dev/vg00/ronline00 :  1 concurrent read threads        400  KB/sec.
/dev/vg00/ronline00 :  2 concurrent read threads        444  KB/sec.
/dev/vg00/ronline00 :  3 concurrent read threads        600  KB/sec.
/dev/vg00/ronline00 :  4 concurrent read threads        664  KB/sec.
/dev/vg00/ronline00 :  5 concurrent read threads        765  KB/sec.
/dev/vg00/ronline00 :  6 concurrent read threads        774  KB/sec.
/dev/vg00/ronline00 :  7 concurrent read threads        851  KB/sec.
/dev/vg00/ronline00 :  8 concurrent read threads        888  KB/sec.
/dev/vg00/ronline00 :  9 concurrent read threads        875  KB/sec.
/dev/vg00/ronline00 :  10 concurrent read threads       400  KB/sec.

The write performance curve does the same thing. See how the speed of the disk slumps when you have too much activity, but TOTAL throughput is maximised at 7-9 threads? Other disks tested on other boxes are a little more extreme: for example, the 1-thread performance is approx 300, rising to 1200 and then collapsing to 300 again at about 10 threads.
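As a minimal sketch of how to read the curve, the figures from the table above can be dropped into a few lines of Python to find the peak and the range of thread counts that stay close to it. The 5% tolerance and the function names are assumptions made for illustration, not part of the original analysis.

```python
# Read throughput measured above: concurrent threads -> KB/sec.
measured = {1: 400, 2: 444, 3: 600, 4: 664, 5: 765,
            6: 774, 7: 851, 8: 888, 9: 875, 10: 400}

def best_thread_count(curve):
    """Return (threads, KB/sec) at the point of maximum total throughput."""
    threads = max(curve, key=curve.get)
    return threads, curve[threads]

def near_peak(curve, tolerance=0.05):
    """Thread counts whose throughput is within `tolerance` of the peak.

    The 5% default is an arbitrary illustrative choice.
    """
    peak = max(curve.values())
    return sorted(t for t, kb in curve.items() if kb >= peak * (1 - tolerance))

best = best_thread_count(measured)
sweet_spot = near_peak(measured)
print("peak:", best, "near-peak thread counts:", sweet_spot)
```

With these figures the peak is at 8 threads and the near-peak range is 7 to 9 threads, which is the sweet spot described above.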

SO: let's consider a 4-way stripe built on four 18GB disks. Since the maximum Informix chunk size is 2GB, each disk will contribute 0.5GB to each chunk. That implies you can get 36 chunks out of each disk, which means each disk will be forced to handle 36 threads on a busy system.

So where in the performance curve do you think the disk will be then? I assure you, an old test of up to 25 threads did NOT see a fresh rise in the curve.

Let's say you are on disks where this worst-case performance is 300, and the best is 1200 at 9 threads. By the magic of numbers, you'll get a 4x performance increase from your disks if you allocate nine 2GB chunks on each disk without striping. Each disk will give you a total throughput of 1200 instead of 300. That's 4 times faster in anyone's language.
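The arithmetic behind the 36-chunk and 4x figures can be checked with a short sketch. It assumes, as the example does, one busy I/O thread per chunk and the 2GB chunk limit; the function name is hypothetical.

```python
CHUNK_LIMIT_GB = 2  # maximum Informix chunk size at the time of writing

def chunks_per_disk(disk_gb, stripe_width, chunk_gb=CHUNK_LIMIT_GB):
    """How many chunks touch one disk when each chunk is striped
    across `stripe_width` disks (1 means no striping)."""
    slice_gb = chunk_gb / stripe_width  # share of one chunk held on each disk
    return int(disk_gb // slice_gb)

striped = chunks_per_disk(18, 4)    # 4-way stripe over 18GB disks
unstriped = chunks_per_disk(18, 1)  # plain 2GB chunks on one 18GB disk

# Throughput figures quoted above: about 1200 near 9 threads,
# about 300 once the disk is overloaded.
speedup = 1200 / 300

print(striped, unstriped, speedup)
```

The striped layout puts 36 chunks on every disk, far past the peak of the curve, while the unstriped layout puts 9 chunks on each disk, right at the peak.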

As your disk gets bigger you have to either choose not to use all of it, or try to set up a few chunks for large, lazy tables wot rarely get accessed. With this kind of allocation, you get some free chunks that would help to consume space on a 24GB or larger disk. But with our system, no more than approx 15% of tables can be farmed off to lazy chunks. The rest will be busy.

If you deliberately reduce your AIOs and cleaners to 9 (roughly), then you'll reduce the traffic on each disk down to 9 threads (optimum performance), but guess what? You'll have to wait to access or flush the other 27 chunk stripes. There's no free lunch...

A stripe-set might give you more throughput when you are the only writer, but in a real engine, you won't have only one thread. I can see a 4-way stripe exploiting up to 5GB disks effectively without having to think too much about spreading your tables intelligently. But until we escape the 2GB chunk limit, striping is going to be less effective unless it is performed by hot hardware which is capable of hiding the engine threads behind masses of cache memory and intelligent algorithms. I've seen recently that DG Clariion arrays have a practically flat performance curve, which was extremely interesting. But that kind of hardware costs more than a small business can afford. If you can afford it, great.

All of this theory hinges on effective distribution of your tables so that the traffic is evenly spread across the chunks. If you are sitting there monitoring your own system then you've got the time to watch onstat -g iof and sysmaster:sysptprof and shift the tables around. I've had to go in cold to sites and reload in a day, so it's taken me several attempts to get happy with my algorithm.
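The author's own allocation algorithm is not given here. Purely as an illustration of the idea of spreading traffic evenly, the sketch below uses a simple greedy heuristic: take the busiest table first and always place it on the chunk (or the dbspace built on it) with the least traffic so far. The table names and I/O counts are made up; in practice they would come from something like the per-table counters in sysmaster:sysptprof.

```python
import heapq

def spread_tables(table_io, n_chunks):
    """Greedy balancing: heaviest table first, onto the lightest chunk.

    table_io maps table name -> observed I/O count (hypothetical figures).
    Returns (placement, loads), both keyed by chunk index.
    """
    heap = [(0, i) for i in range(n_chunks)]  # (current load, chunk index)
    heapq.heapify(heap)
    placement = {i: [] for i in range(n_chunks)}
    for name, io in sorted(table_io.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)       # lightest chunk so far
        placement[idx].append(name)
        heapq.heappush(heap, (load + io, idx))
    loads = {idx: load for load, idx in heap}
    return placement, loads

# Made-up I/O counts, for illustration only.
table_io = {"orders": 900, "items": 700, "stock": 400,
            "customer": 300, "audit": 100}
placement, loads = spread_tables(table_io, 2)
print(placement, loads)
```

A greedy pass like this ignores table sizes and the 2GB chunk limit, so it is only a starting point for the kind of manual shuffling described above.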

With this allocation strategy on multiple disks with the optimum number of chunks, I've had great success squeezing plenty more performance from systems. Sites with checkpoints hitting 15 seconds or worse, or sites which were forced to use ridiculously low LRU percentages to combat checkpoint times, have ended up with checkpoints of just 2-3 seconds max, and no foreground or LRU writes. You gotta be happy with that.

Don't forget, LRU writes steal bandwidth from queries, just as foreground writes steal bandwidth from queries AND updates. So if you can get away from LRU writes then surely you'll have better query performance, not including the overall 4x or greater improvement this practice can give you.

Of course, this theory against striping needs backing up with hard tests. There may be a ghost in the machine which affects the results. Next time an HP with multiple disks passes beneath my fingers I'll do some damage. But until then, as John Cleese once said: A-hemm - my theory of the dinosaur - accc-hem...

Any refutation backed up with hard figures will be very welcome.

6.87 What should I do if I have locked mutexes?

On 11th September 2013 jrenaut@us.ibm.com (JACQUES RENAUT) wrote:-

First off, stuff in the onstat -g lmx output isn't bad by default. It would only reflect a possible performance problem if you saw a large list of waiters on a particular mutex, or large wait times. Second, the uselock mutex has something to do with shared memory communication, so it's normal to see that mutex locked while the sqlexec threads for those sessions aren't doing anything. Well, what they are doing is waiting for the next request from the client application (so waiting for work).

On 21st September 2013 marco@4glworks.com (Marco Greco) wrote:-

po_lock is a spin lock used for low-level memory management.

On 23rd September 2013 miller3@us.ibm.com (John Miller iii) wrote:-

This is the lock on the global memory pool which often deals with config, gls, network, and general overhead items which are not owned by a user.

On 21st September 2013 art.kagel@gmail.com (Art Kagel) wrote:-

I believe that is the memory pool latch. You can probably reduce the activity on it by enabling VP_MEMORY_CACHE_KB so that each CPU VP holds on to some of the memory from the central pool and so has to interact with the pool latch less frequently.

On 23rd September 2013 miller3@us.ibm.com (John Miller iii) wrote:-

The two mutexes which are directly affected are sh_stamp and po_lock, but VP_MEMORY_CACHE_KB speeds up malloc()/free() functionality, which can have an indirect impact on many things.

Setting this to a value of 1024 (i.e. 1MB) can have a great impact. If you are not worried about a few MB then setting this to 4096 would be my starting point.

Two additional notes:

1) The functionality of this has been improved, so please look at the setting in new releases.

2) This feature is only available with the Enterprise Edition.

6.88 How do I interpret the output of onstat -g nif?

On 27th September 2013 mpruet@us.ibm.com (Madison Pruet) wrote:-

> What might be the possibilities that will generate "RUN,BLOCK " in the State column of the output?

Usually that would mean that you have a very large transaction that the target is trying to apply, and because it is a large transaction it's blocked the network from sending anything until that one transaction is applied.

6.89 How are shared memory segments managed?

On 24th November 2013 art.kagel@gmail.com (Art Kagel) wrote:-

v12.10.xC2 no longer has to keep the buffer pool in the resident segment. It has been moved into a virtual segment of its own and can grow beyond that single segment, so the requirement for the buffer pool to be contiguous in memory is gone.

6.90 Should I use onmonitor to manage engine parameters?

On 28th August 2014 art.kagel@gmail.com (Art Kagel) wrote:-

Onmonitor is gone altogether BTW and should not be used anyway in any version after 7.31. It can trash your ONCONFIG file.