6. Common Questions about the Engine


6.21 How do I tune KAIO?

jmiller@informix.com (John Miller)

One way KAIO can hurt performance is when it is no longer asynchronous. This can happen on some platforms when a large number of I/O operations are being requested. The nice thing is that this is usually tunable. On HP there is an environment variable called IFMX_HPKAIO_NUM_REQ, which defaults to 1000. If you are doing a lot of I/O this might not be high enough. On DEC I believe it needs to be tuned in the kernel, and for other platforms you will have to find out what the threshold is and how to tune it.

On 2nd Oct 1998 Eric_Melillo@agsea.com (Eric Melillo) writes:-

An environment variable IFMX_HPKAIO_NUM_REQ is provided to specify the number of simultaneous KAIO requests that are processed by the server at one time. The value can be set in the range of 10 to 5000, the default being 1000. If the error "KAIO: out of OS resources, errno = %d, pid = %d" occurs, consider increasing the value of IFMX_HPKAIO_NUM_REQ.

On 14th Dec 1997 david@smooth1.co.uk (David Williams) writes:-

KAIO is definitely tunable under AIX; go into smit and there is a whole section on Asynchronous I/O:-

Change / Show Characteristics of Asynchronous I/O

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  MINIMUM number of servers                          [1]                       #
  MAXIMUM number of servers                          [10]                      #
  Maximum number of REQUESTS                         [4096]                    #
  Server PRIORITY                                    [39]                      #
  STATE to be configured at system restart            defined                 +

The help for each field is :-

MINIMUM Indicates the minimum number of kernel processes dedicated to asynchronous I/O processing. Since each kernel process uses memory, this number should not be large when the amount of asynchronous I/O expected is small.

MAXIMUM Indicates the maximum number of kernel processes dedicated to asynchronous I/O processing. There can never be more than this many asynchronous I/O requests in progress at one time, so this number limits the possible I/O concurrency.

REQUESTS Indicates the maximum number of asynchronous I/O requests that can be outstanding at one time. This includes requests that are in progress as well as those that are waiting to be started. The maximum number of asynchronous I/O requests cannot be less than the value of AIO_MAX, as defined in the /usr/include/sys/limits.h file, but can be greater. It would be appropriate for a system with a high volume of asynchronous I/O to have a maximum number of asynchronous I/O requests larger than AIO_MAX.

PRIORITY Indicates the priority level of kernel processes dedicated to asynchronous I/O. The lower the priority number, the more favored the process is in scheduling. Concurrency is enhanced by making this number slightly less than the value of PUSER, the priority of a normal user process. It cannot be made lower than the values of PRI_SCHED. PUSER and PRI_SCHED are defined in the /usr/include/sys/pri.h file.

STATE Indicates the state to which asynchronous I/O is to be configured during system initialization. The possible values are defined, which indicates that the asynchronous I/O will be left in the defined state and not available for use, and available, indicating that asynchronous I/O will be configured and available for use.

Hence KAIO is very configurable under AIX 3.2.5p; I expect other versions of UNIX are similarly configurable, even if the information is hard to find!

On 17th Sep 1998 mdstock@informix.com (Mark D. Stock) writes:-

Applications can use the aio_read and aio_write subroutines to perform asynchronous disk I/O. Control returns to the application from the subroutine as soon as the request has been queued. The application can then continue processing while the disk operation is being performed.
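
A minimal sketch (not from the original post) of the aio_read()/aio_error()/aio_return() pattern described above, using the POSIX <aio.h> interface; the file name and buffer size are arbitrary, and on some platforms you may need to link with -lrt or the platform's AIO library:

  /* aio_demo.c - issue an asynchronous read and poll for its completion */
  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      static char buf[4096];
      struct aiocb cb;
      int fd = open("/etc/hosts", O_RDONLY);    /* any readable file will do */

      if (fd < 0) { perror("open"); return 1; }

      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = fd;
      cb.aio_buf    = buf;
      cb.aio_nbytes = sizeof(buf);
      cb.aio_offset = 0;

      if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

      /* control has already returned; the application could do other work
         here while a kernel AIO server performs the read */
      while (aio_error(&cb) == EINPROGRESS)
          ;   /* real code would use aio_suspend() rather than spin */

      printf("read %ld bytes asynchronously\n", (long)aio_return(&cb));
      close(fd);
      return 0;
  }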

Although the application can continue processing, a kernel process (kproc) called a server is in charge of each request from the time it is taken off the queue until it completes. The number of servers limits the number of asynchronous disk I/O operations that can be in progress in the system simultaneously. The number of servers can be set with smit (smit->Devices->Asynchronous I/O->Change/Show Characteristics of Asynchronous I/O->{MINIMUM|MAXIMUM} number of servers or smit aio) or with chdev. The minimum number of servers is the number to be started at system boot. The maximum limits the number that can be started in response to large numbers of simultaneous requests.

The default values are minservers=1 and maxservers=10. In systems that seldom run applications that use asynchronous I/O, this is usually adequate. For environments with many disk drives and key applications that use asynchronous I/O, the default is far too low. The result of a deficiency of servers is that disk I/O seems much slower than it should be. Not only do requests spend inordinate lengths of time in the queue, the low ratio of servers to disk drives means that the seek-optimization algorithms have too few requests to work with for each drive.

For environments in which the performance of asynchronous disk I/O is critical and the volume of requests is high, we recommend that:

maxservers should be set to at least 10*(number of disks accessed asynchronously)
minservers should be set to maxservers/2

This could be achieved for a system with 3 asynchronously accessed disks with:

# chdev -l aio0 -a minservers='15' -a maxservers='30'

On 3rd Oct 1998 jmiller@informix.com (John F. Miller III) writes:-

Here is the real story behind Informix/HP KAIO over the years. It is so long that for those of you who make it to the end I have included some HP KAIO tuning tips.

There have been three different versions of HP KAIO over the years, which are basically the same except for their polling methods. You can tell which version of KAIO you have by the line put in the online.log at startup:

Version 1 "HPUX Version XXX -> Using select based KAIO" Version 2 "HPUX Version XXX -> Using flag style KAIO" Version 3 "HPUX Version XXX ->Using flag/select style KAIO"

Version 1

It uses the select() system call to poll the OS for I/O completion status. This causes the kernel to use excessive amounts of system time due to the expensive nature of the select() system call on HP.

Advantages:      Very fast
Disadvantages:   As the system gets busy the select() system call gets very expensive.
Online Versions: 7.1, 7.2, 7.3

Version 2

It uses a memory address registered with the kernel to poll the OS for I/O completion status. When the server issues an I/O request through KAIO, it must poll the HP-UX asynchronous driver to determine when the request has been fulfilled. If a request has not yet been filled, the I/O thread may choose to go to sleep for 10 milliseconds before checking the request status again. Up through ODS 7.1, the I/O thread would sleep using a select() system call. This was changed in 7.2 to use the newer and lighter nanosleep() system call. Performance tests showed this was lighter on the CPU, and still gave us the desired sleep behavior. Sometime after 7.2 was released (we believe), nanosleep() was changed to correctly conform to its IEEE standard. One of the impacts is that we can no longer sleep for just 10 milliseconds. On average, the I/O thread is now sleeping for almost twice that long (when it does not need to), and this is why we believe KAIO performance has gotten so much worse.

Advantages:      Significant reduction in resources
Disadvantages:   When the system is not busy the KAIO poll thread will wait for I/O an extra 6 to 8 ms (Defect #92132). Not supported prior to HP-UX 10.10.
Online Versions: 7.2, 7.3

Version 3

It uses the memory address registered with the kernel to poll the OS for I/O completion status when the system is busy, but when the system goes idle and waits on I/O completions it will not use nanosleep() as in version 2, but will use the select() system call to wait.

Advantages:      Significant reduction in resources. Very fast. In direct response to Defect #92132.
Disadvantages:   None found to date.
Online Versions: 7.30.UC3 and above and 9.14.UC1 and above

Question:

Won't using the select() system call cause the same performance issues as KAIO version 1?

Answer:

No it will not. The problem with version 1 was that as the system became busier the amount of I/O increased, which caused an increase in the number of select() calls being made. In the new method (version 3) we only use the select() system call if the system is idle and waiting on I/O. In addition, there is a benefit of using the select() system call: the ability to interrupt the timer if an I/O completes before the timer expires. This gives early notification of the I/O.

Tuning Tips for HP KAIO

RESIDENCY

In version 7.30, for best KAIO performance, set the RESIDENT flag in the onconfig to -1. In versions prior to 7.30, set the RESIDENT flag to 1. This will reduce a significant amount of kernel locking that must be done on non-resident segments.

NOTE:

Not setting the RESIDENT flag in the onconfig will cause an additional 8KB of memory to be allocated, because the online system will require assurance that this flag is on its own OS memory page. If the RESIDENT flag is set this assurance does not have to be done.

IFMX_HPKAIO_NUM_REQ

This specifies the maximum number of simultaneous KAIO requests Online can process at one time. This can currently be set between HPKAIO_MIN_CONCURRENT (10) and HPKAIO_MAX_CONCURRENT (5000), with a default value of HPKAIO_DEF_CONCURRENT (1000). The only drawback to setting this higher is memory. Below is the formula to determine how much memory will be used:

bytes of memory = (# of cpuvps) * ((12 * IFMX_HPKAIO_NUM_REQ) + 4)

The default setting should take about 12KB per CPU VP. If you see the error "KAIO: out of OS resources, errno = %d, pid = %d" then you should consider raising IFMX_HPKAIO_NUM_REQ.
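
As a quick illustration (not part of the original post), the following sketch just plugs example numbers into the formula above; the CPU VP count is arbitrary:

  /* kaio_mem.c - memory used for KAIO requests, per the formula
     bytes = (# of cpuvps) * ((12 * IFMX_HPKAIO_NUM_REQ) + 4) */
  #include <stdio.h>

  int main(void)
  {
      int  cpuvps  = 4;        /* example: 4 CPU VPs               */
      long num_req = 1000;     /* the default IFMX_HPKAIO_NUM_REQ  */
      long bytes   = cpuvps * ((12 * num_req) + 4);

      /* with the default of 1000 this works out to roughly 12KB per CPU VP */
      printf("%d CPU VPs, IFMX_HPKAIO_NUM_REQ=%ld -> %ld bytes (~%ldKB)\n",
             cpuvps, num_req, bytes, bytes / 1024);
      return 0;
  }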

6.22 Why does ontape not notice when I change device names in onconfig?

kagel@bloomberg.com (Art S. Kagel) writes:-

This is only a problem for LTAPEDEV. Ontape does read the TAPEDEV and TAPESIZE parameters from the ONCONFIG for archives, but for logical log backups only the reserved page versions of LTAPEDEV and LTAPESIZE are used. Onmonitor modifies both the ONCONFIG and the reserved pages.

6.23 How do I reduce checkpoint times?

johnl@informix.com (Jonathan Leffler) wrote:-

If the performance nose-dives at checkpoint time, seriously consider reducing LRU_MIN_DIRTY and LRU_MAX_DIRTY. Don't try to go below LRU_MIN_DIRTY 1, LRU_MAX_DIRTY 2. The default values (30/20 in 4.00 and 60/50 in later versions) are, in my not so very humble opinion, rubbish for any system with more than about 1 MB of shared memory. Most databases are read a lot more than they are modified.

On 13th Dec 1997 david@smooth1.co.uk (David Williams) wrote

kagel@bloomberg.com (Art S. Kagel) wrote:

"I agree that additional cleaners are of limited use during normal LRU writes but they are needed for fastest checkpointing. Since at checkpoint time each chunk is assigned to another cleaner thread until the threads are exhausted, and since that one thread, as has already been pointed out, is only scheduling the actual I/Os with either the AIO VPs or KAIO VPs, that thread will block on each I/O that it schedules and single thread writes to your disks. You must have multiple cleaners, even with a single CPU VP, since the other cleaners can wake and schedule the I/Os that they are responsible for while the first cleaner is blocked waiting for I/O service...My point is that the single cleaner thread has to wait for the issued I/O to complete. It does relinquish the CPU VP to other threads so that other work is not blocked

So I recommend 1.5 * min(#chunks, #LRUs) as the number of cleaners; keep the 2-2.5 * NUMCPUs idea in mind for multiple CPUs, constrained by a maximum useful value of about 64 for multiple CPU systems."
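
As a rough illustration of that rule of thumb (not part of the original posts), the following sketch plugs in example values; the chunk and LRU queue counts are arbitrary:

  /* cleaners.c - CLEANERS = 1.5 * min(#chunks, #LRUs), with a maximum
     useful value of about 64 as quoted above */
  #include <stdio.h>

  static int minimum(int a, int b) { return a < b ? a : b; }

  int main(void)
  {
      int chunks = 20, lrus = 16;                /* example values */
      int cleaners = (int)(1.5 * minimum(chunks, lrus));

      if (cleaners > 64)
          cleaners = 64;                         /* cap for large systems */
      printf("chunks=%d LRUs=%d -> CLEANERS %d\n", chunks, lrus, cleaners);
      return 0;
  }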

david@smooth1.co.uk (David Williams) continued

I've tried this on a single-CPU machine at work; here are my findings.


  Initial Setup for tests
  =======================

  Online 7.12.UC1 on an ICL Teamserver running DRS/NX.
  
  3 chunks , 2K pagesize

  Dbspace  File  size          Device

  rootdbs  root1 5000 pages on /dev/c0d0s21     Disk 1
  dbs1     chk1  1500 pages on /dev/c0d0s21     Disk 1
  dbs2     chk2  1500 pages on /dev/c1d4s1      Disk 2

  1 CPU VP, 2 AIO VPs.

 
  Database setup via dbaccess
  ===========================

  create database dlo    # This is in the root dbs and has no logging
  create table t1
   (
   c1 char(2020)
   ) in dbs1;

  create table t2
  (
  c1 char(2020)
  ) in dbs2;


  
  Source code for test program
  ============================


  2.c

  #include <stdio.h>
  #include <sys/types.h>
  #include <time.h>

  /* doit1() and doit2() are the 4GL functions in 1.4gl below, compiled
     and linked in with this C main() */
  main()
  {
  time_t t;
  int a;

  a=doit1(0);          /* delete all rows from t1 and t2 */

  t=time(NULL);
  fprintf(stderr,"%s\n",ctime(&t));

  a=doit2(0);          /* re-insert 1000 rows into each table */
  t=time(NULL);
  fprintf(stderr,"%s\n",ctime(&t));

  return 0;
  }
  

  1.4gl 


  FUNCTION doit1()
     DATABASE dlo
     DELETE FROM t1
     DELETE FROM t2
  END FUNCTION

  FUNCTION doit2()
     DEFINE i INTEGER

     FOR i = 1 TO 1000
        INSERT INTO t1 VALUES("11")
        INSERT INTO t2 VALUES("22")
     END FOR
  END FUNCTION


  Run this program once to populate the tables...


  Disk layout
  ===========

  oncheck -pe now shows

  rootdbs = sysmaster+sysutil.
  
  chunk2 (chk2) = 
                          START LENGTH
    OTHER RESERVED PAGES  0     2
    CHUNK FREE LIST       2     1
    TBLSPACE TBLSPACE     3     50
    dlo.t1                53    1008
    FREE                  1061  439 
  

  chunk3 (chk3) = 
                          START LENGTH
    OTHER RESERVED PAGES  0     2
    CHUNK FREE LIST       2     1
    TBLSPACE TBLSPACE     3     50
    dlo.t2                53    1008
    FREE                  1061  439  


  Online configuration
  ====================

  OK so we have 2 tables on 2 disks

  We have

  NUMAIOVPS 2
  NUMCPUVPS 1
  BUFFERS 2000
  Physical Log Size 2000
  LRU_MAX_DIRTY 99
  LRU_MIN_DIRTY 98

  This is a small instance so I've forced more I/O to occur at 
  checkpoint time.
 

  Test procedure
  ==============

  1. start online                   
    # Note this is from 'cold'
    # Since the program has been run once we have 'preallocated 
    # our extents. Since we delete and reinsert the same rows
    # with no indexes, we have exactly the same I/O's occurring
    # to the same pages on both runs.

  2. Run program above once.
  3. onmode -c  # Force checkpoint
  4. onstat -m  # Note checkpoint time.

  Repeat steps 2, 3 and 4 two more times.


  Timing with CLEANERS = 1
  ========================

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  17:12:18 17:12:59 41         4/4/4/5/5/2        24 
  17:13:50 17:14:30 40         4/4/5/4/4/1        22
  17:15:33 17:16:37 64         4/4/4/5/5/1        23

  Timing with CLEANERS = 2
  ========================

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  17:18:28 17:19:04 36         4/3/3/3/3/1        17 !!!!  
  17:20:09 17:20:44 35         4/3/4/2/3/1        17
  17:21:45 17:22:20 35         4/3/3/2/2/1        15


  With 1 AIO VP (just as a control, or I forgot to set it to 2 the first
  time around!!)

  1 CLEANER

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  16:40:37 16:41:17 40         4/4/4/5/5/2        24
  16:42:22 16:43:01 39         5/4/4/4/4/1        22
  16:44:12 16:44:52 40         4/4/4/5/4/1        22

  2 CLEANERS

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  16:31:12 16:31:52 40         4/4/4/5/4/2        23
  16:32:45 16:33:24 39         4/4/5/4/4/1        23
  16:35:10 16:35:50 40         4/4/4/4/4/1        21  


  Summary
  =======

  With 1 AIO VP                 checkpoint=22-24 secs program = 39-40
  With 2 AIO VPs and 1 CLEANER  checkpoint=22-24 secs program = 39-40
  With 2 AIO VPs and 2 CLEANERS checkpoint=15-17 secs program = 35-36!!


  Conclusion
  ==========

  Checkpoints and LRU cleaning are faster with more cleaners, even on a
  single CPU VP machine.


6.24 Why does Online show up as "oninit -r" processes?

On 17th Dec 1997 kagel@bloomberg.com (Art S. Kagel) wrote:-

The engine was brought up by a cold restore (ontape -r). Ontape is only a manager of the restore process; the oninits do all the work, so after restoring the reserved pages itself ontape starts up the Informix engine to do the rest. The undocumented -r argument to oninit notifies the master oninit process to coordinate with ontape and not attempt to read anything beyond the reserved pages (the DATABASE TABLESPACE and TABLESPACE TABLESPACE are normally read at startup but they are not there yet). When the restore completes the engine is in quiescent mode, and apparently someone ran an onmode -m to bring it online. When you next shut down and restart that instance the "problem" will vanish.


6.25 What are these CDRD_ threads?

On 19th Dec 1997 satriguy@aol.com (SaTriGuy) wrote:-

>Right now we are seeing that there are almost 700 informix threads on
>the primary instance. I get this using 'onstat -u'. All these threads
>are owned by informix. The Flags field of the 'onstat -u' o/p has
>"Y--P--D" (Y - waiting on a condition, P - Primary thread for a session,
>D - daemon thread) for all the above threads. The 'pid' of the above
>sessions is 0 in the syssessions table.

On 7.2x with Enterprise Replication, there are a lot of threads that might be generated. If the majority of them start with CDRD_, then they are the "data sync" threads. If there appears to be an excess of them in comparison to the repl threads, then there could be a problem.

6.26 What does onstat -h do?

On 8th Sep 1998 Gary.Cherneski@MCI.Com (Gary Cherneski) wrote:-

Hope this helps:

# of chains   A count of the number of hash chains for which the length of
              the members of that chain is the same.

of len        The length of those chains.

total chains  The total number of hash chains (buckets) created.

hashed buffs  The number of buffer headers hashed into the hash buckets.

total buffs   The total number of buffers in the buffer pool.

6.27 How do I start up Extended Parallel Server?

On 27th Aug 1998 tschaefe@mindspring.com (Tim Schaefer) wrote:-

Regular startup every day after initialization:   xctl -b -C oninit

First and ONLY FIRST time initialization:          xctl -b -C oninit -iy

You should be on 8.2 now so as to take advantage of any potential bug fixes, etc.

6.28 How many AIO VPs should I have?

On 16th Sep 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

WHOA HORSEY! Informix's recommendation for 7.10 was 2-3 AIO VPs per chunk; for 7.14 they made some improvements and changed that to 1.5-2 AIO VPs per chunk; as of 7.21 the recommendation, after another round of optimizing the AIO VPs, is 1-1.5 AIO VPs per chunk (leaning toward 1). I have not seen any recommendation for 7.3, but with the other speedups to the 7.3 code stream I'd say that 1 AIO VP/chunk is now a maximum. I have always found that these recommendations are MORE than sufficient. I actually only count active chunks when determining AIO VP needs. For example, on one pair of servers I have 279 chunks and growing, but many of these are holding one month's data for four years of history (i.e. all September data in one dbspace made of 8-12 2GB chunks), so only one out of twelve dbspaces (and their chunks) are active and I do very well with 127 AIO VPs (io/wp 0.6 -> 1.3 with most at 0.8 -> 1.1). IBM's recommendation is excessive in the extreme.
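
As a rough illustration (not from the original post), the following sketch applies the 1-1.5 AIO VPs per active chunk guideline quoted above; the chunk count is arbitrary:

  /* aiovps.c - rule-of-thumb AIO VP range for 7.2x and later */
  #include <stdio.h>

  int main(void)
  {
      int active_chunks = 24;                        /* example value        */
      int low  = active_chunks;                      /* 1.0 per active chunk */
      int high = (int)(active_chunks * 1.5 + 0.5);   /* 1.5 per active chunk */

      printf("%d active chunks -> try %d-%d AIO VPs, then check onstat -g iov\n",
             active_chunks, low, high);
      return 0;
  }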

Run onstat -g ioq to monitor queue length by chunk and onstat -g iov to monitor the activity of the KAIO and AIO VPs (BTW you can determine which are being used this way). If almost all of your AIO VPs are performing at least 1.0 io/wp, with many showing 1.5 or more, then you need more AIO VPs; if not, not. An io/wp value below 1.0 means that an AIO VP was awakened to handle a pending I/O but before it could get to the required page another, previously busy, AIO VP finished what it was doing and took care of it. In effect, having some of your VPs below 1.0 means you could even do without those VPs except at peak. Conversely, if you have any AIO VP with <0.5 io/wp you can probably reduce the number of AIO VPs accordingly, since more than half the time these VPs awaken there is nothing for them to do; they are just wasting cycles and taking up swap space.

If it turns out, as I expect, that the VPs are not the culprit, look to your disk farm. Do you use singleton disk drives on a single controller? Work toward the ultimate setup: RAID1+0 of at least eight mirrored pairs, spread across at least 4 controllers (2 primary, 2 mirror, preferably with 4 additional backup controllers handling the same drives), 8 or 16K stripe size, 500MB-1GB cache per controller. Expensive, but no I/O problems. Anyway, RAID1+0 and a small stripe size.

6.29 How should I use mirroring + striping?

On 8th Oct 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID 0+1 (i.e. mirroring two stripe sets to each other): If a drive fails on one side of the mirror, the entire stripe set is useless and the whole stripe needs to be recovered from the mirror. Also, you can only recover from multiple drive failures if all failed drives are on the same side of the mirror. (The probability of data loss from a multiple drive failure is 50%.) This can be a real problem if you buy drives in large orders, which can come from the same manufacturing batch and may all be prone to the same failure. (Don't poo poo, it happened to us! A whole batch of IBM 2GB drives all failed at 100,000 hours of operation; we had as many as 20 drives failing on the same day until they were all replaced. We got lucky and data loss was minimal.) The performance penalty during drive rebuild can be as much as 90% of capacity. Also, until the drive is replaced, the read performance boost that mirroring gives you is completely lost.

RAID 1+0 (i.e. stripe several mirrored pairs together): If a drive fails, only that one drive needs to be rebuilt, and the read performance boost that mirroring gives you is only lost for that one pair - until the drive is replaced the other pairs continue to perform. During drive replacement the performance penalty cannot exceed about 90% of 1/Nth of capacity, where N is the number of pairs in the stripe set. In addition, you are protected against multiple drive failures even before the defunct drive is replaced, unless both drives of a mirrored pair are lost (probability 1/2N). Total rebuild time is 1/Nth of that of RAID 0+1 also.
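
As a quick illustration of the figures quoted above (not from the original post), the following sketch compares the exposure to a second drive failure for a few values of N:

  /* raid_exposure.c - once one drive has failed, a second failure loses data
     about 50% of the time under RAID 0+1, but only about 1 in 2N of the time
     under RAID 1+0 (N = number of mirrored pairs), per the figures above */
  #include <stdio.h>

  int main(void)
  {
      int n;   /* number of mirrored pairs in the RAID 1+0 stripe */

      for (n = 2; n <= 16; n *= 2)
          printf("N=%2d pairs: RAID 0+1 ~50.0%%  RAID 1+0 ~%4.1f%%\n",
                 n, 100.0 / (2 * n));
      return 0;
  }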

On 11th May 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

On the Fragmentation -vs- Striping issue I am going to disagree, in part. I believe that you should BOTH stripe across as many disks and controllers as possible (I use at least 10 drives but prefer 20-30 spread over 2 to 3 separate controllers) AND then fragment your critical tables as well. In this way you are unlikely to develop any REAL hot spots. Striping should be with a 16K stripe size so that any single Informix BIGREAD comes from a single drive while read-ahead affects several drives. Also, when creating the 2GB virtual drive devices on the stripe set, try to arrange things so that each successive virtual drive begins on a different disk and that the stripe is built such that successive drives are on different controllers, if possible. This will go a long way toward eliminating any hot spots. With this scheme only hitting the same 8 pages repeatedly could possibly create a hot spot, and Informix's BUFFER cache will prevent that from happening anyway.

Your administration load of tracking I/Os to eliminate hotspots will dwindle SIGNIFICANTLY with this scheme.

On 24th Aug 1998 Gary.Cherneski@MCI.Com (Gary D. Cherneski) wrote:-

I am a Sr. DBA managing large (> 1 TB) Informix OLTP databases. We use Veritas striping and mirroring (RAID 10). For down and dirty OLTP using indexes, it's great. We round-robin all our table data (many tables in excess of 300 M rows) and have all of our indexes in a single dbspace.

Your assumption about what your Informix SE said is correct. This configuration will not work optimally (but it will work) for DSS-type queries. In addition, large multi-table joins have somewhat diminished performance, even if they use indexes.

If your primary need is OLTP but you still must maintain a strong DSS-type environment, you might try using RAID 10 with Veritas striping/mirroring and an expression-based fragmentation scheme for the data and indexes. This would give you the ability to use multiple threads when performing DSS-type queries as well as more threads for multi-table joins. The downside is that it is harder to maintain (all of the frag expressions).

If DSS-type query performance is not a factor but you have a significant number of multi-table joins that are critical, you could try round-robin'ing your data and using expression-based fragmentation for your indexes. This would give better performance for those multi-table joins and is not as terrible to maintain.

If DSS-type query and multi-table join query performance are not a factor, I would go with the scheme we use: round-robin your data in n buckets and use one dbspace for indexes. This is by far the simplest to maintain, since you do not have to watch frag expressions as they grow, and it will deliver the OLTP performance you need.

Additionally: are you considering "hot spares", where Veritas will automatically migrate a disk slice to a spare drive/slice in the event of problems? It helps with redundancy, but watch out for the long sync times when disk trays or controllers go south.