6. Common Questions about the Engine


6.58 Why should I not use RAID 5?

On 1st December 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

There are two problems with RAID5. The first is performance, which is the one most people notice; if you can live with write throughput that is 50% of the equivalent RAID0 stripe set then that is fine. The performance hit comes about because RAID5 ONLY reads the one drive containing the requested sector, leaving the other drives free to return other sectors from different stripe blocks. This is why RAID5 is preferred to RAID3 or RAID4 for filesystems: it improves small random read performance. However, since the parity and the balance of the stripe block were not read, if you rewrite the block (which databases do far more frequently than filesystems) the other drives must all be read, a new parity calculated, and then both the modified block and the parity block written back to disk. This READ-WRITE-READ-WRITE cycle for each modified block is the reason RAID5 is so poor in terms of write throughput. Large RAID controller caches and controller-firmware-level RAID implementations alleviate the problem somewhat, but not completely, and write performance still hovers at around half of what a pure stripe (RAID0) would get.
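
Editor's note: to make the arithmetic concrete, here is a minimal sketch of the physical I/O behind a single logical block rewrite under the read-modify-write sequence described above. The 5-drive stripe width is an assumption for illustration, and controllers that recompute parity from the old data and old parity rather than re-reading the whole stripe will do somewhat better:

# Physical I/Os per logical block rewrite, per the description above
# (illustrative only; n=5 is an assumed stripe width).
nawk -v n=5 'BEGIN {
  raid5_reads  = n - 2;   # re-read the other data blocks in the stripe
  raid5_writes = 2;       # write the modified block plus the new parity block
  printf("RAID5 (%d drives): %d reads + %d writes per block rewritten\n",
         n, raid5_reads, raid5_writes);
  printf("RAID10: 0 reads + 2 writes (one to each drive of the mirrored pair)\n");
}'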

The second problem, despite what others have said, IS a FUNDAMENTAL problem with the design of RAID5, which various implementors have tried to correct with varying levels of success. The problem is that if a drive fails slowly over time (known as partial media failure, where periodically a sector or two goes bad), this is NOT detected by RAID5's parity and so is propagated to the parity when that sector is rewritten. That means that if another drive fails catastrophically, its data will be rebuilt using damaged parity, resulting in two sectors of garbage. This may not even be noticed for a long time, as modern SCSI drives automatically remap bad sectors to a set of sectors set aside for the purpose, but the corrected error is NOT reported to the OS or the administrators. Over time, if the drive is going, it will run out of remap sectors and will have to begin returning data reconstructed from the drive's own ECC codes.

Eventually the damage will exceed the ECC's ability to rebuild a single bit error per byte and will return garbage.

RAID3 and RAID4 are superior in both areas. In both, all drives are read for any block, which improves sequential read performance (Informix Read Ahead depends on sequential read performance) over RAID5, and parity can be (and in most implementations IS) checked at read time so that partial media failure problems can be detected. Write performance is approximately the same as RAID0 for large writes or smaller stripe block sizes. One problem with early implementations of RAID3/4 was slow parity checking, since it has to be calculated for every read and every write. Modern controller-based RAID systems use the on-board processor on the SCSI controller to perform the parity checks without impacting system performance by tying up a system CPU to check and produce parity. These RAID levels require exactly the same number of drives as RAID5.

RAID10 provides the best protection and performance, with read performance exceeding any other RAID level (since both drives of a mirrored pair can be reading different sectors in parallel) and write performance closest to pure striping. Indeed, in a hardware/firmware-implemented RAID10 array with on-board cache, apparent write throughput can exceed RAID0 for brief periods because the two drives of each pair are written to independently, though the gain is not sustainable over time.

A third problem with ALL of RAID3/4/5, from which RAID10 does not suffer, is multiple drive failure. (Ever get a batch of 200 bad drives? We have!) If one drive in a RAID3/4/5 array fails catastrophically you are at risk of complete data loss if ANY of the remaining 4 (or more) drives should fail before the original failed drive can be replaced and rebuilt. With RAID10, since it is made up as a stripe set of N mirrored pairs, when a drive fails you are only at risk of complete data loss if that one drive's particular mirror partner should fail. Make each mirrored pair from drives selected from different manufacturers' lots and the probability of this happening becomes vanishingly small.

Fourth problem: during drive rebuild, the performance of a RAID3/4/5 array (or of a RAID01 mirrored stripe set) can degrade by as much as 80%! Some RAID systems let you tune the relative priority of rebuild versus production to reduce the performance hit to as low as about 40% degradation, but this increases the recovery time, which increases the number of production requests that are degraded and increases the risk of the previous problem, a second drive failure. With RAID10, since only one drive is involved in mirror recovery, the array's performance is degraded by a maximum of 80% only for reads and writes against the failed pair, and only slightly (due to controller traffic) for accesses to the other drives; on average, if that one pair comprises 20% of accesses (say a ten-drive array of five pairs), performance is affected by no more than 16% during recovery, and the risk of catastrophic data loss is reduced.

On 31st January 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Editor's note: RAID 0+1 is striping followed by mirroring, and RAID 1+0 (RAID 10) is mirroring followed by striping.

The problem is that unless you are using hardware RAID01, the mirror software/firmware does not know it is mirroring a stripe set and so can only recover the entire logical drive that has failed from its mirror. The logical drives being mirrored are the entire stripe sets, so that is the unit of recovery. IFF you are using a hardware RAID01, or I suppose a sophisticated software-only RAID01, then it is POSSIBLE, though not likely, for the firmware to have enough knowledge of the layout to realize that only one drive of the stripe set is down and recover only that one drive. HOWEVER, in that case you do not really have RAID01: since each drive in the stripe sets is mirrored individually, what you have, functionally, is RAID10 with the unfortunate name of RAID01!

There is clearly confusion about these two pseudo RAID levels, mostly because they are not officially defined levels but levels that developed in the field when users, and later vendors, combined striping and mirroring to gain the advantages of both and ended up creating these new levels. Both the mirror-over-stripe and stripe-over-mirror approaches came to be known as RAID 0+1, and it became rather confusing. About two years ago, to help clear things up, I proposed that we separate the two approaches by calling one RAID01 and the other RAID10 based on the order in which the stripe or mirror is applied, so RAID01 refers to stripe-then-mirror and RAID10 to mirror-then-stripe. Some of my messages were in response to cross-posts with RAID-related newsgroups, so the proposal received somewhat wider exposure than just Informix folk.

Unfortunately there is little discussion about the differences between RAID01 and RAID10 anywhere other than here, though from discussions I have had with two controller manufacturers, RAID01/RAID10 as described has become a pretty standard way of referring to the stripe/mirror combinations.

On 19th May 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID10 for data, and RAID1 or (separate) RAID10 for logs and rootdb. Backups can go anywhere, since they should be swept to tape immediately, or at least frequently if written to disk. If you do not sweep them immediately, consider rcp'ing them to another system on the network immediately for safety.

Considering that the write performance of RAID5 is 50% that of RAID0, RAID3 or RAID4, and that RAID10 does a bit better than straight RAID0, why even consider RAID5 for the MOST I/O intensive part of the server, the logical and physical (at least in 7.xx) log dbspaces? One could make the case that the RAID5 safety issue is moot for the logs, since the physical log is transitory and the logical logs are constantly backed up (you are using one of the continuous log backup options, I hope), and that for a server with a low transaction rate write performance is not an issue either, but you are describing a server with a very high transaction rate initially and even down the road! Logical log write performance is CRITICAL to server performance in such a system!

On 13th June 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

OK, I will deal with why RAID10 (and by extension RAID1) is not as much at risk. Remember that the problem is not the partial media failure trashing the data on the one drive that is progressively failing over time. The problem is that good data will be read from another stripe member (call it drive #1) for the stripe block containing damaged data on the failing drive (call it drive #2) and modified, causing a read of the remaining stripe members and the calculation of new parity, after which the modified block is written back to drive #1 and the parity back to that stripe's parity drive (perhaps drive #3). Since the data from drive #2 was not needed by the application (here IDS), the fact that it was damaged was not detected, and so the new parity was calculated using garbage, resulting in a parity that can ONLY be used accurately to recreate the trashed block on drive #2.

Now suppose that drive #4 suffers a catastrophic failure and has to be replaced. The damaged drive has continued to fail and now returns a different pattern of bits than was used to calculate the trashed parity block on drive #3. When the missing block data on the drive #4 replacement is calculated, it too will become garbage, and two disk blocks are now unusable; the damage has propagated.

That dealt with what happens if the bad block was NOT read directly and detected by IDS. If it was, IDS will mark the chunk OFFLINE and refuse to use it until you repair the damage. The only way you can do that is to restore from backup, or to remove the partially damaged drive and try to rebuild it from the parity as if it had completely failed. HOWEVER, if all, or even several, of the drives in that array are from the same manufacturing lot, or are even of similar age, there is a good chance that the previous problem has already trashed the parity of other blocks, so you may well be reconstructing a new drive that has more bad data blocks than the one it replaces.

With RAID10, each drive in each mirrored pair is written independently. If a block on drive 1a is trashed, the data on drive 1b (its mirror) is fine. If the bad data is read from the drive that is failing (say 1a), the engine will recognize it and mark the chunk down. All you have to do is remove drive 1a and mark the chunk back online, rebuilding the mirror online. No problems, and less chance that there are other damaged blocks on the one remaining mirror than on any of the 5 or more drives in a RAID5 stripe. If the data is NOT read from 1a but from 1b and modified, it will be rewritten to BOTH drives, improving the chances that it will be correctly readable if read from the failing 1a next time, simply because the flux changes will have been renewed; if the platter is too far gone we are just back to the possibility that the bad block will be read and flagged by IDS later. In no case can the data on 2a/2b, 3a/3b, 4a/4b, etc. be damaged. Yes, if we were talking about ANY old data file on RAID10 the damage might propagate, but since IDS has its own methods for detecting bad data reads this probability is vanishingly small (to go undetected by IDS, the damage would have to leave the first 28 bytes and the last 4 to 1020 bytes of the block untouched, so that the page header, page trailer and slot table are all intact).

On 30th June 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID5 and any striping scheme (RAID0, RAID3, RAID4, or RAID10) will give you some of the load-balancing advantages of fragmentation, but not the parallelism that fragmentation along with PDQ can give you, and not the fragment elimination advantage either.

RAID5 gives 1/2 the write performance of RAID0 (striping alone) and RAID10 gives a slight write performance increase over RAID0, so RAID5 is less than 1/2 the write performance of RAID10. In addition, DG/EMC Clariions have an excellent RAID10 implementation which will outperform Informix mirroring by a noticeable margin, and you will not have to waste the parity drives. In your case, using RAID10 instead of RAID5 + mirror (i.e. RAID1) (what is that, RAID51?) will actually cost you less or give you more storage for the same number of drives!

In addition, there is a PROBLEM with using Clariion RAID5 with Informix. Though DG/EMC deny it (I have seen it happen for years), if there is a RAID5 error, i.e. a lost drive, Informix will receive I/O errors for several seconds when the parity data reconstruction first kicks in and will mark your chunks offline, even though the RAID5 is correcting the problem and returning good data. This does not happen with Clariion RAID10 when a drive fails!

Use Clariion RAID10 and no Informix mirrors!

If you want to take advantage of parallel searching you MUST use Informix fragmentation and fragment the table across multiple dbspaces, yes.

On 5th April 2003 paul@oninit.com (Paul Watson) wrote:-

Never ever ever ever ever use the same production batch of disks throughout a RAID system. In a previous life, I was one of a team of 4 looking after just over 400 production Unix servers [40,000 (ish) disks].

We had a disk failure every week somewhere, and when we tracked the batch numbers there was a very good correlation between the batch and the failure date.

6.59 Are indexes bigger under 9.2?

On 11th September 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Yes

6.60 Why is Online 9.14 so slow?

On 25th August 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

AHAH! This could be the whole problem. IDS/US 9.1x had MAJOR performance problems on anything even remotely OLTPish. Definitely upgrade to IDS v9.21 and look to eventually upgrade to 9.30 when it comes out and has had time to stabilize. 9.1 was based on the VERY OLD 7.0 codebase, while 9.21 is based on the 7.31 codebase and is MUCH faster, with FULL OLTP support.

6.61 How do I configure DD_HASHMAX and DD_HASHSIZE?

On 13th September 2000 david@smooth1.co.uk (David Williams) wrote:-

NOTE: These previously undocumented parameters under 7.x are now documented in the IDS 9.x manuals!!

On 31st August 2000 andy@kontron.demon.co.uk (Andy Lennard) wrote:-

In case it may be useful (?) here's a tacky script that I've used to work out what my DD_HASHSIZE and DD_HASHMAX should be set to.

It may save a bit of trial-and-error.


#! /usr/bin/ksh

# Attempt to work out 'optimum' values for DD_HASHSIZE and DD_HASHMAX
#
# DD_HASHSIZE sets the number of slots in the dictionary
# DD_HASHMAX sets the max number of table names allowed in each slot
#
# for efficient searching of the dictionary the max number of tablenames
# present in a given slot should be small
#
#
# A Lennard 20-sep-1999
#

if [[ $# -lt 1 ]]; then
   echo Usage: $0 database [database...]
   exit 1
fi

# make a list of all the databases we need to consider
#  remember to include sysmaster too...
db_list=\"sysmaster\"
while [ $# -gt 0 ]
do
  db_list=$db_list,\"$1\"
  shift
done

# all tables get put into the dictionary, not just user ones
dbaccess sysmaster - <<EOF >/dev/null
output to temp$$ without headings
select tabname
from systabnames
where dbsname in ($db_list)
EOF

nawk '
BEGIN {
  
  # make an array for numeric representation of ascii
  for (i = 0; i < 128; i++) {
    c = sprintf("%c", i);
    char[c] = i;
  }

  # initialise a counter for the number of tables in the database
  n_tables = 0;

}
{
  # skip null lines
  if (length( $1 ) == 0) { next; }
}
{
  # on non-blank lines...

  # save the table name
  tabname[n_tables] = $1;

  # evaluate the checksum of the characters of the table name
  s=0;
  for (i = 1; i <= length($1); i++) {
    s += char[substr($1, i, 1)];
  }
  sum[n_tables] = s;

  n_tables++;
}

# awk has no local variable declarations; the extra parameter j serves as one
function isprime(i, j) {
  for (j = 2; j< i/2; j++) {
    if (i%j == 0) return 0;
  }
  return 1;
}

END {

  printf("%d tablenames found\n", n_tables);

  # look through the checksums to find the max checksum
  max_sum = 0;
  for (i = 0; i< n_tables; i++) {
    if (sum[i] > max_sum) {
      max_sum = sum[i];
    }
  }

  # then use this max checksum as a top limit when going through finding
  # out how many table names have an identical checksum

#  printf(" checksum            tablename\n --------            --------
-\n");
  max_same_sum = 0;
  for (j = 0; j<= max_sum; j++) {
    same_sum = 0;
    for (i = 0; i< n_tables; i++) {
      if (sum[i] == j) {
#        printf(" %8d %20s\n", sum[i], tabname[i]);
        same_sum ++;
      }
    }
    if (same_sum > max_same_sum) {
      max_same_sum = same_sum;
    }
  }

  printf("Some table names have the same checksum,\n");
  printf(" the theoretical minimum value of DD_HASHMAX is %d\n\n",
max_same_sum);

  printf(" DD_HASHSIZE     Max names\n")
  printf("                 per slot\n")

  # Now it is reckoned that DD_HASHSIZE should be a prime number
  # so loop over the prime numbers, ignoring the really small ones
  for (prime = 17; prime < 1000; prime += 2 ) {

    if (!isprime(prime)) {
      continue;
    }

    # initialise the slots that the tablenames would be stored in
    for (j = 0; j < prime; j++) {
      slot[j] = 0;
    }

    # for each table..
    for (j = 0; j < n_tables; j++) {

      # work out which slot the table name will hash to
      this_slot = sum[j] % prime;

      # increment the number of table names that would be stored in this slot
      slot[this_slot]++;

#      printf "Table %s will go into slot %d\n", tabname[j], this_slot;

    }

    # now see which slot has the greatest number of names in it
    max_names = 0;
    which_slot = 0;
    for (j = 0; j < prime; j++) {

#      printf "Slot %d has %d names\n", j, slot[j]

      # does this slot contain more names than the greatest seen so far?
      if (slot[j] > max_names ) {
         max_names = slot[j];
         which_slot = j;
      }
    }

#    printf " slot %d has the max number of names (%d)\n\n", which_slot,
max_names

    if (max_names == max_same_sum) {
      printf("%8d         %5d  -- matches theoretical minimum
DD_HASHMAX\n", prime, max_names);
    } else {
      printf("%8d         %5d\n", prime, max_names);
    }
  }
}
' temp$$

rm temp$$


On 31st March 2003 michael.mueller@kay-mueller.de (Michael Mueller) wrote:-

Some more remarks / corrections about the data dictionary cache:

I was wrong about the hash function. This is what it really does: It multiplies each byte in the table name (ignoring server, database and owner name) with it's (position in the name) * 2 + 1 and adds them all up. In the end it adds the square of the next position * 2 + 1 and takes the result modulo DD_HASHSIZE.

An example will make this clearer: The hash code for a table named "abc" is:

		('a' * 1 + 'b' * 3 + 'c' * 5 + 7 * 7) % DD_HASHSIZE =
		935 % DD_HASHSIZE
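
Editor's note: if you want to try this against your own table names, a small awk sketch in the spirit of the script earlier in this question can evaluate the function as described (the DD_HASHSIZE of 31 is just an example, and this follows the description above rather than the engine source):

echo "abc" | nawk -v hashsize=31 '
BEGIN { for (i = 0; i < 128; i++) char[sprintf("%c", i)] = i }  # ASCII lookup table
{
  s = 0
  for (i = 1; i <= length($1); i++)
    s += char[substr($1, i, 1)] * (2 * i - 1)   # position weights 1, 3, 5, ...
  w = 2 * length($1) + 1                        # weight of the next position
  s += w * w
  printf("%s: %d %% %d = %d\n", $1, s, hashsize, s % hashsize)
}'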

The DD_HASHMAX parameter determines how many entries will be cached in each list. This is shown by "onstat -g dic". The list is in least recently used order. Each time a table is released (when a sql statement finishes) its refcnt in the cache is decreased and the table is moved to the front (the mru end) of the list. If the list is longer than DD_HASHMAX, entries with refcnt = 0 are removed beginning from the back of the list (the lru end) until the number of list entries either reaches DD_HASHMAX or the beginning of the list is reached. Entries with positive refcnt are never removed.

A simple way of playing around with this is to set DD_HASHSIZE to 1 and DD_HASHMAX to, say, 10, create a couple of tables, select from them and use onstat -g dic to monitor the cache.

When should DD_HASHSIZE and DD_HASHMAX be changed? You probably want all frequently accessed tables cached; that is one point. This could be done by increasing either of these values. Making the maximum size of the hash lists a bit longer would be no big deal, at least not in a single-processor configuration.

But for a large multiprocessor there is another important consideration. Every hash list is protected by a mutex called "ddh chain" (all with the same name unfortunately) to make sure that only one thread at a time can either insert or delete an entry or increase or decrease an entry's refcnt. If DD_HASHMAX is too large, the mutex might become a hot spot and onstat -g ath and onstat -g lmx might show threads waiting for that mutex. For that reason it is better to increase DD_HASHSIZE and keep DD_HASHMAX small.
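
Editor's note: a quick way to check whether those mutexes are actually being fought over on a busy box is something like the following sketch (the exact onstat output strings vary by version, so treat the greps as a starting point):

onstat -z                           # zero the statistics
sleep 600                           # let a representative workload run
onstat -g lmx | grep "ddh chain"    # dictionary hash mutexes locked with waiters
onstat -g ath | grep -i mutex       # threads sitting in a mutex wait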

6.62 How does Online page size vary across platforms?

On 13th September 2000 heiko.giesselmann@informix.com (Heiko Giesselmann) wrote:-

Page size is not related to the word width of a specific version, i.e. on a given platform 32 bit and 64 bit versions will have the same page size (otherwise you would have to convert the page format of the complete instance disk space going from a 32 bit to a 64 bit version).

As far as I know the only platforms with 4K pages are NT and AIX (again, regardless of 32 bit or 64 bit versions).

On 10th June 2001 obnoxio@hotmail.com (Obnoxio The Clown) wrote:-

IIRC, Sequent/Dynix is 4K as well.

On 10th June 2001 vze2qjg5@verizon.net (Jack Parker) wrote:-

With most engines it depends on OS. With XPS you can set it to 2, 4 or 8K pages. Norm is 2K.

On 7th February 2001 kagel@erols.com (Art S. Kagel) wrote:-

pagesize is ONLY configurable by the DBA in XPS (8.xx). As Jonathan stated, in 6.x, 7.x, & 9.x it is fixed at 2K for all platforms except AIX and NT, which are fixed at 4K. The ONLY other version of Informix I am aware of that EVER had a pagesize different from 2K was the old Turbo engine on Amdahl mainframes, where it was 4K as well.

In 5.xx there was an ONCONFIG parameter, PAGESIZE, which you could set at your own peril at init time to set the pagesize; however, many folk at Informix warned me that no one was sure that the parameter was being properly and consistently used everywhere in the code, and that trouble, including data loss, might follow changing PAGESIZE from its default value of 2048.

On 24th April 2005 david@smooth1.co.uk (David Williams) wrote:-

Note under IDS 10.x this is configurable as a multiple of the default pagesize for the platform (i.e. a multiple of 2K or 4K) up to 16K.
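
Editor's note: under 10.x the page size is chosen when a dbspace is created; something along these lines should work, where the names and sizes are made up and -k gives the page size in KB (check the onspaces usage output for your exact version):

onspaces -c -d dbs16k -k 16 -p /dev/informix/dbs16k_chunk1 -o 0 -s 2048000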

6.63 What should I check when Online hangs?

On 14th September 2000 murray@quanta.co.nz (Murray Wood) wrote:-

Check the OS is still running.
Check the Informix message log.
Check the status of the oninit processes - are they accumulating CPU time?
Check whether the onstat -u output is changing.

On 14th September 2000 cobravert99@hotmail.com () wrote:-


A few other things to check....

   Make sure your instance isn't rolling back from a long transaction.

   Check to see if you are in a deadlock situation.

   Use onstat -g lmx (on 9.x anyway) to make sure you aren't running into a mutex lock bug.

   If you have shared memory dumps turned on and run into an Assert Fail, it could take a while for the dump to finish.

   Make sure that NETTYPE is high enough for the number of users that are accessing the database (it can sometimes seem hung).

   Check to see if AFDEBUG is set to 1.

On 15th October 2000 bryce@hrnz.co.nz (Bryce Stenberg) wrote:-

1  Check that the logs are not full.
2  Check the LRUs and the page cleaners.
3  Increase the number of open files.
4  Check the status of the chunks.
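
Editor's note: pulling the suggestions above together, a quick evidence-gathering pass along these lines can be useful before calling support (a sketch only - add whatever else your support contact asks for):

(
  date
  onstat -m        # tail of the message log - long transaction rollback, assert failures
  onstat -u        # user threads - is everything waiting on the same resource?
  onstat -l        # logical logs - are they all full?
  onstat -F        # page cleaner activity
  onstat -d        # chunk status
  onstat -g lmx    # locked mutexes with waiters (9.x and later)
) > hang.$(date +%Y%m%d.%H%M%S).out 2>&1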

6.64 Is IDS affected by Windows service packs?

On 3rd September 2000 andy@kontron.demon.co.uk (Andy Lennard) wrote:-

I was talking with our Informix help support people last week about this matter - as in what service pack versions are certified to what Informix products.

If I remember correctly (and I was referring to the product we use here - Informix Dynamic Server - Workgroup Edition for NT):

Service packs are cumulative - later SPs include the updates of earlier SPs, so you don't have to apply them sequentially.

Also to note, if running a Windows NT network: if you have servers running SP3 and SP4 or later then you will have problems - in SP4 Microsoft changed the way security is handled, making for synchronisation problems within the domain - see M$ Knowledge Base article Q197488.

6.65 When does onbar read the onconfig file?

Can anyone tell me whether a modification to the BAR_MAX_BACKUP parameter in the onconfig file requires the engine to be "bounced" for the change to take effect? The version is 7.31.UC5.

On 31st October 2000 kernoal.stephens@AUTOZONE.COM () wrote:-

No, the instance does not need to be bounced. onbar reads the onconfig file when it runs.

6.66 Why are archives so slow and do they use the buffer cache?

On 23rd October 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Improving your cache ratio will not help the archiving speed, since the onarchive thread which does the actual page reading for ontape, onarchive, and onbar reads the physical disk into a small set of private buffers to avoid causing the buffer cache to thrash and affecting performance for user threads. Sorry. How many chunks are defined? Is this a long-existing instance that has been upgraded, over time, to 7.31FC6? You may be running into the page timestamp bug, which is not fixed until the C8 maintenance release (though you can ask for a back-port patch to FC6 if you have maintenance; it is a backward compatible fix).

The problem is that there are pages with timestamps so old that the timestamp values are wrapping to negative, so to prevent them from wrapping again and overtaking the old pages the engine restamps the oldest pages during each archive. Unfortunately it seems that it stops archiving to do this, which lets the tape stop and have to be repositioned and spun up to speed again. The patch apparently just gathers the page numbers that need restamping and batches the updates after the archive is complete, or assigns a separate thread to do the job.

6.67 How does NETTYPE work and how do I tune it?

On 20th October 2000 heiko.giesselmann@informix.com (Heiko Giesselmann) wrote:-

NETTYPE establishes an actual limit for shared memory connections only. In the shared memory case the nettype settings basically define the number of 'memory slots' that clients can use to pass messages to the database server.

For all other connection types it is rather a tuning parameter that helps the engine to estimate how much memory it should set aside for networking purposes. If there is a performance problem it would show up as non-zero values in the 'q-exceed' columns of the 'global network information' section in the 'onstat -g ntt' output.

Versions 9.21 and later include a column 'alloc/max' in the 'onstat -g ntt' output that allows you to check how many network buffers are currently allocated and the maximum number of buffers used. These values give you an indication of what has to be configured for NETTYPE. More information on this topic can be found in the MaxConnect manual.
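
Editor's note: for reference, a NETTYPE entry has the form connection_type,poll_threads,connections_per_thread,vp_class. The values below are illustrative only and should be sized from the onstat -g ntt figures mentioned above:

NETTYPE ipcshm,1,200,CPU     # shared memory: here the connection count is a real limit
NETTYPE soctcp,2,200,NET     # TCP sockets: here it is a sizing hint rather than a hard limit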

6.68 Compatibility between NT 4.0 service packs and IDS?

On 27th September 2001 philippe.carrega-da-silva@ctib.cnamts.fr (Carrega Philippe) wrote:-

Service Pack 3 (SP3)
     7.23.TC13 or lower
     7.30.TC6  or lower
     9.14.TC5  or lower

Service Pack 4 (SP4) **

** Customers should not use a version of our product with SP4 or
greater, if that version does not have bug 117981 fixed in it.  This is a
showstopper bug and has resulted in many down systems.  Only versions
with the fix are listed below:

     7.30.TC11
     7.31.TC5
     9.20.TC1  or greater

Service Pack 5 (SP5)
     7.23.TC16
     7.30.TC9  or greater
     7.31.TC2  or greater
     9.14.TC7  or greater
     9.20.TC1  or greater

Service Pack 6 (SP6a)
     7.23.TC17
     7.30.TC11
     7.31.TC5  or greater
     9.14.TC9
     9.20.TC3  or greater

6.69 How do I calculate the network bandwidth required for ER?

On 7th December 2001 mpruet@attbi.com (Madison Pruet) wrote:-

In addition to the row, we send 28 bytes of control information.

So you should be able to figure out your network requirements by taking the size of each replicated row and adding 28 bytes. You can get an idea of the number of replicated rows by running onlog.

In addition to the replicated number of rows, we also will be sending ACKs back to the source. This would be about 28 bytes per transaction.
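
Editor's note: a back-of-envelope estimate based on the figures above might look like this; the row count, average row size and transaction count are made-up inputs to be replaced with numbers taken from onlog:

nawk 'BEGIN {
  rows = 500000; avg_row = 200; txns = 50000;   # made-up inputs - take yours from onlog
  data = rows * (avg_row + 28);                 # each replicated row plus 28 bytes of control data
  acks = txns * 28;                             # roughly 28 bytes of ACK per transaction
  printf("~%.1f MB of replication traffic, plus ~%.1f MB of ACKs back to the source\n",
         data / 1048576, acks / 1048576);
}'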

6.70 Why do I get out of disk space during a warm restore?

On 11th November 2003 "anonymous" wrote:-

Restoring logical logs in a warm restore needs temporary disk space. The Backup and Restore Guide says:

The minimum amount of temporary space needed is equal to the total logical-log space or the logical-log space that needs to be replayed, whichever is smaller.

6.71 What is the difference between latches, mutexes and locks?

On 16th August 2002 mpruet@attbi.com (Madison Pruet) wrote:-

Depends on who you talk to. There is a bit of confusion between latches and mutexes.

Locks are applied against rows or pages.

Mutexes are objects on which a thread will queue itself if it cannot lock the mutex.

Latches are sometimes spoken of as the same thing as a mutex, but they are really the V5 lock used to prevent two processes from accessing the same shared memory structure. In IDS the latch effectively became the spin lock, which is a machine-level test-and-set lock. The biggest difference between the spin lock and the mutex is that a thread (process) that cannot lock a spin lock will simply repeat the effort without being queued. Spin locks are used to protect structures that are only locked for a few machine instructions. They are also used to protect the mutex structure itself.

On 5th June 2001 mpruet@home.com (Madison Pruet) wrote:-

The biggest difference between a spin lock and a mutex is that when a thread tries to obtain a mutex but can't, the thread is put into a wait state and control of the cpuvp is passed to another thread. That is not the case with a spin lock: the cpuvp simply tries to get the resource again. Every so often, the spin lock will do a sleep for a few milliseconds, just to force the cpuvp off of the physical cpu.

Spin locks are used to protect memory when the access against the memory is very short, for instance to add an entry onto a linked list, or to set a bit in a bitmask flag. Also, spin locks are used to protect mutexes themselves.

The output of onstat -g spi shows only a few of the more critical spin locks.

6.72 What does the output of onstat -C (pre IDS 10.0) mean?

On 28th August 2003 michael.mueller01@kay-mueller.de (Michael Mueller) wrote:-

"Outstanding Requests" are requests to clean some page that have to be worked on by the btree cleaner. They are issued after some index key got deleted.

An "Invalidation Request" is a request to cancel an outstanding request. This is done for example when an index with outstanding requests is dropped or when btree pages with outstanding requests are merged.

6.73 What does the output at the end of onstat -u mean?

On 7th February 2001 kagel@erols.com (Art S. Kagel) wrote:-

maximum: the maximum number of concurrent users so far since startup
concurrent: current number of concurrent users
total: the size of the memory data structure that has been allocated to 
   keep track of concurrent users. 

6.74 Why are all sessions hung on the nsf.lock mutex?

On 12th July 2002 mpruet@attbi.com (Madison Pruet) wrote:-

If the engine is up and accepting connections but no sql can be executed you need to contact tech support on this one.

The nsf lock is used as part of the system that allows a specific file number to be duplicated to all of the VPs. It comes into play most often when a new socket/TLI connection has been established and the other VPs need to be able to write to that socket.

The most important things you can do to gather information about the problem are "onstat -g ath", "onstat -g stk all", "onstat -g lmx", and "onstat -g wmx".
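
Editor's note: a minimal sketch for capturing that output in one file while the problem is happening (the file name and the loop are illustrative):

for cmd in "ath" "stk all" "lmx" "wmx"
do
    echo "=== onstat -g $cmd ==="
    onstat -g $cmd
done > nsf_lock_evidence.out 2>&1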

Once you gather the information, then you need to get in touch with tech support, open a case, and submit the information.

On 24th May 2001 mpruet@home.com (Madison Pruet) wrote:-

When a file descriptor is established, that file descriptor must be made known to all of the other CPU VPs so that they can issue reads/writes to that descriptor. This is done by an internal 'mail' system: one of the VPs posts an 'assign fd/free fd' request and then sends a signal to the other VPs. The nsf.lock mutex is used to protect access to that message. It has a -1 because the interrupt handler is doing the locking and is working outside of the thread's normal activity. A socket/TLI connection is actually a file descriptor, so it uses this process as well.

It's been a while since I studied the specific code for this process, but I think that most of it uses fcntl() calls so that the FD is known by the various VPs. You should probably contact tech support on this one. I seem to remember some issues with nsf.lock timeouts, so there might be some known issues that they could help you with. Getting some onstat -g ath output while the problem is occurring would probably be the first step in analysis of the problem.

6.75 Why does the first CPU VP work harder than the others?

On 13th January 2003 mpruet@attbi.com (Madison Pruet) wrote:-

When a cpuvp has no work to do, it blocks itself on a semaphore and is not activated until there are threads in the ready queue that it could process. The order in which the cpuvps are reactivated is by cpuvp number. Hence, the first cpuvps tend to get more system time than the later cpuvps.

Also, there are a couple of other things to consider. The master timer is associated with the first cpuvp. The master timer wakes up once a second and does a bit of housekeeping work, such as moving threads from a waiting/sleeping state to the active queue.

Also, if you have inline polling for TCP/IP or SHM connections, the cpuvps running the poll threads will poll to see if there are any incoming network requests, so those cpuvps don't queue themselves on the wait semaphore so often.

Understand, there is no performance advantage in trying to schedule activity round-robin across the cpuvps. In fact, the opposite is true. When a cpuvp is not running, it must block itself on its semaphore, and blocking and unblocking activity on the cpuvp causes an OS process context switch. The context switch is much more expensive overall than simply leaving the secondary cpuvps in a blocked state and only utilizing them in an 'overflow' case.

6.76 How do I fix permissions on files under $INFORMIXDIR?

On 7th September 2001 ahamm@sanderson.net.au (Andrew Hamm) wrote:-

shutdown the engine and login as root.


cd $INFORMIXDIR
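# each line of the etc/*files manifests lists: pathname owner group mode (extra fields ignored)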
cat etc/*files | grep -v '^#' | while read file uid gid mode junk; do
    chgrp $gid $file
    chown $uid $file
    chmod $mode $file
done

There will be a few nonsense "files" that it complains about in this process, but don't sweat it.

6.77 What is a light scan?

On 6th February 2001 JParker@engage.com (Jack Parker) wrote:-

Normally when you read a table, you go through the buffer pool. That is, your row is retrieved from disk, stuck into a buffer, and an LRU entry is made to track that buffer. If you are reading a big table, you will fill the buffer pool and ruin everybody else's day, and have to worry about all of the buffer management. A light scan uses its own set of private buffer pools, so it leaves the main buffer pool alone. There is minimal buffer management for these pools. It can run about 2-4x faster than a normal buffered read. The number of light scan buffers allocated is a function of your read-ahead settings - max them out.

To get a light scan:

The problem with a light scan is that you are reading disk; what happens if the data you wanted to read was already in a buffer - and potentially altered? Enter the look-aside. Your light scan continues, but pages that are in the buffer pool are threaded into your results.

On 1st October 2001 jparker@artentech.com (Jack Parker) wrote:-

If you are doing DSS then RA determines the number of light scan buffer pools you get. Max them out.
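
Editor's note: the read-ahead knobs live in the ONCONFIG file as RA_PAGES and RA_THRESHOLD; the values below are placeholders rather than recommendations (RA_THRESHOLD is normally set a little below RA_PAGES):

RA_PAGES      128    # pages to read ahead per scan
RA_THRESHOLD  120    # remaining read-ahead pages at which the next batch is requested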

6.78 What happens during an archive?

On 4th May 2001 ahamm@sanderson.net.au (Andrew Hamm) wrote:-

Well, during archives certain things are blocked from happening. One is checkpoints, which requires you to have a physical log with plenty of room to last the entire duration of the archive. Since the logical log containing the last checkpoint cannot be reused (even if it's free) until the archive is complete, you must also have plenty of room in your logical logs. I err on the side of extreme generosity in this regard, 'cos I've lost too many summer nights recovering from deadlocked systems to ever want that to happen again.

6.79 How do I disable fuzzy checkpoints?

On 1st December 2000 doug.mcallister@nospam.fmr.com (Doug McAllister) wrote:-

onconfig parameter - NOFUZZYCKPT 1

6.80 How do I see free space in a sbspace?

On 16th December 2000 jonathanespejo@my-deja.com (Jonathan) wrote:-

When you are measuring disk usage of an sbspace, the "nfree" column in syschunks refers to the "metadata" free space.

You can identify an sbspace in the syschunks SMI table by the value is_sbspace=1.

When an sbspace is initially created, the chunk is divided into 3 areas: the "header", the "metadata", and the "userdata".

The userdata area is used to store your smart large objects (LOs), while the header area holds reserved pages and the metadata area stores information for one or more chunks within the sbspace.

You will notice that the onstat -d output displays 2 rows for an sbspace chunk. The first row is information about the "userdata" area while the second row displays information about the "metadata" area.

To clearly measure and monitor sbspace usage, you can use the following SMI SQL script, which will work in at least version 9.21.UC1:


select name dbspace,
       sum(mdsize) metadata_size,
       sum(nfree) metadata_free,
       sum(mdsize)-sum(nfree) metadata_used,
       sum(udsize) userdata_size,
       sum(udfree) userdata_free,
       sum(udsize)-sum(udfree) userdata_used
from sysdbspace d, syschunks c
where d.dbsnum=c.dbsnum and is_sbspace=1
group by 1
order by 1


6.81 Do logical log switches affect Enterprise Replication?

On 5th November 2001 mpruet@home.com (Madison Pruet) wrote:-

It has to do with conflict resolution and the cleanup of the delete tables.

(N.B. - this is more of a problem on the releases prior to 7.31)

When you have conflict resolution, it is possible that not all of the servers are actively updating the replicated tables. However, we cannot purge stuff from the delete tables until we know that all of the servers have gotten past a given point in time. For instance, if we know that all servers within the replication domain have advanced to 3:00 GMT, then we can purge data from the delete tables for any data prior to 3:00 GMT.

Prior to 7.31, we always sent a log event to all servers within the replication domain. This would cause the DTCleaner thread to fire and determine whether it could do any work. This tended to cause an "(N-1) factorial" type of problem, where N is the number of servers times the number of defined replicates. In some systems folks noticed that even though there was no database activity, the logs would continue to advance. The internal DB work done by the processing of these log events would create enough activity to cause the log file to switch, which in turn caused more activity.

We no longer use the log event alone to determine what to purge. We do still issue a log event if a server is not the source of any replication but is defined within the replicate. We do that simply because, since it has not been the source of replication, it cannot have notified the other servers within the domain which log file it has advanced to.

In 7.22 - 7.30, this was a major problem. In 7.31, it is somewhat less of an issue. In 9.2, it does not create that much of an issue.

6.82 What is the sh_lock spin lock?

On 2nd May 2003 dwood@informix.com (Dan Wood) wrote:-

I know the SHM segment bitmap search algorithm, which locates free 4k memory blocks for memory pools, was greatly improved in 9.3. Also, it was noticed that if you have many SHM virtual segments with very little free space, allocations are slower. So not only was the search algorithm improved, it also works better in high memory usage situations.

With many CPUs and CPU VPs, the contention on the sh_lock, which protects the bitmap during the search, can become very significant. If you do have many CPU VPs, you can see if this may be at the heart of the problem by doing:

    onstat -z
    sleep 600
    onstat -g spi | sort -n +1

while you have a heavy workload where you notice the problem. Even when things are going well you will often see "shmcb sh_lock" with one of the highest spin counts. The question is whether you see the same high sh_lock "Loops" when you have only one big virtual segment but with the same heavy workload. If the sh_lock spin count for the big-virtual-segment test is only a fraction of what it is when you have many small segments, then this would explain your situation.

6.83 What does "mt yield n" mean in the wait statistics?

On 14th January 2003 rsarkar@us.ibm.com (Rajib Sarkar) wrote:-

Usually, in a multi-threaded environment, the programmer puts in "yield" statements to let other threads run while one thread is waiting for something. That's the same with Informix too, as it allows other threads a fair share of CPU time to run on the VPs. For example, if an AIO VP is looking for work, it is pointless for it to run continuously and consume CPU cycles; it is more efficient for it to be woken up when a request is on its queue. That's why, if you take a stack trace of an idle AIO VP, you will see the top function is iowork(), which is actually an infinite loop. So, when the VP process is forked it is parked (if that's the right word :-) ) in the iowork() function and then it "yields" to let other threads/processes run if they've got some work to do. Once a request is queued in the queue it is watching, it wakes up and executes the I/O request (workon() is the function), and after that is complete it goes back to sleep, i.e. it yields and lets other threads do the work.

On 14th May 2005 david@smooth1.co.uk (David Williams) wrote:-

Looking at IDS 10.00.TC1TL under Windows, when the aio vp is idle the functions are:-

Stack for thread: 4 aio vp 0
 base: 0x0e070000
  len:   20480
   pc: 0x00a979a9
  tos: 0x0e074e78
state: sleeping
   vp: 5

0x00a8d1ca (oninit)_mt_yield(0xffffffff, 0x0, 0xe074fa4, 0xa9cf15)
0x00a9e03d (oninit)_ioidle (0xe0551c0, 0xe0551c0, 0xe055030, 0xe055204)
0x00a9cf15 (oninit)_iowork (0xe0551c0, 0x7, 0xe05d148, 0x0)
0x00a9ce39 (oninit)_iothread(0xe0551c0, 0x0, 0x0, 0x0)
0x00abb554 (oninit)_startup(0x28, 0xe077130, 0xe05dfb8, 0x2130)
0x00000000 (*nosymtab*)0x0

So it appears there is now an ioidle() function!

6.84 How do I add more than 10 shared memory segments under AIX?

On 23rd July 2001 kernoal.stephens@AUTOZONE.COM wrote:-

Starting with AIX 4.2.1 you can access more than the 9 or 10 shared memory segments by setting the following environment variables. These have to be set when you start the engine and for the users who want to access the shared memory.

EXTSHM=ON
SHMLBA_EXTSHM=bytes

Snip from Tech Notes.

"AIX Version 4.2.1 provides the ability to attach more than 10 shared memory regions to a process when the process is created in a shell with an environment variable defined: EXTSHM=ON. In this environment, a shared memory region can be as small as one page in size (4096 bytes) and as large as 256MB. The address space consumed is exactly the size of the shared memory region. The number of regions a process can attach is now limited only by the available address space. The total number of address space available is this mode is also 11*256MB."

"The SHMLBA_EXTSHM variable determines the shared memory alignment. The standard default is 256MB. Setting a smaller alignment allows a process to attach more segments without exceeding the overall shared memory address limit. You will probably be able to reduce the smm shared memory size (probably reduce SHM_GPAGESZ) as well, depending on how you have already tuned it."

6.85 What are the fields of the ixbar file?

On 16th January 2002 sberg@nb.com (Steve Berg) wrote:-