APPENDIX S Chars and Varchars

Chars and Varchars

Every so often, a question is posted to the comp.databases.informix newsgroup asking whether varchars are "better" than chars, or the question might go like this:

>My company proposes changing all char(x) columns to varchar(x)
>on the grounds that <etc etc etc>

The answer is: Like all "all-or-nothing" proposals it is not a good idea to go the whole hog.

Programming issues:

First area to consider is within your programming. If you are using 4GL (or languages offering similar semantics for the chars and varchars) then:

*) The automatic clipping of spaces from chars (in all but one circumstance) WILL DISAPPEAR and that can have serious consequences requiring reprogramming. You will need to ensure appropriate clipping is inserted in your application code.

*) The circumstance where plain old chars are not automagically clipped is when used with the , (concatenate) operator but that's great because with chars it gives you a choice of glueing the strings together either with or without the intermediate blanks. You will have to review all use of the concatenate operator, because if you are not clipping, the assumed blanks included in the middle will all disappear.

*) A very peculiar characteristic of varchars: if you clip a varchar that is full of blanks, it will yield a string of ONE blank, not an empty string. So once again, clipping of chars converted to varchars will need reviewing. This characteristic is very annoying with varchars, and is a consequence of the implementation of them in the 4GL. It will be troublesome if potentially empty strings are used with concatenation.

*) Most paradoxically, varchars are NOT completely variable length, when compared with chars. Chars are implicitly variable length because the trailing blanks are ignored at all times (except with concatenate) so plain old chars in fact program exactly like variable length strings. IF you get trailing blanks on a varchar then 4GL and SQL will NOT remove it for you, and you can get peculiar results (=bugs) until you throw in a lot of clipping at suspicious points - that means examining your code and doing the appropriate thing at critical places. Data entry etc can yield "invisible" spaces - you won't see them in selects or printouts, so you can waste time wondering why the code isn't working. Plain old chars are basically "more" variable than varchars in programming.

*) If a programmer is relying on the fixed width of a char in printouts and displays, then first thing to say is "naughty programmer!" because that makes the program less flexible in the face of database changes. But seriously, if you change to varchars then reports and displays may well "wander" without appropriate use of the column control etc.

To show the behaviour of 4GL with varchars, witness this little test program:

define v varchar(10) main let v = "abc" display "<abc> <", v, ">" let v = "abc " display "<abc > <", v, ">" let v = NULL display "NULL <", v, ">" let v = "" display "<> <", v, ">" let v = " " display "< > <", v, ">" let v = " " display "< > <", v, ">" let v = v clipped display "v clipped <", v, ">" let v = "a " display "<a > clipped <", v clipped, ">" let v = " " display "< > clipped <", v clipped, ">" end main

and the resultant output, which has some expected results, and a few surprises, in the results for NULL, a single blank, an "empty" string.

<abc> <abc> <abc > <abc > NULL < > <> < > < > < > < > < > v clipped < > <a > clipped <a> < > clipped <>

Engine considerations:

*) Max length of a varchar is 255

*) Total space consumed for a varchar(size, reserved) can be size + 1

*) Minimum space consumed for a varchar(size, reserved) can be reserved + 1. If you attempt to store less than reserved, then the space consumed in the rows is still reserved + 1, although the true length of the data will be correctly saved. "reserved" is merely a performance setting for updates. To exploit it, analyse your data and guess the most useful mimimum length so that most data will fit without reallocation to an overflow area if it grows. If rows are inserted but (the varchars are never updated then you can safely set reserved to 1 because overflowing is only an issue during updates.

*) varchars in an index consume their length + 1 because indexes cannot have variable length "records" in them.

*) The bit map associated with the table is larger when the rows are variable length, because it needs to store 4 bit entries in the bitmap instead of the usual 2 bit entries.

*) varchars cannot participate in Key Only Scans.

IF you have an index on columns A, B ... AND you issue a select which only selects some of the columns in the index AND the nature of the query means the engine can use the index to get those results (perhaps because of suitable joining or filtering, or no better reason NOT to use a key only scan) THEN the engine will merely get the "rows" from the index without having to look into the table data itself.

Basically if the index can be read to get all the interesting fields then it will do that instead of touching the table. Key-only scans are described in the Performance Guide Manual from informix. The effects of varchars on key-only scans are mentioned as an errata to that book, in the file PERFDOC_7.3 (and probably similar) within the release notes directory for the engine.

*) varchars cannot participate in Light Scans. The following words are from Jack Parker:

When joining two tables with an equality statement:

select a.col, b.col from tab1 a, tab2 b where a.key=b.key;

The optimizer has the choice of using a nested loop (if there is an index on either key - although you may wind up with the dreaded auto-index) or in performing a dynamic hash join - the love of all DSS types. A hash join in this situation can be one to two orders of magnitude faster than a nested loop. An issue with the hash join is that each table is fully scanned. With a normal scan, each page scanned will be dumped into the buffers along with the associate buffer management that entails - forcing out perhaps more usefully cached data.

With a light scan, the hash join gets it's own personal light scan buffer pools (how many depends on RA_PAGES and RA_THRESHOLD) which comes out of DS_TOTAL_MEMORY. Hence you don't muck up the cache, you don't have the same buffer management issues, and you can get a whole lot more memory for your join - depending on your DS_TOTAL_MEMORY setting. I have seen light scans work anywhere from 1.5 to 3 times faster than normal buffered reads.

You can get light scans whenever you need to scan an entire table.

They can monitored with onstat -g lsc (under IDS) and onstat -g scn (undex XPS). Pre-requisites to a light scan include:

- table to be scanned must be larger thaan BUFFERS
- isolation dirty read OR ( committed orr repeatable) WITH a shared table table lock.
- (undocumented) no varchars. (althoughh XPS will do a light scan on varchars).

I take issue with the ICP Performance exam (self-assessment) Item 4 - where they cloud up one of the answers "Light scans will be performed only for sequential scans that meet certain criteria" - answer 'c' is 'The operation being performed is a read-only operation or a DDL operation'. While you certainly would not expect to have any scanning during table creation, you most certainly can get a light scan during an index build.
- - Jack Parker

Other considerations:

*) If a particular column is a code, then it will typically be fixed length or close to that, so it's not worth converting.

*) If a column is informational (eg a street address) then it may be variable in length but possibly involved in SQL or 4GL activity that would be more easily programmed with fixed length chars.

*) If a column is "narrational" ie plain old descriptions and chatter, then it's probably the most interesting candidate for conversion. These sorts of columns are less likely to be involved in any interesting programming, and are merely recorded, printed or thrown up on screens, so you will have less hassle converting. One problem with this type of column is that it's the most likely to exceed the max length of 255 available with varchars, and if you have a lot of those then you need to consider the consistency of your database and programming.

*) Actual space saved would really depend on your data. You could make a program to examine all tables in a live database and make a prediction on the total space saving. You may be saddened by the less than expected results.

Summary:

There are issues when converting chars to varchars, so don't jump in with a dream of an easy conversion! Life may be simpler if you START a particular table with appropriate varchars right from the beginning, and you can use the considerations above to choose appropriate columns and deal with the programming consequences. Converting to varchars later will require conscientious and thorough examination of your codebase and also the performance considerations in the engine. You should be aware of the sad truth that there is no such thing as a "small change" in an established codebase.

If you are starting a new table, once again varchars are by no means the proper choice for columns. I hope this document has given you enough information to choose wisely.

Please address any questions to the comp.databases.informix newsgroup.