suche 36x36

Informix 271 could not insert new row -
downtime prevented

For our customers, we monitor existentially important and time-critical database systems.
For comprehensive monitoring and alerting, we naturally use our own tool, the Admin Scout for Informix.

In this real-life example, a parameter has reached the defined alarm threshold.
The Admin-Scout sends an e-mail to CURSOR Informix Support:

**************************************************

Notification Type: PROBLEM
Service: informix-checkalert
Alert color: RED

Message:
database:tablename has 16777215 pages, free are 1248744 pages
pagesisze is 2k maximum is 16.7 Mio pages

**************************************************

Meaning of the alarm

A table has allocated the maximum number of pages and the free pages are less than 10%.
As soon as the free pages are already used, all attempts to write to this table are aborted.

Verification

CURSOR Informix Support checks the facts directly on the machine. The table is fragmented according to individual year bookings. A continued strong growth can be seen in the current year - this is the reason for the high consumption of pages.

Estimation

There is an urgent need for action. As soon as the remaining free pages are occupied, all attempts to write to this table are aborted with two error messages

- 271 Could not insert new row into the table.
- 136 ISAM error: no more extents. (*)

The growth data collected by the Admin-Scout shows that the usual growth rate for the current year has risen from around 200,000 records per day to over 500,000. With the historical data of the Admin-Scout and a foresighted calculation, our Informix support determines a maximum period of 10 - 14 days until the imminent standstill of the database.

Coordination

Our Informix support team contacts the customer directly and a planning meeting with the technical contact persons takes place the very next day.

Special challenge

The table in this case is necessary for permanent 24-hour operation and therefore time-critical. A possible adjustment of the fragmentation strategy to bypass the page limit per partition can only be performed during a scheduled maintenance window. Since no downtime is possible within the next few weeks, an alternative solution must be found.

In cooperation with the customer, the solution proposed is to store older data in a second table with an identical structure and delete the copied records in the current table.

The challenge now is to copy and delete 75 million records, transaction-safe and without affecting application performance during operation. This is only possible in small completed transactions. Our Informix support develops and tests a new copy-delete stored procedure. Time stamps are used to divide the records into groups. A group contains the related data records for half an hour. These are now copied and deleted in groups, each group in a completed transaction. This ensures that there is no data loss and that logical logging takes place in small-sized batches.

Execution

Processing starts while the system is running. The processing using the stored procedure takes about two and a half hours. During this time, the instance and the application are closely monitored. If there are any effects on the performance of the application, it can be terminated at any time, by dividing it into small transactions.

As expected, there will be a strong increase in I/O, but no noticeable impairments to the application. In normal operation, the system writes a checkpoint with 2 - 4 thousand dirty pages every 3 minutes. During processing, the number increases to 2 - 3 checkpoints per minute with 70 - 90 thousand dirty pages per checkpoint. The duration of the checkpoints increases to 0.5 to 0.8 seconds. A blocking of sessions can hardly be observed during the entire time (less than 10 sessions for a maximum of 0.7 seconds).

This eliminates the risk of database downtime without any impact on running operations.

Feedback

In the concluding discussion with the customer's technical team, the following sentence is said:
„II am happy that we have the CURSOR Informix Support, otherwise we would have had a standing database in two weeks, with hours downtime - first to find the problem and then solve it.“.

It speaks for itself that this is a zero-downtime system that can only stand still for minutes a year.

If you are interested in further details of this case, do not hesitate to ask our team via e-mail.

(*) Whereby the ISAM Error Code has several meanings here and the text in this case points to the wrong cause. If you read the detailed error description, you will get a hint to the maximum number of pages per fragment, which would be the correct reason.

Zurück zur Übersicht