Berkeley DB Reference Guide:
Programmer Notes

Application-specific logging and recovery

Berkeley DB includes tools to assist in the development of application-specific logging and recovery. Given a description of the information to be logged, these tools automatically create the following: logging functions (functions that take the values as parameters and construct a single record that is written to the log), read functions (functions that read a log record and unmarshal the values into a structure that maps onto the values you chose to log), a print function (for debugging), templates for the recovery functions, and automatic dispatching to your recovery functions.

Defining Application-Specific Operations

Log records are described in files named XXX.src, where "XXX" is a unique prefix. The prefixes currently used in the Berkeley DB package are btree, crdel, db, hash, log, qam, and txn. These files contain interface definition language descriptions for each type of log record that is supported.

All lines beginning with a hash character in .src files are treated as comments.

The first non-comment line in the file should begin with the keyword PREFIX, followed by a string that will be prepended to the name of every generated function. Frequently, the PREFIX is either identical or similar to the name of the .src file.

The rest of the file consists of one or more log record descriptions. Each log record description begins with the line

BEGIN RECORD_NAME RECORD_NUMBER

and ends with the line

END

The RECORD_NAME variable should be replaced with a unique record name for this log record. Record names must be unique within .src files.

The RECORD_NUMBER variable should be replaced with a record number. Record numbers must be unique for an entire application; that is, both application-specific and Berkeley DB log records must have unique values. Further, because record numbers are stored in log files, which often must be portable across application releases, no record number should ever be reused. The record number space below 10,000 is reserved for Berkeley DB itself; applications should choose record number values equal to or greater than 10,000.

Between the BEGIN and END keywords there should be one line for each data item that will be logged in this log record. The format of these lines is as follows:

ARG | DBT | POINTER	variable_name	variable_type	printf_format

The keyword ARG indicates that the argument is a simple parameter of the type specified. The keyword DBT indicates that the argument is a DBT containing a length and pointer to a byte string. The keyword POINTER indicates that the argument is a pointer to the data type specified, and the entire type should be logged.

The variable name is the field name within the structure that will be used to refer to this item. The variable type is the C type of the variable, and the printf format should be "s" for string, "d" for signed integral type, or "u" for unsigned integral type.
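As an illustration of the format described above, a small .src file might look like the following sketch. The prefix "app", the record name "update", and the field names are hypothetical; only the keywords and the column layout come from the description above.

```text
# Hypothetical application log records.
PREFIX	app

BEGIN update 10000
ARG	fileid	u_int32_t	u
DBT	key	DBT	s
DBT	before	DBT	s
DBT	after	DBT	s
END
```

Here the record number 10000 is the first value outside the range reserved for Berkeley DB itself.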

Automatically Generated Functions

For each log record description found in the file, the following structure declarations and #defines will be created in the file PREFIX_auto.h:


#define DB_PREFIX_RECORD_TYPE        /* Integer ID number */

typedef struct _PREFIX_RECORD_TYPE_args {
    /*
     * These three fields are generated for every record.
     */
    u_int32_t type;      /* Record type used for dispatch. */

    /*
     * Transaction handle that identifies the transaction on whose
     * behalf the record is being logged.
     */
    DB_TXN *txnid;

    /*
     * The log sequence number returned by the previous call to log_put
     * for this transaction.
     */
    DB_LSN *prev_lsn;

    /*
     * The rest of the structure contains one field for each of
     * the entries in the record statement.
     */
};

The DB_PREFIX_RECORD_TYPE value should be defined as an offset from the library-provided DB_user_BEGIN #define (this is the value of the first identifier available to users outside of the Berkeley DB library).
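A minimal sketch of defining record types as offsets from DB_user_BEGIN follows. It assumes a hypothetical PREFIX of "app" with two hypothetical records, and it supplies a stand-in DB_user_BEGIN (using the 10,000 boundary stated above) only so that the fragment compiles without db.h.

```c
#include <assert.h>

/*
 * Stand-in so this sketch compiles without db.h. Real code should
 * include db.h and use the library's own DB_user_BEGIN definition.
 */
#ifndef DB_user_BEGIN
#define DB_user_BEGIN 10000
#endif

/* Hypothetical record types for a PREFIX of "app". */
#define DB_app_update (DB_user_BEGIN + 0)
#define DB_app_remove (DB_user_BEGIN + 1)

/* Sanity check an application might run once at startup. */
static void
app_check_record_types(void)
{
    assert(DB_app_update >= DB_user_BEGIN);
    assert(DB_app_remove != DB_app_update);
}
```

Because record numbers are stored in the log, an offset, once assigned, should never be given to a different record type.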

In addition to the PREFIX_auto.h file, a file named PREFIX_auto.c is created, containing the following functions for each record type:

The log function, with the following parameters:

dbenv: The environment handle returned by db_env_create.
txnid: The transaction handle returned by txn_begin.
lsnp: A pointer to storage for a log sequence number into which the log sequence number of the new log record will be returned.
syncflag: A flag indicating whether the record must be written synchronously. Valid values are 0 and DB_FLUSH.

The log function marshals the parameters into a buffer and calls log_put on that buffer, returning 0 on success and non-zero on failure.

The read function with the following parameters:

recbuf: A buffer.
argp: A pointer to a structure of the appropriate type.

The read function takes a buffer and unmarshals its contents into a structure of the appropriate type. It returns 0 on success and non-zero on error. After the fields of the structure have been used, the pointer returned from the read function should be freed.
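The marshalling done by a log function and the unmarshalling done by a read function can be sketched as follows. This is not the generated code: the app_lsn and app_update_args types, the numeric txnid, and the byte layout are assumptions chosen so the example compiles without db.h, and the buffer is returned to the caller rather than handed to log_put.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for DB_LSN so the sketch compiles without db.h. */
typedef struct { uint32_t file, offset; } app_lsn;

/* Hypothetical argument structure for a record named "update". */
typedef struct {
    uint32_t type;        /* Record type used for dispatch. */
    uint32_t txnid;       /* Simplified: a numeric ID, not a DB_TXN handle. */
    app_lsn prev_lsn;     /* LSN of this transaction's previous record. */
    uint32_t fileid;      /* Application field (an ARG item). */
    uint32_t datalen;     /* Application field (a DBT item): length ... */
    unsigned char *data;  /* ... and bytes. */
} app_update_args;

/* Bytes that precede the variable-length data in a marshalled record. */
#define APP_HDR_SIZE (4 * sizeof(uint32_t) + sizeof(app_lsn))

/* Copy n bytes into the buffer and advance the write cursor. */
static void
put_bytes(unsigned char **bpp, const void *src, size_t n)
{
    memcpy(*bpp, src, n);
    *bpp += n;
}

/* Copy n bytes out of the buffer and advance the read cursor. */
static void
get_bytes(const unsigned char **bpp, void *dst, size_t n)
{
    memcpy(dst, *bpp, n);
    *bpp += n;
}

/* Marshal the fields into one malloc'd buffer; returns 0 on success. */
static int
app_update_marshal(const app_update_args *a, unsigned char **bufp, size_t *lenp)
{
    size_t len = APP_HDR_SIZE + a->datalen;
    unsigned char *buf, *bp;

    if ((buf = malloc(len)) == NULL)
        return (1);
    bp = buf;
    put_bytes(&bp, &a->type, sizeof(a->type));
    put_bytes(&bp, &a->txnid, sizeof(a->txnid));
    put_bytes(&bp, &a->prev_lsn, sizeof(a->prev_lsn));
    put_bytes(&bp, &a->fileid, sizeof(a->fileid));
    put_bytes(&bp, &a->datalen, sizeof(a->datalen));
    if (a->datalen != 0)
        put_bytes(&bp, a->data, a->datalen);
    assert((size_t)(bp - buf) == len);
    *bufp = buf;
    *lenp = len;
    return (0);
}

/* Unmarshal into one allocation (structure plus data); caller frees *argpp. */
static int
app_update_read(const unsigned char *recbuf, size_t reclen,
    app_update_args **argpp)
{
    const unsigned char *bp = recbuf;
    app_update_args hdr, *argp;

    if (reclen < APP_HDR_SIZE)
        return (1);
    hdr.data = NULL;
    get_bytes(&bp, &hdr.type, sizeof(hdr.type));
    get_bytes(&bp, &hdr.txnid, sizeof(hdr.txnid));
    get_bytes(&bp, &hdr.prev_lsn, sizeof(hdr.prev_lsn));
    get_bytes(&bp, &hdr.fileid, sizeof(hdr.fileid));
    get_bytes(&bp, &hdr.datalen, sizeof(hdr.datalen));
    if (reclen - APP_HDR_SIZE < hdr.datalen)
        return (1);
    if ((argp = malloc(sizeof(*argp) + hdr.datalen)) == NULL)
        return (1);
    *argp = hdr;
    argp->data = (unsigned char *)(argp + 1);
    memcpy(argp->data, bp, hdr.datalen);
    *argpp = argp;
    return (0);
}
```

In this sketch the structure and its data share a single allocation, so one call to free on the returned pointer releases everything, which matches the usage described above.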

The recovery function with the following parameters:

dbenv: The handle returned from the db_env_create call, which identifies the environment in which recovery is running.
rec: The rec parameter is the record being recovered.
lsn: The log sequence number of the record being recovered.
op: A parameter of type db_recops, which indicates what operation is being run (DB_TXN_OPENFILES, DB_TXN_ABORT, DB_TXN_BACKWARD_ROLL, DB_TXN_FORWARD_ROLL).
info: A structure passed by the dispatch function. It is used to contain a list of committed transactions and information about files that may have been deleted.

The recovery function is called on each record read from the log during system recovery or transaction abort.

The recovery functions are created in a separate file, PREFIX_rtemp.c, because they are only templates. The actual recovery functions must be written manually, but the templates usually provide a good starting point.

The print function:

The print function takes the same parameters as the recovery function, so that the same dispatch mechanism can call either the print functions or the actual recovery functions. This is useful for debugging, and is used by the db_printlog utility to produce a human-readable version of the log. All parameters except rec and lsn are ignored. The rec parameter contains the record to be printed.

One additional function, an initialization function, is created for each .src file.

The initialization function has the following parameters:

dbenv: The environment handle returned by db_env_create.

The recovery initialization function registers each declared log record type with the recovery system, so that the appropriate function is called during recovery.
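The idea of registering record types so that the right function is called later can be sketched with a simple table of function pointers. Everything here is hypothetical (the app_ names, the table size, and the simplified recovery function signature); the generated initialization function registers with the Berkeley DB recovery system rather than with a table like this one.

```c
#include <assert.h>
#include <stddef.h>

#define APP_USER_BEGIN   10000  /* Stand-in for DB_user_BEGIN. */
#define APP_MAX_RECTYPES 16     /* Arbitrary table size for this sketch. */
#define APP_UPDATE (APP_USER_BEGIN + 0)
#define APP_REMOVE (APP_USER_BEGIN + 1)

/* Simplified recovery function signature (the real one has more parameters). */
typedef int (*app_recover_fn)(const void *rec, int op);

static app_recover_fn app_table[APP_MAX_RECTYPES];
static unsigned long app_last_called;   /* Set by the stub functions below. */

/* Register fn for rectype; returns 0 on success, non-zero if out of range. */
static int
app_register(unsigned long rectype, app_recover_fn fn)
{
    assert(fn != NULL);
    if (rectype < APP_USER_BEGIN ||
        rectype - APP_USER_BEGIN >= APP_MAX_RECTYPES)
        return (1);
    app_table[rectype - APP_USER_BEGIN] = fn;
    return (0);
}

/* Call the function registered for rectype; non-zero if there is none. */
static int
app_dispatch(unsigned long rectype, const void *rec, int op)
{
    if (rectype < APP_USER_BEGIN ||
        rectype - APP_USER_BEGIN >= APP_MAX_RECTYPES ||
        app_table[rectype - APP_USER_BEGIN] == NULL)
        return (1);
    return (app_table[rectype - APP_USER_BEGIN](rec, op));
}

/* Stub recovery functions that only note which one was called. */
static int
app_update_recover(const void *rec, int op)
{
    (void)rec;
    (void)op;
    app_last_called = APP_UPDATE;
    return (0);
}

static int
app_remove_recover(const void *rec, int op)
{
    (void)rec;
    (void)op;
    app_last_called = APP_REMOVE;
    return (0);
}

/* Hypothetical analogue of the per-.src-file initialization function. */
static int
app_init_recover(void)
{
    int ret;

    if ((ret = app_register(APP_UPDATE, app_update_recover)) != 0)
        return (ret);
    return (app_register(APP_REMOVE, app_remove_recover));
}
```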

Using Automatically Generated Routines

Applications use the automatically generated functions as follows:

When the application starts, call DB_ENV->set_recovery_init with your recovery initialization function so that it is called at the appropriate time.
Issue a txn_begin call before any operations you want to be transaction-protected.
Before accessing any data, issue the appropriate lock call to lock the data (either for reading or writing).
Before modifying any data that is transaction-protected, issue a call to the appropriate log function.
Issue a txn_commit to save all the changes, or a txn_abort to cancel all of the modifications.

The recovery functions (described below) can be called in two cases:

From the recovery daemon upon system failure, with op set to DB_TXN_FORWARD_ROLL or DB_TXN_BACKWARD_ROLL.
From txn_abort if it is called to abort a transaction, with op set to DB_TXN_ABORT.

For each log record type you declare, you must write the appropriate function to undo and redo the modifications. The shell of these functions will be generated for you automatically, but you must fill in the details.

Your code must be able to detect whether the described modifications have been applied to the data. The function will be called with the "op" parameter set to DB_TXN_ABORT when a transaction that wrote the log record aborts, and with DB_TXN_FORWARD_ROLL and DB_TXN_BACKWARD_ROLL during recovery. The actions for DB_TXN_ABORT and DB_TXN_BACKWARD_ROLL should generally be the same.

For example, each access method database page contains the log sequence number (LSN) of the most recent log record that describes a modification to the page. When the access method changes a page, it writes a log record describing the change and including the LSN that was on the page before the change. This LSN is referred to as the previous LSN. The recovery functions read the page described by a log record and compare the LSN on the page to the LSN they were passed. The cases are as follows:

If the page LSN is less than the passed LSN and the operation is an undo, no action is necessary (because the modifications have not been written to the page).
If the page LSN is the same as the previous LSN and the operation is a redo, the actions described are reapplied to the page.
If the page LSN is equal to the passed LSN and the operation is an undo, the actions are removed from the page.
If the page LSN is greater than the passed LSN and the operation is a redo, no further action is necessary.
If the operation is a redo and the page LSN is less than the previous LSN in the log record, it is an error, because this could happen only if some previous log record was not processed.
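The LSN comparisons described above can be sketched as a small decision function. The app_lsn type is a stand-in for DB_LSN, and the app_dir and app_action enumerations are hypothetical. Cases that the text does not cover (for example, an undo where the page LSN is greater than the passed LSN) are treated as "no action" here, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for DB_LSN: a log file number and an offset within that file. */
typedef struct { uint32_t file, offset; } app_lsn;

typedef enum { APP_UNDO, APP_REDO } app_dir;
typedef enum { APP_NOTHING, APP_APPLY, APP_REMOVE, APP_ERROR } app_action;

/* Return -1, 0, or 1 as a is less than, equal to, or greater than b. */
static int
app_lsn_cmp(const app_lsn *a, const app_lsn *b)
{
    if (a->file != b->file)
        return (a->file < b->file ? -1 : 1);
    if (a->offset != b->offset)
        return (a->offset < b->offset ? -1 : 1);
    return (0);
}

/*
 * Decide what a recovery function should do to a page.
 *   page_lsn: the LSN currently stored on the page
 *   rec_lsn:  the LSN of the log record being processed (the passed LSN)
 *   prev_lsn: the page LSN saved in the log record before the change
 */
static app_action
app_decide(app_dir dir, const app_lsn *page_lsn, const app_lsn *rec_lsn,
    const app_lsn *prev_lsn)
{
    assert(page_lsn != 0 && rec_lsn != 0 && prev_lsn != 0);

    if (dir == APP_REDO) {
        int cmp_prev = app_lsn_cmp(page_lsn, prev_lsn);

        if (cmp_prev == 0)
            return (APP_APPLY);     /* Change not yet on the page. */
        if (cmp_prev < 0)
            return (APP_ERROR);     /* An earlier record was skipped. */
        return (APP_NOTHING);       /* Change already on the page. */
    }

    /* Undo: DB_TXN_ABORT and DB_TXN_BACKWARD_ROLL behave the same way. */
    if (app_lsn_cmp(page_lsn, rec_lsn) == 0)
        return (APP_REMOVE);        /* Change is on the page; take it out. */
    return (APP_NOTHING);           /* Change never reached the page. */
}
```

A recovery function built this way would map DB_TXN_FORWARD_ROLL to the redo direction and both DB_TXN_ABORT and DB_TXN_BACKWARD_ROLL to the undo direction before calling the decision function.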

Please refer to the internal recovery functions in the Berkeley DB library (found in files named XXX_rec.c) for examples of the way recovery functions should work.

Non-conformant Logging

If your application cannot conform to the default logging and recovery structure, you will have to create your own logging and recovery functions explicitly.

First, you must decide how you will dispatch your records. Encapsulate this algorithm in a dispatch function that is passed to DB_ENV->open. The arguments for the dispatch function are as follows:

dbenv: The environment handle returned by db_env_create.
rec: The record being recovered.
lsn: The log sequence number of the record to be recovered.
op: Indicates what operation of recovery is needed (openfiles, abort, forward roll, or backward roll).
info: An opaque value passed to your function during system recovery.

When you abort a transaction, txn_abort reads the last log record written for the aborting transaction and then calls your dispatch function. It continues looping, calling the dispatch function on the record whose LSN appears in the lsn parameter of the dispatch call (until a NULL LSN is placed in that field). The dispatch function is called with op set to DB_TXN_ABORT.
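The abort loop described above can be sketched with an in-memory array standing in for the log. In this hypothetical model an array index plays the role of an LSN, APP_NULL_LSN marks the end of a transaction's chain, and the dispatch function stores the previous LSN back through its lsnp argument so the loop knows which record to read next. None of these names come from the Berkeley DB API.

```c
#include <assert.h>
#include <stddef.h>

#define APP_NULL_LSN (-1)   /* Hypothetical "no previous record" marker. */

/* Simplified log record: owner transaction and link to its previous record. */
typedef struct {
    int txnid;
    int prev_lsn;
} app_rec;

/* Dispatch function: processes rec, then stores the next LSN in *lsnp. */
typedef int (*app_dispatch_fn)(const app_rec *rec, int *lsnp, void *info);

/*
 * Walk one transaction's records backward, starting at its last record,
 * until the dispatch function leaves a null LSN in the lsn field.
 */
static int
app_abort(const app_rec *log, int last_lsn, app_dispatch_fn dispatch,
    void *info)
{
    int lsn = last_lsn, ret;

    assert(log != NULL && dispatch != NULL);
    while (lsn != APP_NULL_LSN)
        if ((ret = dispatch(&log[lsn], &lsn, info)) != 0)
            return (ret);
    return (0);
}

/* Example dispatch function that records the order of the LSNs it sees. */
typedef struct {
    int visited[8];
    int n;
} app_trace;

static int
app_trace_dispatch(const app_rec *rec, int *lsnp, void *info)
{
    app_trace *t = info;

    if (t->n >= 8)
        return (1);
    t->visited[t->n++] = *lsnp;
    *lsnp = rec->prev_lsn;  /* Tell the caller which record comes next. */
    return (0);
}
```

For a log in which transaction 1 wrote records 0, 2, and 4, calling app_abort with a last LSN of 4 visits records 4, 2, and 0 in that order and skips the records of other transactions.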

Your dispatch function can do any processing necessary. See the code in db/db_dispatch.c for an example dispatch function (that is based on the assumption that the transaction ID, previous LSN, and record type appear in every log record written).