We provide an interface allowing client/server access to Berkeley DB. Our goal is to provide a client and server library so that users can separate the functionality of their applications yet still have access to the full benefits of Berkeley DB. A further goal is a seamless interface that requires minimal modification to existing applications.
The client/server interface for Berkeley DB can be broken up into several layers. At the lowest level there is the transport mechanism to send out the messages over the network. Above that layer is the messaging layer to interpret what comes over the wire, and bundle/unbundle message contents. The next layer is Berkeley DB itself.
The transport layer uses ONC RPC (RFC 1831) and XDR (RFC 1832). We declare our message types and the operations supported by our program, and the RPC library and utilities pretty much take care of the rest. The rpcgen program generates all of the low-level code needed. We need to define both sides of the RPC.
The planned interface for this method is:
    DBENV->set_server(dbenv,        /* DB_ENV structure */
                      hostname,     /* Host of server */
                      cl_timeout,   /* Client timeout (sec) */
                      srv_timeout,  /* Server timeout (sec) */
                      flags);       /* Flags: unused */

This new method takes the hostname of the server and establishes our connection and an environment on the server. If a server timeout is specified, then we send that to the server as well (and the server may or may not choose to use that value). This timeout is how long the server will allow the environment to remain idle before declaring it dead and releasing resources on the server. The pointer to the connection is stored on the client in the DBENV structure and is used by all other methods to figure out with whom to communicate. If a client timeout is specified, it indicates how long the client is willing to wait for a reply from the server. If either value is 0, then a default is used. The flags argument is currently unused. It exists because we always need a placeholder for flags, and it could later be used to specify the authentication desired (were we to provide an authentication scheme at some point) or for other uses not thought of yet.
This client code is part of the monolithic DB library. The user accesses the client functions via a new flag to db_env_create(). That flag is DB_CLIENT. By using this flag the user indicates they want the client methods rather than the standard methods for the environment. Having specified this flag, the user must then connect to the server via the DBENV->set_server() method.
We need two new fields in the DB_ENV structure. One is the socket descriptor used to communicate with the server; the other is the client identifier the server gives to us. The DB and DBC structures need only one additional field, the client identifier. The DB_TXN structure does not need modification, because we are overloading the txn_id field.
If the server crashes, then the client will get an error back. I have chosen to implement time-outs on the client side, using a default or allowing the application to specify one through the DBENV->set_server() method. Either the current operation will time out waiting for the reply, or the next operation called will time out (or get back some other kind of error regarding the server's non-existence). In any case, if the client application gets back such an error, it should abort any open transactions locally, close any databases, and close its environment. It may then decide to retry connecting to the server periodically or whenever it comes back. If the last operation a client did was a transaction commit that did not return or timed out from the server, the client cannot determine whether the transaction was committed, but it must still release the local transaction resources. Once the server is back up, recovery must be run on the server. If the transaction commit completed on the server before the crash, then the operation is redone; if the transaction commit did not get to the server, the pieces of the transaction are undone during recovery. The client can then re-establish its connection and begin again. This is effectively like beginning over: the client cannot use IDs from its previous connection to the server. However, if recovery is run, then consistency is assured.
If the client crashes, the server needs to somehow figure this out. The server is just sitting there waiting for a request to come in. A server must be able to time-out a client. Similar to ftpd, if a connection is idle for N seconds, then the server decides the client is dead and releases that client's resources, aborting any open transactions, closing any open databases and environments. The server timing out a client is not a trivial issue however. The generated function for the server just calls svc_run(). The server code I write contains procedures to do specific things. We do not have access to the code calling select(). Timing out the select is not good enough even if we could do so. We want to time-out idle environments, not simply cause a time-out if the server is idle a while. See the discussion of the server program for a description of how we accomplish this.
Since rpcgen generates the main() function of the server, I do not yet know how we are going to have the server multi-threaded or multi-process without changing the generated code. The RPC book indicates that the only way to accomplish this is through modifying the generated code in the server. For the moment we will ignore this issue while we get the core server working, as it is only a performance issue.
We do not do any security or authentication. Someone could get the code and modify it to spoof messages, trick the server, etc. RPC has some amount of authentication built into it. I haven't yet looked into it enough to know whether we want to use it or just point a user at it. The changes to the client code would be fairly minor, as would the changes to our server procs. We would have to add code to a sed script or awk script to change the generated server code (yet again) in the dispatch routine to perform authentication.
We will need to get an official program number from Sun. We can get this by sending mail to rpc@sun.com and presumably at some point they will send us back a program number that we will encode into our XDR description file. Until we release this we can use a program number in the "user defined" number space.
We have made a choice to modify the generated code for the server. The changes will be minimal, generally calling functions we write, that are in other source files. The first change is adding a call to our time-out function as described below. The second change is changing the name of the generated main() function to __dbsrv_main(), and adding our own main() function so that we can parse options, and set up other initialization we require. I have a sed script that is run from the distribution scripts that massages the generated code to make these minor changes.
Primarily the code needed for the server is the collection of the specified RPC functions. Each function receives the structure indicated, and our code takes out what it needs and passes the information into DB itself. The server needs to maintain a translation table for identifiers that we pass back to the client for the environment, transaction and database handles.
The table that the server maintains, assuming one client per server process/thread, should contain for each entry: the handle to the environment, database or transaction; a link to maintain parent/child relationships between transactions, or between databases and cursors; this handle's identifier; a type, so that we can return an error if the client passes us a bad id for this call; and a link to this handle's environment entry (for time-out/activity purposes). Entries used by environments also contain a time-out value and an activity time stamp. Their use is described below for timing out idle clients.
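The table layout described above can be sketched in C as follows. This is only an illustration under stated assumptions: every name here (ct_entry, ct_add, ct_lookup, the fixed-size array) is hypothetical and not taken from the server source, and the time is passed in explicitly to keep the sketch easy to exercise.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical names throughout; this is not the Berkeley DB server source. */
typedef enum { CT_ENV, CT_TXN, CT_DB, CT_CURSOR } ct_type;

typedef struct ct_entry {
    long id;                  /* identifier handed back to the client */
    ct_type type;             /* lets us reject an id of the wrong kind */
    void *handle;             /* server-side environment, txn, db or cursor handle */
    struct ct_entry *parent;  /* parent txn, or the database owning a cursor */
    struct ct_entry *env;     /* this handle's environment entry */
    long timeout;             /* environment entries only: idle limit (sec) */
    time_t active;            /* environment entries only: last activity */
} ct_entry;

#define CT_MAX 64
static ct_entry ct_table[CT_MAX];
static size_t ct_count;
static long ct_next_id = 1;

/* Register a handle and return its entry, or NULL if the table is full. */
static ct_entry *ct_add(ct_type type, void *handle, ct_entry *parent, ct_entry *env)
{
    ct_entry *e;

    /* Every non-environment handle must belong to some environment. */
    assert(type == CT_ENV || env != NULL);
    if (ct_count == CT_MAX)
        return NULL;
    e = &ct_table[ct_count++];
    e->id = ct_next_id++;
    e->type = type;
    e->handle = handle;
    e->parent = parent;
    e->env = (type == CT_ENV) ? e : env;  /* an environment points at itself */
    e->timeout = 0;
    e->active = 0;
    return e;
}

/* Look up an id; return NULL if it is unknown or of the wrong type for this call. */
static ct_entry *ct_lookup(long id, ct_type expected, time_t now)
{
    size_t i;

    for (i = 0; i < ct_count; i++) {
        if (ct_table[i].id == id) {
            if (ct_table[i].type != expected)
                return NULL;
            ct_table[i].env->active = now;  /* record activity on the environment */
            return &ct_table[i];
        }
    }
    return NULL;
}
```

Because each entry links back to its environment entry, a successful lookup of any handle is enough to refresh the single activity time stamp for that environment.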
Here is how we time out clients in the server. We have to modify the generated server code, but only to add one line in the dispatch function to run the time-out function. The call is made right before the return of the dispatch function, after the reply is sent to the client, so that clients aren't kept waiting for server bookkeeping activities. This time-out function then runs every time the server processes a request. In the time-out function we maintain a time-out hint, which is the earliest time at which any environment is due to time out. If the current time is less than the hint, we know we do not need to run through the list of open handles. If the hint has expired, then we go through the list of open environment handles, and if they are past their expiration, we close them and clean up. If they are not, we set up the hint for the next time.
Each entry in the open handle table has a pointer back to its environment's entry. Every operation within this environment can then update the single environment activity record. Every environment can have a different time-out. The DBENV->set_server call takes a server time-out value. If this value is 0 then a default (currently 5 minutes) is used. This time-out value is only a hint to the server. It may choose to disregard this value or set the time-out based on its own implementation.
For completeness, the flaws of this time-out implementation should be pointed out. First, it is possible that a client could crash with open handles, and no other requests come in to the server. In that case the time-out function never gets run and those resources are not released (until a request does come in). Second, this time-out is not exact. The time-out function relies on its hint, and once it has computed a hint on one run, an environment with an earlier time-out might be created before that hint expires. The result is simply a handle that doesn't get released until the original hint expires. To illustrate, suppose that at the time the time-out function is run, the earliest time-out is 5 minutes in the future. Soon after, a new environment is opened that has a time-out of 1 minute. If this environment becomes idle (while other operations are going on), the time-out function will not release that environment until the original 5 minute hint expires. This is not a problem since the resources will eventually be released.
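The hint-based sweep, including the late release in the 5 minute/1 minute illustration, can be sketched as follows. This is a minimal sketch, not the server's code: the names env_entry, env_open and timeout_sweep are hypothetical, times are plain seconds passed in by the caller, and closing an environment is reduced to clearing a flag.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_TIMEOUT 300   /* 5 minutes, used when the client passes 0 */
#define MAX_ENVS 16

/* Hypothetical environment record; times are plain seconds for clarity. */
typedef struct {
    int  open;      /* nonzero while the environment is in use */
    long timeout;   /* idle limit in seconds */
    long active;    /* time of last activity */
} env_entry;

static env_entry envs[MAX_ENVS];
static long timeout_hint;     /* earliest known expiration; 0 means none yet */

/* Open an environment slot; a timeout of 0 selects the default. */
static int env_open(long now, long timeout)
{
    int i;

    for (i = 0; i < MAX_ENVS; i++) {
        if (!envs[i].open) {
            envs[i].open = 1;
            envs[i].timeout = timeout ? timeout : DEFAULT_TIMEOUT;
            envs[i].active = now;
            return i;
        }
    }
    return -1;
}

/* Run at the end of every dispatch. Returns how many environments were closed. */
static int timeout_sweep(long now)
{
    int i, closed = 0;
    long next = 0;

    if (timeout_hint != 0 && now < timeout_hint)
        return 0;                       /* hint not expired: skip the scan */

    for (i = 0; i < MAX_ENVS; i++) {
        long expires;

        if (!envs[i].open)
            continue;
        expires = envs[i].active + envs[i].timeout;
        if (expires <= now) {
            envs[i].open = 0;           /* real code would abort txns, close handles */
            closed++;
        } else if (next == 0 || expires < next) {
            next = expires;             /* remember the earliest remaining expiration */
        }
    }
    timeout_hint = next;
    return closed;
}
```

With one environment opened at time 0 using the default, a sweep sets the hint to 300. An environment opened at time 10 with a 60 second time-out is then not examined by a sweep at time 100, even though it expired at time 70; it is released by the first sweep at or after time 300.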
On a similar note, if a client crashes during an RPC, our reply generates a SIGPIPE, and our server crashes unless we catch it. Using signal(SIGPIPE, SIG_IGN) we can ignore it, and the server will go on. This is a call in our main() function that we write. Eventually this client's handles would be timed out as described above. We need this only for the unfortunate window of a client crashing during the RPC.
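As a small illustration of the effect of ignoring SIGPIPE, the sketch below uses a local pipe to stand in for the connection to a client that has gone away. The helper names ignore_sigpipe and write_to_dead_peer are hypothetical; only signal(), pipe(), write() and close() are real POSIX calls.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE so a write to a vanished peer fails with EPIPE
 * instead of terminating the process. Returns 0 on success. */
static int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Demonstration helper: write to a pipe whose read end is already closed.
 * Returns the errno of the failed write, 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
static int write_to_dead_peer(void)
{
    int fds[2];
    ssize_t n;
    int err;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                  /* the "client" disappears */
    n = write(fds[1], "x", 1);      /* would raise SIGPIPE if not ignored */
    err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

Calling ignore_sigpipe() once at start-up, before any replies are sent, means the failed write surfaces as an ordinary EPIPE error that the server can disregard.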
The options below are primarily for control of the program itself. Details relating to databases and environments should be passed from the client to the server, since the server can serve many clients, many environments and many databases. Therefore it makes more sense for the client to set the cache size of its own environment, rather than setting a default cache size on the server that applies as a blanket to any environment it may be called upon to open. Options are:
The client code contains each method function that goes along with the RPC calls described elsewhere. The client library also contains its own version of db_env_create(), which does not result in any messages going over to the server (since we do not yet know what server we are talking to). This function sets up the pointers to the correct client functions.
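The idea of a create function that only installs the right set of method pointers, without contacting any server, can be sketched as below. All names here (my_env, my_env_create, MY_CLIENT and the open/close routines) are hypothetical stand-ins chosen for illustration; they are not the DB_ENV structure or the real db_env_create() signature.

```c
#include <assert.h>
#include <stdlib.h>

#define MY_CLIENT 0x1u   /* hypothetical stand-in for the DB_CLIENT flag */

typedef struct my_env my_env;
struct my_env {
    int (*open)(my_env *env, const char *home);
    int (*close)(my_env *env);
    int opened;          /* records which open routine ran: 1 local, 2 client */
};

/* Standard (local) methods. */
static int local_open(my_env *env, const char *home)
{
    (void)home;
    env->opened = 1;
    return 0;
}

static int local_close(my_env *env)
{
    free(env);
    return 0;
}

/* Client methods: real code would bundle the arguments and issue an RPC. */
static int client_open(my_env *env, const char *home)
{
    (void)home;
    env->opened = 2;
    return 0;
}

static int client_close(my_env *env)
{
    free(env);
    return 0;
}

/* Allocate the handle and install method pointers. No message is sent:
 * at this point we do not yet know which server we will talk to. */
static int my_env_create(my_env **envp, unsigned flags)
{
    my_env *env;

    assert(envp != NULL);
    env = calloc(1, sizeof(*env));
    if (env == NULL)
        return -1;
    if (flags & MY_CLIENT) {
        env->open = client_open;
        env->close = client_close;
    } else {
        env->open = local_open;
        env->close = local_close;
    }
    *envp = env;
    return 0;
}
```

Since the application only ever calls through the pointers in the handle, the choice made at creation time is the only place where client and local use differ from the caller's point of view.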
All of the method functions that handle the messaging have a basic flow similar to this:
    BEGIN   dbjoin    1    RETCODE
    ARG     ID        DB *         dbp
    ARG     LIST      DBC **       curs    ID
    ARG     IGNORE    DBC **       dbcpp
    ARG     INT       u_int32_t    flags
    RET     ID        long         dbcid
    END

Our first line tells us we are writing the dbjoin function. It requires special code on the client so we indicate that with the RETCODE. This method takes four arguments. For the RPC request we need the database ID from the dbp, we construct a NULL-terminated list of IDs for the cursor list, we ignore the argument to return the cursor handle to the user, and we pass along the flags. On the return, the reply contains a status, by default, and additionally, it contains the ID of the newly created cursor.
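To illustrate the LIST argument, here is a sketch of how a client stub might turn the caller's NULL-terminated cursor array into a terminated list of IDs for the request. The names my_dbc, cl_id and cursor_id_list are hypothetical, and the sketch assumes that 0 is never a valid ID so that it can serve as the terminator; the generated code may do this differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the client-side cursor: only the
 * server-assigned identifier matters for this sketch. */
typedef struct {
    long cl_id;
} my_dbc;

/* Build a 0-terminated array of IDs from a NULL-terminated cursor list.
 * Assumes 0 is never a valid ID. The caller frees the result.
 * Returns NULL if memory cannot be allocated. */
static long *cursor_id_list(my_dbc **curs, size_t *countp)
{
    size_t n = 0, i;
    long *ids;

    assert(curs != NULL);
    while (curs[n] != NULL)         /* count the cursors */
        n++;
    ids = malloc((n + 1) * sizeof(*ids));
    if (ids == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        ids[i] = curs[i]->cl_id;    /* send IDs, never client pointers */
    ids[n] = 0;                     /* terminator */
    if (countp != NULL)
        *countp = n;
    return ids;
}
```

The point of the translation is that pointers on the client mean nothing to the server; only the identifiers the server handed out earlier can be used to name its handles.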
As mentioned early on, in the section on DB Modifications, we have a single library, but allow the user to access the client portion by sending a flag to db_env_create(). The Makefile is modified to include the new files.
Testing is performed in two ways. First, I have a new example program that should become part of the example directory. It is basically a merging of ex_access.c and ex_env.c. This example is adequate to test basic functionality, as it just does database put/get calls and the appropriate open and close calls. However, in order to test the full set of functions a more generalized scheme is required. For the moment, I am going to modify the Tcl interface to accept the server information. Nothing else should need to change in Tcl. Then we can either write our own test modules or use a subset of the existing ones to test functionality on a regular basis.