author | jbj <devnull@localhost> | 2003-12-15 21:42:09 +0000 |
---|---|---|
committer | jbj <devnull@localhost> | 2003-12-15 21:42:09 +0000 |
commit | 8960e3895f7af91126465368dff8fbb36ab4e853 (patch) | |
tree | 3c515e39dde0e88edeb806ea87d08524ba25c761 /db/docs/ref/rep | |
parent | 752cac72e220dcad4e6fce39508e714e59e3e0a1 (diff) | |
- upgrade to db-4.2.52.
CVS patchset: 6972
CVS date: 2003/12/15 21:42:09
Diffstat (limited to 'db/docs/ref/rep')
-rw-r--r-- | db/docs/ref/rep/app.html | 87 |
-rw-r--r-- | db/docs/ref/rep/comm.html | 61 |
-rw-r--r-- | db/docs/ref/rep/elect.html | 90 |
-rw-r--r-- | db/docs/ref/rep/ex.html | 36 |
-rw-r--r-- | db/docs/ref/rep/ex_comm.html | 41 |
-rw-r--r-- | db/docs/ref/rep/ex_rq.html | 34 |
-rw-r--r-- | db/docs/ref/rep/faq.html | 39 |
-rw-r--r-- | db/docs/ref/rep/id.html | 20 |
-rw-r--r-- | db/docs/ref/rep/init.html | 22 |
-rw-r--r-- | db/docs/ref/rep/intro.html | 37 |
-rw-r--r-- | db/docs/ref/rep/logonly.html | 32 |
-rw-r--r-- | db/docs/ref/rep/newsite.html | 18 |
-rw-r--r-- | db/docs/ref/rep/partition.html | 32 |
-rw-r--r-- | db/docs/ref/rep/pri.html | 30 |
-rw-r--r-- | db/docs/ref/rep/trans.html | 244 |
15 files changed, 510 insertions, 313 deletions
diff --git a/db/docs/ref/rep/app.html b/db/docs/ref/rep/app.html index 5ab61eb18..5a652ad59 100644 --- a/db/docs/ref/rep/app.html +++ b/db/docs/ref/rep/app.html @@ -1,27 +1,27 @@ -<!--Id: app.so,v 1.6 2002/08/21 21:02:15 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: app.so,v 1.16 2003/11/19 20:06:01 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Building replicated applications</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/pri.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/comm.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/pri.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/comm.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Building replicated applications</h1> +<h3 align=center>Building replicated applications</h3> <p>The simplest way to build a replicated Berkeley DB application is to first build (and debug!) the transactional version of the same application. 
Then, add a thin replication layer to the application. All highly available applications use the following additional four Berkeley DB methods: <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>, <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a>, <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> and <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> and may also use the configuration method -<a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a>: +<a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a>:</p> <p><dl compact> <p><dt><a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a><dd>The <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method configures the replication system's communications infrastructure. @@ -33,26 +33,38 @@ for accepting log records and updating the local databases based on messages from the master. For both the master and the clients, it is responsible for handling administrative functions (for example, the protocol for dealing with lost messages), and permitting new clients to -join an active replication group. +join an active replication group. This method should only be called +after the environment has been configured as a replication master or +client via <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>. <p><dt><a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a><dd>The <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method causes the replication group to elect a new -master; it is called whenever contact with the master is lost. -<p><dt><a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a><dd>The <a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a> imposes an upper bound on the amount of data +master; it is called whenever contact with the master is lost and the +application wants the remaining sites to select a new master. 
+<p><dt><a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a><dd>The <a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a> method imposes an upper bound on the amount of data that will be sent in response to a single call to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a>. +During client recovery, that is, when a replica site is trying to +synchronize with the master, clients may ask the master for a large +number of log records. If it is going to harm an application for the +master message loop to remain busy for an extended period transmitting +records to the replica, then the application will want to use <a href="../../api_c/rep_limit.html">DB_ENV->set_rep_limit</a> +to limit the amount of data the master will send before relinquishing +control and accepting other messages. </dl> <p>To add replication to a Berkeley DB application, application initialization must be changed and the application's communications infrastructure must be written. The application initialization changes are relatively -simple, but the communications infrastructure code can be complex. -<p>During application initialization, the application performs two -additional tasks: first, it must provide Berkeley DB information about its -communications infrastructure, and second, it must start the Berkeley DB -replication system. Generally, a replicated application will do normal -Berkeley DB recovery and configuration, exactly like any other transactional -application. Then, once the database environment has been opened, it -will call the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method to configure Berkeley DB for replication, -and then will call the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method to join or create the -replication group. 
-<p>When calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>, the application has two choices: +simple, but the communications infrastructure code can be complex.</p> +<p>During application initialization, the application performs three +additional tasks: first, it must specify the <a href="../../api_c/env_open.html#DB_INIT_REP">DB_INIT_REP</a> flag +when opening its database environment; second, it must provide Berkeley DB +information about its communications infrastructure; and third, it must +start the Berkeley DB replication system. Generally, a replicated application +will do normal Berkeley DB recovery and configuration, exactly like any other +transactional application. Then, once the database environment has been +opened, it will call the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method to configure Berkeley DB for +replication, and then will call the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method to join or create +the replication group.</p> +<p>When calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> at application startup, +the application has two choices: specifically configure the master for the replication group, or, alternatively, configure all group members as clients and then call an election, letting the clients select the master from among themselves. @@ -60,10 +72,10 @@ Either is correct, and the choice is entirely up to the application. The result of calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> is usually the discovery of a master, or the declaration of the local environment as the master. If a master has not been discovered after a reasonable amount of time, the -application should call <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> to call for an election. 
-<p>In the case of multiple processes accessing a replicated environment, -all of the threads of control expecting to modify databases in the -environment or process replication messages must call the +application should call <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> to call for an election.</p> +<p>In the case of multiple processes or threads accessing a replicated +environment, any environment handle that modifies databases in the +environment or processes replication messages must call the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method. Note that not all processes running in replicated environments need to call <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> or <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>. Read-only processes running in a master environment do not need to be @@ -74,23 +86,16 @@ may become masters, it is usually simplest to configure for replication on process startup rather than trying to reconfigure when the client becomes a master). Obviously, at least one thread of control on each client must be configured for replication as messages must be passed -between the master and the client. -<p>Databases are generally opened read-write on both clients and masters -in order to simplify upgrading replication clients to be masters. (If -databases are opened read-only on clients, and the client is then -upgraded to be the master, the client would have to close and reopen -all of its databases in order to support database update queries.) -However, even though the database is opened read-write on the client, -any attempt to update it will result in an error until the client is -reconfigured as a master. No databases can be opened on clients before -calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>, and attempting to do so will result in an -error. 
-<p>There are no additional interface calls required to shut down a database -environment participating in a replication group. The application -should shut down the environment in the usual manner, by calling the -<a href="../../api_c/env_close.html">DB_ENV->close</a> method. -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/pri.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/comm.html"><img src="../../images/next.gif" alt="Next"></a> +between the master and the client.</p> +<p>For implementation reasons, all incoming replication messages must be +processed using the same <a href="../../api_c/env_class.html">DB_ENV</a> handle. It is not required that +a single thread of control process all messages, only that all threads +of control processing messages use the same handle.</p> +<p>No additional calls are required to shut down a database environment +participating in a replication group. 
The application should shut down +the environment in the usual manner, by calling the <a href="../../api_c/env_close.html">DB_ENV->close</a> method.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/pri.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/comm.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/comm.html b/db/docs/ref/rep/comm.html index 00f4f2b45..9c5cadd16 100644 --- a/db/docs/ref/rep/comm.html +++ b/db/docs/ref/rep/comm.html @@ -1,49 +1,56 @@ -<!--Id: comm.so,v 1.5 2002/09/05 01:46:30 margo Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: comm.so,v 1.11 2003/10/18 19:16:06 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Building the communications infrastructure</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/app.html"><img 
src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/newsite.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/app.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/newsite.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Building the communications infrastructure</h1> +<h3 align=center>Building the communications infrastructure</h3> <p>The replication support in an application is typically written with one or more threads of control looping on one or more communication channels, receiving and sending messages. These threads accept messages from remote environments for the local database environment, and accept messages from the local environment for remote environments. Messages from remote environments are passed to the local database environment -using the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method. Messages from the local environment -are passed to the application for transmission using the callback -interface specified to the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method. +using the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method. Messages from the local environment are +passed to the application for transmission using the callback function +specified to the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method.</p> <p>Processes establish communication channels by calling the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method, regardless of whether they are running in client or server environments. 
This method specifies the <b>send</b> -interface, a callback interface used by Berkeley DB for sending messages to +function, a callback function used by Berkeley DB for sending messages to other database environments in the replication group. The <b>send</b> -interface takes an environment ID and two opaque data objects. It is -the responsibility of the <b>send</b> interface to transmit the -information in the two data objects to the database environment -corresponding to the ID, with the receiving application then calling -the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method to process the message. +function takes an environment ID and two opaque data objects. It is the +responsibility of the <b>send</b> function to transmit the information +in the two data objects to the database environment corresponding to the +ID, with the receiving application then calling the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method +to process the message.</p> <p>The details of the transport mechanism are left entirely to the application; the only requirement is that the data buffer and size of each of the control and rec <a href="../../api_c/dbt_class.html">DBT</a>s passed to the <b>send</b> function on the sending site be faithfully copied and delivered to the receiving site by means of a call to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> with -corresponding arguments. The <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method is free-threaded; it -is safe to deliver any number of messages simultaneously, and from any -arbitrary thread or process in the Berkeley DB environment. +corresponding arguments. Messages that are broadcast (whether by +broadcast media or when directed by setting the <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method's +envid parameter DB_EID_BROADCAST), should not be processed by the +message sender. 
In all cases, the application's transport media or +software must ensure that <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> is never called with a +message intended for a different database environment or a broadcast +message sent from the same environment on which <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> will +be called. The <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method is free-threaded; it is safe to +deliver any number of messages simultaneously, and from any arbitrary +thread or process in the Berkeley DB environment.</p> <p>There are a number of informational returns from the -<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method: +<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method:</p> <p><dl compact> <p><dt><a href="../../api_c/rep_message.html#DB_REP_DUPMASTER">DB_REP_DUPMASTER</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_DUPMASTER">DB_REP_DUPMASTER</a>, it means that another database environment in the replication group also believes @@ -54,6 +61,12 @@ calling the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method <p><dt><a href="../../api_c/rep_message.html#DB_REP_HOLDELECTION">DB_REP_HOLDELECTION</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_HOLDELECTION">DB_REP_HOLDELECTION</a>, it means that another database environment in the replication group has called for an election. The application should call the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method. 
+<p><dt><a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a>, it means a +permanent record, perhaps a message previously returned as +<a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> was successfully +written to disk. This record may have filled a gap in the log record that +allowed additional records to be written. The <b>ret_lsnp</b> +contains the maximum LSN of the permanent records written. <p><dt><a href="../../api_c/rep_message.html#DB_REP_NEWMASTER">DB_REP_NEWMASTER</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_NEWMASTER">DB_REP_NEWMASTER</a>, it means that a new master has been elected. The call will also return the local environment's ID for that master. If the ID of the master has changed, @@ -66,14 +79,22 @@ replication master. a message from a previously unknown member of the replication group has been received. The application should reconfigure itself as necessary so it is able to send messages to this site. +<p><dt><a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a>, it means a +message marked as <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> was processed successfully +but was not written to disk. This is normally an indication that one +or more messages, which should have arrived before this message, have +not yet arrived. This operation will be written to disk when the +missing messages arrive. The <b>ret_lsnp</b> argument will contain +the LSN of this record. 
The application should take whatever action +is deemed necessary to retain its recoverability characteristics. <p><dt><a href="../../api_c/rep_message.html#DB_REP_OUTDATED">DB_REP_OUTDATED</a><dd>When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_OUTDATED">DB_REP_OUTDATED</a>, it means that the environment has been partitioned from the master for too long a time, and the master no longer has the necessary log files to update the local client. The application should shut down, and the client should be reinitialized (see <a href="../../ref/rep/init.html">Initializing a new site</a> for more information). </dl> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/app.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/newsite.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/app.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/newsite.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/elect.html b/db/docs/ref/rep/elect.html index f4b39fdd9..c7b690afc 100644 --- a/db/docs/ref/rep/elect.html +++ b/db/docs/ref/rep/elect.html @@ -1,38 +1,53 @@ -<!--Id: elect.so,v 1.7 2002/09/11 19:25:03 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: elect.so,v 1.16 2003/06/18 01:44:58 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All 
rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Elections</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/init.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/logonly.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/init.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/logonly.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Elections</h1> -<p>Berkeley DB never initiates elections, that is the responsibility of the -application. It is not dangerous to hold an election, as the Berkeley DB -election process ensures there is never more than a single master -environment. Clients should initiate an election whenever they lose -contact with the master environment, whenever they see a return of +<h3 align=center>Elections</h3> +<p>It is the responsibility of the application to initiate elections. It +is never dangerous to hold an election, as the Berkeley DB election process +ensures there is never more than a single master database environment. 
+Clients should initiate an election whenever they lose contact with the +master environment, whenever they see a return of <a href="../../api_c/rep_message.html#DB_REP_HOLDELECTION">DB_REP_HOLDELECTION</a> from the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method, or when, for whatever reason, they do not know who the master is. It is not necessary for applications to immediately hold elections when they -start, as any existing master will be quickly discovered after calling +start, as any existing master will be discovered after calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>. If no master has been found after a short wait -period, then the application should call for an election. -<p>For a client to become the master, the client must win an election. To -win an election, the replication group must currently have no master, -the client must have the highest priority of the database environments -participating in the election, and at least (N / 2 + 1) of the members -of the replication group must participate in the election. In the case -of multiple database environments with equal priorities, the environment -with the most recent log records will win. +period, then the application should call for an election.</p> +<p>For a client to win an election, the replication group must currently +have no master, and the client must have the most recent log records. +In the case of clients having equivalent log records, the priority of +the database environments participating in the election will determine +the winner. 
At least ((N/2) + 1) of the members of the replication +group must participate in the election for a winner to be declared.</p> +<p>If an application's policy for what site should win an election can be +parameterized in terms the database environment's information (that is, +the number of sites, available log records and a relative priority are +all that matter), then Berkeley DB can handle all elections transparently. +However, there are cases where the application has more complete +knowledge and needs to affect the outcome of elections. For example, +applications may choose to handle master selection, explicitly +designating master and client sites. Applications in these cases may +never need to call for an election. Alternatively, applications may +choose to use <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>'s arguments to force the correct outcome +to an election. That is, if an application has three sites, A, B, and +C, and after a failure of C determines that A must become the winner, +the application can guarantee an election's outcome by specifying +priorities appropriately after an election:</p> +<blockquote><pre>on A: priority 100, nsites 2 +on B: priority 0, nsites 2</pre></blockquote> <p>It is dangerous to configure more than one master environment using the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method, and applications should be careful not to do so. Applications should only configure themselves as the master environment @@ -41,19 +56,20 @@ An application can only know it has won an election if the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method returns success and the local database environment's ID as the new master environment ID, or if the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns <a href="../../api_c/rep_message.html#DB_REP_NEWMASTER">DB_REP_NEWMASTER</a> and the local database environment's -ID as the new master environment ID. 
+ID as the new master environment ID.</p> <p>To add a database environment to the replication group with the intent of it becoming the master, first add it as a client. Since it may be out-of-date with respect to the current master, allow it to update itself from the current master. Then, shut the current master down. Presumably, the added client will win the subsequent election. If the client does not win the election, it is likely that it was not given -sufficient time to update itself with respect to the current master. +sufficient time to update itself with respect to the current master.</p> <p>If a client is unable to find a master or win an election, it means that the network has been partitioned and there are not enough environments -participating in the election for one of the participants to win. In -this case, the application should repeatedly call <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> and -<a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>, alternating between attempting to discover an +participating in the election for one of the participants to win (or, +there were only two sites in the replication group and one crashed). +In this case, the application should repeatedly call <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> +and <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>, alternating between attempting to discover an existing master, and holding an election to declare a new one. In desperate circumstances, an application could simply declare itself the master by calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>, or by reducing the number of @@ -62,7 +78,9 @@ Neither of these solutions is recommended: in the case of a network partition, either of these choices can result in there being two masters in one replication group, and the databases in the environment might irretrievably diverge as they are modified in different ways by the -masters. +masters. 
In the case of a two-system replication group, the application +may want to require access to a remote network site, or some other +external tie-breaker to allow a system to declare itself master.</p> <p>It is possible for a less-preferred database environment to win an election if a number of systems crash at the same time. Because an election winner is declared as soon as enough environments participate @@ -73,25 +91,23 @@ the same time (for example, a set of replicated servers in a single machine room), applications should bring the database environments on line as clients initially (which will allow them to process read queries immediately), and then hold an election after sufficient time has passed -for the slower booting machines to catch up. +for the slower booting machines to catch up.</p> <p>If, for any reason, a less-preferred database environment becomes the -master, it is possible to switch masters in a replicated environment, -although it is not a simple operation. For example, the preferred -master crashes, and one of the replication group clients becomes the -group master. In order to restore the preferred master to master -status, take the following steps: -<p><ol> +master, it is possible to switch masters in a replicated environment. +For example, the preferred master crashes, and one of the replication +group clients becomes the group master. In order to restore the +preferred master to master status, take the following steps:</p> +<ol> <p><li>The preferred master should reboot and re-join the replication group as a client. <li>Once the preferred master has caught up with the replication group, the -application on the current master should complete all active -transactions, close all open database handles, and reconfigure itself -as a client using the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method. 
+application on the current master should complete all active transactions +and reconfigure itself as a client using the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method. <li>Then, the current or preferred master should call for an election using the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method. </ol> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/init.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/logonly.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/init.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/logonly.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/ex.html b/db/docs/ref/rep/ex.html index b9f89921c..d502e1242 100644 --- a/db/docs/ref/rep/ex.html +++ b/db/docs/ref/rep/ex.html @@ -1,35 +1,35 @@ -<!--Id: ex.so,v 1.4 2002/06/24 14:50:47 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: ex.so,v 1.5 2003/02/06 03:19:11 mjc Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Ex_repquote: a replication example</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" 
content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex_comm.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex_comm.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Ex_repquote: a replication example</h1> +<h3 align=center>Ex_repquote: a replication example</h3> <p>Ex_repquote, found in the <b>examples_c/ex_repquote</b> subdirectory of the Berkeley DB distribution, is a simple but complete demonstration of a replicated application. The application is a mock stock ticker. The master accepts a stock symbol and an integer value as input, stores this information into a replicated database; the clients display the contents -of the database every few seconds. +of the database every few seconds.</p> <p>The ex_repquote application's communication infrastructure is based on TCP/IP sockets, and uses POSIX 1003.1 style networking/socket support. As a result, it is not as portable as the Berkeley DB library itself. The Makefile created by the standard UNIX configuration will build the ex_repquote application on most platforms. Enter "make ex_repquote" to -attempt to build it. 
-<p>The synopsis for ex_repquote is as follows: +attempt to build it.</p> +<p>The synopsis for ex_repquote is as follows:</p> <pre>ex_repquote [<b>-MC</b>] [<b>-h home</b>] [<b>-m host:port</b>] [<b>-o host:port</b>] [<b>-n sites</b>] [<b>-p priority</b>]</pre> -<p>The options to ex_repquote are as follows: +<p>The options to ex_repquote are as follows:</p> <p><dl compact> <p><dt><b>-M</b><dd>Configure this process as a master. <p><dt><b>-C</b><dd>Configure this process as a client. @@ -46,20 +46,20 @@ group. information. </dl> <p>A typical ex_repquote session begins with a command such as the -following, to start a master: -<p><blockquote><pre>ex_repquote -M -p 100 -n 4 -h DIR1 -m localhost:5000</pre></blockquote> -<p>and several clients: -<p><blockquote><pre>ex_repquote -C -p 50 -n 4 -h DIR2 -m localhost:5001 -o localhost:5000 -ex_repquote -C -p 10 -n 4 -h DIR3 -m localhost:5002 -o localhost:5000 -ex_repquote -C -p 0 -n 4 -h DIR4 -m localhost:5003 -o localhost:5000</pre></blockquote> +following, to start a master:</p> +<blockquote><pre>ex_repquote -M -p 100 -n 4 -h DIR1 -m localhost:6000</pre></blockquote> +<p>and several clients:</p> +<blockquote><pre>ex_repquote -C -p 50 -n 4 -h DIR2 -m localhost:6001 -o localhost:6000 +ex_repquote -C -p 10 -n 4 -h DIR3 -m localhost:6002 -o localhost:6000 +ex_repquote -C -p 0 -n 4 -h DIR4 -m localhost:6003 -o localhost:6000</pre></blockquote> <p>In this example, the client with home directory DIR4 can never become a master (its priority is 0). Both of the other clients can become masters, but the one with home directory DIR2 is preferred. Priorities are assigned by the application and should reflect the desirability of having particular clients take over as master in the case that the -master fails. 
-<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex_comm.html"><img src="../../images/next.gif" alt="Next"></a> +master fails.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex_comm.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/ex_comm.html b/db/docs/ref/rep/ex_comm.html index e0a56003d..7214a11c2 100644 --- a/db/docs/ref/rep/ex_comm.html +++ b/db/docs/ref/rep/ex_comm.html @@ -1,20 +1,20 @@ -<!--Id: ex_comm.so,v 1.6 2002/06/24 14:50:48 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: ex_comm.so,v 1.6 2002/06/24 14:50:48 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Ex_repquote: a TCP/IP based communication infrastructure</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body 
bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/ex.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex_rq.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/ex.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex_rq.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Ex_repquote: a TCP/IP based communication infrastructure</h1> +<h3 align=center>Ex_repquote: a TCP/IP based communication infrastructure</h3> <p>All Berkeley DB replication applications must implement a communication infrastructure. The communication infrastructure consists of three parts: a way to map environment IDs to particular sites, the functions @@ -23,7 +23,7 @@ supports the particular communication infrastructure used (for example, individual threads per communicating site, a shared message handler for all sites, a hybrid solution). The communication infrastructure is implemented in the file <b>ex_repquote/ex_rq_net.c</b>, and each part -of that infrastructure is described as follows. +of that infrastructure is described as follows.</p> <p>Ex_repquote maintains a table of environment ID to TCP/IP port mappings. This table is stored in the app_private field of the <a href="../../api_c/env_class.html">DB_ENV</a> object so it can be accessed by any function that has the database @@ -31,8 +31,8 @@ environment handle. The table is represented by a machtab_t structure which contains a reference to a linked list of member_t's, both of which are defined in <b>ex_repquote/ex_rq_net.c</b>. Each member_t contains the host and port identification, the environment ID, and a file -descriptor. 
The table is maintained by the following interfaces: -<p><blockquote><pre>int machtab_add(machtab_t *machtab, int fd, u_int32_t hostaddr, int port, int *eidp); +descriptor. The table is maintained by the following interfaces:</p> +<blockquote><pre>int machtab_add(machtab_t *machtab, int fd, u_int32_t hostaddr, int port, int *eidp); int machtab_init(machtab_t **machtabp, int priority, int nsites); int machtab_getinfo(machtab_t *machtab, int eid, u_int32_t *hostp, int *portp); void machtab_parm(machtab_t *machtab, int *nump, int *priorityp, u_int32_t *timeoutp); @@ -48,26 +48,27 @@ when given the special environment ID <a href="../../api_c/rep_transport.html#DB function can send messages to all the machines in the group. Third, upon receipt of an incoming message, the receive function can correctly identify the sender and pass the appropriate environment ID to the -<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method. +<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method.</p> <p>Mapping a particular environment ID to a specific port is accomplished by looping through the linked list until the desired environment ID is found. Broadcast communication is implemented by looping through the linked list and sending to each member found. Since each port communicates with only a single other environment, receipt of a message -on a particular port precisely identifies the sender. +on a particular port precisely identifies the sender.</p> <p>The example provided is merely one way to satisfy these requirements, and there are alternative implementations as well. For instance, instead of associating separate socket connections with each remote environment, an application might instead label each message with a sender identifier; instead of looping through a table and sending a copy of a message to each member of the replication group, the -application could send a single message using a broadcast protocol. 
+application could send a single message using a broadcast protocol.</p> <p>In ex_repquote's case, the send function (slightly simplified) is as -follows: -<pre><p><blockquote>int -quote_send(dbenv, control, rec, eid, flags) +follows:</p> +<pre><blockquote>int +quote_send(dbenv, control, rec, lsn, eid, flags) DB_ENV *dbenv; const DBT *control, *rec; + const DB_LSN *lsn; int eid; u_int32_t flags; { @@ -139,7 +140,7 @@ quote_send_broadcast(machtab, rec, control, flags) <p>The quote_send_one function has been omitted as it simply writes the data requested over the file descriptor that it is passed. It contains nothing specific to Berkeley DB or this communication infrastructure. The -complete code can be found in <b>ex_repquote/ex_rq_net.c</b>. +complete code can be found in <b>ex_repquote/ex_rq_net.c</b>.</p> <p>The quote_send function is passed as the callback to <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a>; Berkeley DB automatically sends messages as needed for replication. The receive function is a mirror to the quote_send_one function. It is not @@ -149,7 +150,7 @@ the sample application, all messages transmitted are Berkeley DB messages that get handled by <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a>, however, this is not always going to be the case. The application may want to pass its own messages across the same channels, distinguish between its own messages and those -of Berkeley DB, and then pass only the Berkeley DB ones to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a>. +of Berkeley DB, and then pass only the Berkeley DB ones to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a>.</p> <p>The final component of the communication infrastructure is the process model used to communicate with all the sites in the replication group. 
Each site creates a thread of control that listens on its designated @@ -158,8 +159,8 @@ then creates a new channel for each site that contacts it. In addition, each site explicitly connects to the sites specified in the <b>-o</b> command line argument. This is a fairly standard TCP/IP process architecture and is implemented by the following functions (all -in <b>ex_repquote/ex_rq_net.c</b>). -<p><blockquote><pre>int get_connected_socket(machtab_t *machtab, char *progname, char *remotehost, +in <b>ex_repquote/ex_rq_net.c</b>).</p> +<blockquote><pre>int get_connected_socket(machtab_t *machtab, char *progname, char *remotehost, int port, int *is_open, int *eidp): Connect to the specified host/port, add the site to the machtab, and return a file descriptor for communication with this site. @@ -170,8 +171,8 @@ listening on a particular part. int listen_socket_accept(machtab_t *machtab, char *progname, int socket, int *eidp): Accept a connection on a socket and add it to the machtab. int listen_socket_connect</pre></blockquote> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/ex.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex_rq.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/ex.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex_rq.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/ex_rq.html b/db/docs/ref/rep/ex_rq.html index 
6d83e98aa..60fb51444 100644 --- a/db/docs/ref/rep/ex_rq.html +++ b/db/docs/ref/rep/ex_rq.html @@ -1,31 +1,31 @@ -<!--Id: ex_rq.so,v 1.4 2002/06/24 14:50:48 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: ex_rq.so,v 1.4 2002/06/24 14:50:48 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Ex_repquote: putting it all together</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/ex_comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/xa/intro.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/ex_comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../xa/intro.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Ex_repquote: putting it all together</h1> +<h3 align=center>Ex_repquote: putting it all together</h3> <p>A replicated application must initialize a replicated environment, set up its communication infrastructure, and then make sure that incoming -messages are received and processed. 
+messages are received and processed.</p> <p>To initialize replication, ex_repquote creates a Berkeley DB environment and calls <a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> to establish a send function. The following code fragment (from the env_init function, found in <b>ex_repquote/ex_rq_main.c</b>) demonstrates this. Prior to calling this function, the application has called machtab_init to initialize its environment ID to port mapping structure and passed this structure -into env_init. -<pre><p><blockquote>if ((ret = db_env_create(&dbenv, 0)) != 0) { +into env_init.</p> +<pre><blockquote>if ((ret = db_env_create(&dbenv, 0)) != 0) { fprintf(stderr, "%s: env create failed: %s\n", progname, db_strerror(ret)); return (ret); @@ -49,12 +49,12 @@ ex_repquote creates a user-level thread to listen on its socket, plus a thread to loop and handle messages on each socket, in addition to the threads needed to manage the user interface, update the database on the master, and read from the database on the client (in other words, in -addition to the normal functionality of any database application). +addition to the normal functionality of any database application).</p> <p>Once the initial threads have all been started and the communications infrastructure is initialized, the application signals that it is ready for replication and joins a replication group by calling -<a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>: -<pre><p><blockquote>if (whoami == MASTER) { +<a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>:</p> +<pre><blockquote>if (whoami == MASTER) { if ((ret = dbenv->rep_start(dbenv, NULL, DB_REP_MASTER)) != 0) { /* Complain and exit on error. */ } @@ -79,7 +79,7 @@ without knowing the location of all its members; the new client will be contacted by the members it does not know about, who will receive the new client's contact information that was specified in "myaddr." 
See <a href="../../ref/rep/newsite.html">Connecting to a new site</a> for more -information. +information.</p> <p>The final piece of a replicated application is the code that loops, receives, and processes messages from a given remote environment. ex_repquote runs one of these loops in a parallel thread for each socket @@ -89,8 +89,8 @@ either look up the correct environment ID for each or encapsulate the ID in the communications protocol. The details may thus vary from application to application, but in ex_repquote the message-handling loop is as follows (code fragment from the hm_loop function, found in -<b>ex_repquote/ex_rq_util.c</b>): -<pre><p><blockquote>DB_ENV *dbenv; +<b>ex_repquote/ex_rq_util.c</b>):</p> +<pre><blockquote>DB_ENV *dbenv; DBT rec, control; /* Structures encapsulating a received message. */ elect_args *ea; /* Parameters to the elect thread. */ machtab_t *tab; /* The environment ID to fd mapping table. */ @@ -160,7 +160,7 @@ for (ret = 0; ret == 0;) { <p> tmpid = eid; switch(r = dbenv->rep_process_message(dbenv, - &control, &rec, &tmpid)) { + &control, &rec, &tmpid, NULL)) { case DB_REP_NEWSITE: /* * Check if we got sent connect information and if we @@ -229,8 +229,8 @@ for (ret = 0; ret == 0;) { break; } }</blockquote></pre> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/ex_comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/xa/intro.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/ex_comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../xa/intro.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a 
href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/faq.html b/db/docs/ref/rep/faq.html index bd0e421a3..f825f3233 100644 --- a/db/docs/ref/rep/faq.html +++ b/db/docs/ref/rep/faq.html @@ -1,21 +1,21 @@ -<!--Id: faq.so,v 1.6 2002/05/09 20:38:15 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: faq.so,v 1.8 2003/05/17 19:07:59 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Replication FAQ</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/partition.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/partition.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Replication FAQ</h1> -<p><ol> +<h3 align=center>Replication FAQ</h3> +<ol> <p><li><b>Does Berkeley DB provide 
support for forwarding write queries from clients to masters?</b> <p>No, it does not. The Berkeley DB RPC server code could be modified to support @@ -23,11 +23,16 @@ this functionality, but in general this protocol is left entirely to the application. Note, there is no reason not to use the communications channels the application establishes for replication support to forward database update messages to the master, Berkeley DB does not require that -those channels be used exclusively for replication messages. +those channels be used exclusively for replication messages.</p> <p><li><b>Can I use replication to partition my environment across multiple sites?</b> <p>No, this is not possible. All replicated databases must be equally -shared by all environments in the replication group. +shared by all environments in the replication group.</p> +<p><li><b>I'm running with replication but I don't see my databases +on the client.</b> +<p>This problem may be the result of the application using absolute path +names for its databases, and the pathnames are not valid on the client +system.</p> <p><li><b>How can I distinguish Berkeley DB messages from application messages?</b> <p>There is no way to distinguish Berkeley DB messages from application-specific messages, nor does Berkeley DB offer any way to wrap application messages @@ -38,7 +43,7 @@ The one exception to this rule is connection information for new sites; Berkeley DB offers a simple method for sites joining replication groups to send connection information to the other database environments in the group (see <a href="../../ref/rep/newsite.html">Connecting to a new site</a> -for more information). +for more information).</p> <p><li><b>How should I build my <b>send</b> function?</b> <p>This depends on the specifics of the application. 
One common way is to write the <b>rec</b> and <b>control</b> arguments' sizes and data to @@ -49,7 +54,7 @@ message, with header information specifying the intended recipient(s) for the message. This will likely require a global numbering scheme, however, as the Berkeley DB library has to be able to send specific log records to clients apart from the general broadcast of new log records -intended for all members of a replication group. +intended for all members of a replication group.</p> <p><li><b>Does every one of my threads of control on the master have to set up its own connection to every client? And, does every one of my threads of control on the client have to set up its own connection to @@ -59,7 +64,7 @@ thread of control which modifies a database in the master environment must be prepared to send a message to the client environments, and any thread of control which delivers a message to a client environment must be prepared to send a message to the master. There are many ways in -which these requirements can be satisfied. +which these requirements can be satisfied.</p> <p>The simplest case is probably a single, multithreaded process running on the master and clients. The process running on the master would require a single write connection to each client and a single read @@ -67,7 +72,7 @@ connection from each client. A process running on each client would require a single read connection from the master and a single write connection to the master. Threads running in these processes on the master and clients would use the same network connections to pass -messages back and forth. +messages back and forth.</p> <p>A common complication is when there are multiple processes running on the master and clients. 
A straight-forward solution is to increase the numbers of connections on the master -- each process running on the @@ -81,7 +86,7 @@ still only requires a single thread of control that receives master messages and forwards them into the database, and which also takes database messages and forwards them back to the master. This model requires the networking infrastructure support many-to-one -writers-to-readers, of course. +writers-to-readers, of course.</p> <p>If the number of network connections is a problem in the multiprocess model, and inter-process communication on the system is inexpensive enough, an alternative is have a single process which communicates @@ -90,16 +95,16 @@ between the master the each client, and whenever a process' communications process which is responsible for forwarding the message to the appropriate client. Alternatively, a broadcast mechanism will simplify the entire networking infrastructure, as processes will likely -no longer have to maintain their own specific network connections. +no longer have to maintain their own specific network connections.</p> <p><li><b>Can I use replication to replicate just the database environment's log files?</b> <p>Yes. If the <a href="../../api_c/rep_start.html#DB_REP_LOGSONLY">DB_REP_LOGSONLY</a> flag is specified to <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>, the client site acts as a repository for logfiles (see <a href="../../ref/rep/logonly.html">Log file only clients</a> for more -information). 
+information).</p> </ol> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/partition.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/ex.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/partition.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/ex.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/id.html b/db/docs/ref/rep/id.html index c728f5a00..091e6aea6 100644 --- a/db/docs/ref/rep/id.html +++ b/db/docs/ref/rep/id.html @@ -1,20 +1,20 @@ -<!--Id: id.so,v 1.6 2002/05/09 20:38:15 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: id.so,v 1.7 2003/10/18 19:16:06 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Replication environment IDs</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> 
<td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/intro.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/pri.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/intro.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/pri.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Replication environment IDs</h1> +<h3 align=center>Replication environment IDs</h3> <p>Each database environment included in a replication group must have a unique identifier for itself and for the other members of the replication group. The identifiers do not need to be global, that is, @@ -23,14 +23,14 @@ the replication group as it encounters them. For example, given three sites: A, B and C, site A might assign the identifiers 1 and 2 to sites B and C respectively, while site B might assign the identifiers 301 and 302 to sites A and C respectively. Note that it is not wrong to have -global identifiers, it is just not a requirement. +global identifiers, it is just not a requirement.</p> <p>It is the responsibility of the application to label each incoming replication message passed to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method with the appropriate identifier. Subsequently, Berkeley DB will label outgoing messages to the -<b>send</b> interface with those same identifiers. +<b>send</b> function with those same identifiers.</p> <p>Negative identifiers are reserved for use by Berkeley DB, and should never be assigned to environments by the application. 
Two of these reserved -identifiers are intended for application use, as follows: +identifiers are intended for application use, as follows:</p> <p><dl compact> <p><dt><a href="../../api_c/rep_transport.html#DB_EID_BROADCAST">DB_EID_BROADCAST</a><dd>The <a href="../../api_c/rep_transport.html#DB_EID_BROADCAST">DB_EID_BROADCAST</a> identifier indicates a message should be broadcast to all members of a replication group. @@ -38,8 +38,8 @@ broadcast to all members of a replication group. may be used to initialize environment ID variables that are subsequently checked for validity. </dl> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/intro.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/pri.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/intro.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/pri.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/init.html b/db/docs/ref/rep/init.html index 0155e8fbc..b093a2b84 100644 --- a/db/docs/ref/rep/init.html +++ b/db/docs/ref/rep/init.html @@ -1,23 +1,23 @@ -<!--Id: init.so,v 1.2 2001/11/05 17:24:27 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: init.so,v 1.2 2001/11/05 17:24:27 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB 
Reference Guide: Initializing a new site</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/elect.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/elect.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Initializing a new site</h1> +<h3 align=center>Initializing a new site</h3> <p>Perform the following steps to add a new site to the replication -group: -<p><ol> +group:</p> +<ol> <p><li>Do an archival backup of the master's environment, as described in <a href="../../ref/transapp/archival.html">Database and log file archival</a>. The backup can either be a conventional backup or a hot @@ -34,14 +34,14 @@ frequency with which log files are reclaimed using the <a href="../../utility/db_archive.html">db_archive</a> utility or the <a href="../../api_c/log_archive.html">DB_ENV->log_archive</a> method, it may be necessary to suppress log reclamation until the newly restarted client has "caught up" and applied all log records generated during its -downtime. 
+downtime.</p> <p>As with any Berkeley DB application, the database environment must be in a consistent state at application startup. This is most easily assured by running recovery at startup time in one thread or process; it is harmless to do this on both clients and masters even when not strictly -necessary. -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/elect.html"><img src="../../images/next.gif" alt="Next"></a> +necessary.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/elect.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/intro.html b/db/docs/ref/rep/intro.html index 538c68e2f..65f25d410 100644 --- a/db/docs/ref/rep/intro.html +++ b/db/docs/ref/rep/intro.html @@ -1,21 +1,21 @@ -<!--Id: intro.so,v 1.6 2002/08/30 20:02:24 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: intro.so,v 1.8 2003/04/16 20:28:05 margo Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Introduction</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" 
content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <a name="2"><!--meow--></a> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/transapp/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/id.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../transapp/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/id.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Introduction</h1> +<h3 align=center>Introduction</h3> <p>Berkeley DB includes support for building highly available applications based on replication. Berkeley DB replication groups consist of some number of independently configured database environments. There is a single @@ -25,22 +25,25 @@ and writes; client environments support only database reads. If the master environment fails, applications may upgrade a client to be the new master. The database environments might be on separate computers, on separate hardware partitions in a non-uniform memory access (NUMA) -system, or on separate disks in a single server. As always with Berkeley DB +system, or on separate disks in a single server. The only constraint +is that all the participants in a replication group all be on machines +of the same endianness. (We expect this constraint to be removed +in a future release.) 
As always with Berkeley DB environments, any number of concurrent processes or threads may access a database environment. In the case of a master environment, any number of threads of control may read and write the environment, and in the case of a client environment, any number of threads of control may read -the environment. +the environment.</p> <p>Applications may be written to provide various degrees of consistency between the master and clients. The system can be run synchronously such that replicas are guaranteed to be up-to-date with all committed transactions, but doing so may incur a significant performance penalty. Higher performance solutions sacrifice total consistency, allowing the -clients to be out of date for an application-controlled amount of time. +clients to be out of date for an application-controlled amount of time.</p> <p>While Berkeley DB includes the database infrastructure necessary to construct highly available database environments, applications must still provide -some critical components: -<p><ol> +some critical components:</p> +<ol> <p><li>The application is responsible for providing the communication infrastructure. Applications may use whatever wire protocol is appropriate for their application (for example, RPC, TCP/IP, UDP, VI or @@ -56,8 +59,14 @@ For example, the application may choose to encrypt data, use a secure socket layer, or do nothing at all. The level of security is left to the sole discretion of the application. </ol> -<!--Id: m4.methods,v 1.1 2002/08/30 20:02:36 bostic Exp --> -<p><table border=1 align=center> +<p>Finally, the Berkeley DB replication implementation has one other additional +feature to increase application reliability. Replication in Berkeley DB is +implemented to perform database updates using a different code path than +the standard ones. 
This means operations that manage to crash the +replication master due to a software bug will not necessarily also crash +replication clients.</p> +<!--$Id: m4.methods,v 1.1 2002/08/30 20:02:36 bostic Exp $--> +<table border=1 align=center> <tr><th>Replication and Related Methods</th><th>Description</th></tr> <tr><td><a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a></td><td>Configure replication transport</td></tr> <tr><td><a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a></td><td>Hold a replication election</td></tr> @@ -66,8 +75,8 @@ the sole discretion of the application. <tr><td><a href="../../api_c/rep_start.html">DB_ENV->rep_start</a></td><td>Configure an environment for replication</td></tr> <tr><td><a href="../../api_c/rep_stat.html">DB_ENV->rep_stat</a></td><td>Replication statistics</td></tr> </table> -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/transapp/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/id.html"><img src="../../images/next.gif" alt="Next"></a> +<table width="100%"><tr><td><br></td><td align=right><a href="../transapp/faq.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/id.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/logonly.html b/db/docs/ref/rep/logonly.html index 2aca1b8cf..297420b3f 100644 --- a/db/docs/ref/rep/logonly.html +++ b/db/docs/ref/rep/logonly.html @@ -1,29 +1,29 @@ -<!--Id: logonly.so,v 1.6 2002/08/08 15:46:02 bostic Exp --> -<!--Copyright 
1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: logonly.html,v 1.5 2003/12/15 21:43:57 jbj Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Log file only clients</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/elect.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/trans.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/elect.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/trans.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Log file only clients</h1> +<h3 align=center>Log file only clients</h3> <p>Applications wanting to use replication to support recovery after catastrophic failure of the master may want to configure a site as a logs-file-only replica. Such clients cannot respond to read (or write) -queries but they still receive a complete copy the log files, so that in the -event of master failure, a copy of the logs is available. 
-<p>Log file only clients are configured like other client sites, except +queries but they still receive a complete copy of the log files, so that in the +event of master failure, a copy of the logs is available.</p> +<p>Log-file-only clients are configured like other client sites, except they should specify the <a href="../../api_c/rep_start.html#DB_REP_LOGSONLY">DB_REP_LOGSONLY</a> flag to the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method and should specify a priority of 0 to the -<a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method. +<a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method.</p> <p>There are two ways to recover using a log-file-only replica. The simplest way is to copy the log files from the log-file-only replica onto another site (either master or replica) and run catastrophic @@ -41,18 +41,18 @@ on the log-file-only replica), once the site returns to being a log-file-only replica, the database files on the log-file-only replica should be removed, and if the log files do not begin with log file number 1, a new set of archival databases should be created from -the current master. +the current master.</p> <p>More specifically, the log files accumulating on the log-file-only replica can take the place of the log files described in <i>catastrophic recovery</i> section of the <a href="../../ref/transapp/recovery.html">Recovery procedures</a> Berkeley DB -Reference Guide. +Reference Guide.</p> <p>In all other ways, a log-file-only site behaves as other replication -clients do. It should have a thread or process receiving messages and -passing them to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> and must respond to all returns -described for that interface. 
-<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/elect.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/trans.html"><img src="../../images/next.gif" alt="Next"></a> +clients do. It should have at least one thread or process receiving +messages and passing them to <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> and must respond to all +returns described for that method.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/elect.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/trans.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/newsite.html b/db/docs/ref/rep/newsite.html index 850d8967d..3a2dc0d06 100644 --- a/db/docs/ref/rep/newsite.html +++ b/db/docs/ref/rep/newsite.html @@ -1,20 +1,20 @@ -<!--Id: newsite.so,v 1.2 2001/10/25 14:58:49 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: newsite.so,v 1.2 2001/10/25 14:58:49 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Connecting to a new site</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta 
name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/init.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/init.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Connecting to a new site</h1> +<h3 align=center>Connecting to a new site</h3> <p>Connecting to a new site in the replication group happens whenever the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns <a href="../../api_c/rep_message.html#DB_REP_NEWSITE">DB_REP_NEWSITE</a>. The application should assign the new site a local environment ID number, and all future @@ -23,7 +23,7 @@ environment ID number. It is possible, of course, for the application to be aware of a new site before the return of <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> (for example, applications using connection-oriented protocols are likely to detect new sites immediately, while applications using broadcast -protocols may not). +protocols may not).</p> <p>Regardless, in applications supporting the dynamic addition of database environments to replication groups, environments joining an existing replication group may need to provide contact information. 
(For @@ -36,9 +36,9 @@ the group using the <b>rec</b> parameter returned by <a href="../../api_c/rep_me If no additional information was provided for Berkeley DB to forward to the existing members of the group, the <b>data</b> field of the <b>rec</b> parameter passed to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method will be NULL after -<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_NEWSITE">DB_REP_NEWSITE</a>. -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/init.html"><img src="../../images/next.gif" alt="Next"></a> +<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> returns <a href="../../api_c/rep_message.html#DB_REP_NEWSITE">DB_REP_NEWSITE</a>.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/comm.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/init.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/partition.html b/db/docs/ref/rep/partition.html index 4e12fe7c9..326f17845 100644 --- a/db/docs/ref/rep/partition.html +++ b/db/docs/ref/rep/partition.html @@ -1,22 +1,22 @@ -<!--Id: partition.so,v 1.1 2001/10/25 20:05:34 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: partition.so,v 1.1 2001/10/25 20:05:34 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> 
<!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Network partitions</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/faq.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/faq.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Network partitions</h1> +<h3 align=center>Network partitions</h3> <p>The Berkeley DB replication implementation can be affected by network -partitioning problems. +partitioning problems.</p> <p>For example, consider a replication group with N members. The network partitions with the master on one side and more than N/2 of the sites on the other side. The sites on the side with the master will continue @@ -26,7 +26,7 @@ realizing they no longer have a master, will hold an election. The election will succeed as there are more than N/2 of the total sites participating, and there will then be two masters for the replication group. 
Since both masters are potentially accepting write queries, the -databases could diverge in incompatible ways. +databases could diverge in incompatible ways.</p> <p>If multiple masters are ever found to exist in a replication group, a master detecting the problem will return <a href="../../api_c/rep_message.html#DB_REP_DUPMASTER">DB_REP_DUPMASTER</a>. If the application sees this return, it should reconfigure itself as a @@ -34,7 +34,7 @@ client (by calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> (by calling <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>). The site that wins the election may be one of the two previous masters, or it may be another site entirely. Regardless, the winning system will bring all of the other systems into -conformance. +conformance.</p> <p>As another example, consider a replication group with a master environment and two clients A and B, where client A may upgrade to master status and client B cannot. Then, assume client A is partitioned @@ -44,19 +44,19 @@ not come back on-line. Subsequently, the network partition is restored, and clients A and B hold an election. As client B cannot win the election, client A will win by default, and in order to get back into sync with client B, possibly committed transactions on client B will be -unrolled until the two sites can once again move forward together. +unrolled until the two sites can once again move forward together.</p> <p>In both of these examples, there is a phase where a newly elected master brings the members of a replication group into conformance with itself so that it can start sending new information to them. This can result in the loss of information as previously committed transactions are -unrolled. +unrolled.</p> <p>In architectures where network partitions are an issue, applications may want to implement a heart-beat protocol to minimize the consequences of a bad network partition. 
As long as a master is able to contact at least half of the sites in the replication group, it is impossible for there to be two masters. If the master can no longer contact a sufficient number of systems, it should reconfigure itself as a client, -and hold an election. +and hold an election.</p> <p>There is another tool applications can use to minimize the damage in the case of a network partition. By specifying a <b>nsites</b> argument to <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> that is larger than the actual number of @@ -66,7 +66,7 @@ a large percentage of the sites in the system. For example, if there are 20 database environments in the replication group, and an argument of 30 is specified to the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method, then a system will have to be able to talk to at least 16 of the sites to declare itself the -master. +master.</p> <p>Specifying a <b>nsites</b> argument to <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> that is smaller than the actual number of database environments in the replication group has its uses as well. For example, consider a @@ -77,14 +77,14 @@ the master. A reasonable alternative would be to specify a argument of 1 to the other. That way, one of the systems could win elections even when partitioned, while the other one could not. This would allow at one of the systems to continue accepting write queries -after the partition. +after the partition.</p> <p>These scenarios stress the importance of good network infrastructure in Berkeley DB replicated environments. When replicating database environments over sufficiently lossy networking, the best solution may well be to pick a single master, and only hold elections when human intervention -has determined the selected master is unable to recover at all. 
-<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/faq.html"><img src="../../images/next.gif" alt="Next"></a> +has determined the selected master is unable to recover at all.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/trans.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/faq.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/pri.html b/db/docs/ref/rep/pri.html index e3d7a8f9d..0428611c4 100644 --- a/db/docs/ref/rep/pri.html +++ b/db/docs/ref/rep/pri.html @@ -1,25 +1,25 @@ -<!--Id: pri.so,v 1.3 2001/11/05 17:24:27 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: pri.so,v 1.6 2003/04/04 18:27:19 sue Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Replication environment priorities</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body 
bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/id.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/app.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/id.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/app.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Replication environment priorities</h1> -<p>Each database environment included in a replication group must have a -priority, which specifies a relative ordering among the different -environments in a replication group. This ordering determines which -environment will be selected as a new master in case the existing master -fails. +<h3 align=center>Replication environment priorities</h3> +<p>Each database environment included in a replication group must have +a priority, which specifies a relative ordering among the different +environments in a replication group. This ordering is a factor in +determining which environment will be selected as a new master in +case the existing master fails.</p> <p>Priorities must be a non-negative integer, but do not need to be unique throughout the replication group. A priority of 0 means the system can never become a master, regardless. Otherwise, larger valued priorities @@ -27,9 +27,13 @@ indicate a more desirable master. For example, if a replication group consists of three database environments, two of which are connected by an OC3 and the third of which is connected by a T1, the third database environment should be assigned a priority value which is lower than -either of the other two. 
-<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/id.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/app.html"><img src="../../images/next.gif" alt="Next"></a> +either of the other two.</p> +<p>Desirability of the master is first determined by the client having +the most recent log records. Ties in log records are broken with +the client priority. If both sites have the same number of log +records and the same priority, one is selected at random.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/id.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/app.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> diff --git a/db/docs/ref/rep/trans.html b/db/docs/ref/rep/trans.html index 4f9021ad9..f72f3648c 100644 --- a/db/docs/ref/rep/trans.html +++ b/db/docs/ref/rep/trans.html @@ -1,20 +1,20 @@ -<!--Id: trans.so,v 1.6 2002/05/11 18:00:23 bostic Exp --> -<!--Copyright 1997-2002 by Sleepycat Software, Inc.--> +<!--$Id: trans.so,v 1.14 2003/10/18 19:16:06 bostic Exp $--> +<!--Copyright 1997-2003 by Sleepycat Software, Inc.--> <!--All rights reserved.--> <!--See the file LICENSE for redistribution information.--> <html> <head> <title>Berkeley DB Reference Guide: Transactional guarantees</title> <meta name="description" content="Berkeley DB: An embedded database programmatic toolkit."> -<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access 
methods,java,C,C++"> +<meta name="keywords" content="embedded,database,programmatic,toolkit,b+tree,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++"> </head> <body bgcolor=white> <table width="100%"><tr valign=top> <td><h3><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></h3></td> -<td align=right><a href="../../ref/rep/logonly.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/partition.html"><img src="../../images/next.gif" alt="Next"></a> +<td align=right><a href="../rep/logonly.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/partition.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> <p> -<h1 align=center>Transactional guarantees</h1> +<h3 align=center>Transactional guarantees</h3> <p>It is important to consider replication in the context of the overall database environment's transactional guarantees. To briefly review, transactional guarantees in a non-replicated application are based on @@ -23,82 +23,218 @@ drive. If the application or system then fails, the Berkeley DB logging information is reviewed during recovery, and the databases are updated so that all changes made as part of committed transactions appear, and all changes made as part of uncommitted transactions do not appear. In -this case, no information will have been lost. 
-<p>If a database environment does not require that the log be flushed to +this case, no information will have been lost.</p> +<p>If a database environment does not require the log be flushed to stable storage on transaction commit (using the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> flag to increase performance at the cost of sacrificing transactional durability), Berkeley DB recovery will only be able to restore the system to the state of the last commit found on stable storage. In this case, information may have been lost (for example, the changes made by some -committed transactions may not appear in the databases after recovery). +committed transactions may not appear in the databases after recovery).</p> <p>Further, if there is database or log file loss or corruption (for example, if a disk drive fails), then catastrophic recovery is necessary, and Berkeley DB recovery will only be able to restore the system to the state of the last archived log file. In this case, information -may also have been lost. +may also have been lost.</p> <p>Replicating the database environment extends this model, by adding a new component to "stable storage": the client's replicated information. If a database environment is replicated, there is no lost information in the case of database or log file loss, because the replicated system can be configured to contain a complete set of databases and log records up to the point of failure. A database environment that loses a disk -drive can have the drive replaced, and it can rejoin the replication -group as a client. +drive can have the drive replaced, and it can then rejoin the +replication group.</p> <p>Because of this new component of stable storage, specifying <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> in a replicated environment no longer sacrifices durability, as long as one or more clients have acknowledged receipt of the messages sent by the master. 
Since network connections are often -faster than local disk writes, replication becomes a way for -applications to significantly improve their performance as well as their -reliability. -<p>The return status from the <b>send</b> interface specified to the -<a href="../../api_c/rep_transport.html">DB_ENV->set_rep_transport</a> method must be set by the application to ensure the -transactional guarantees the application wants to provide. The effect -of the <b>send</b> interface returning failure is to flush the local -database environment's log as necessary to ensure that any information -critical to database integrity is not lost. Because this flush is an -expensive operation in terms of database performance, applications will -want to avoid returning an error from the <b>send</b> interface, if at -all possible: -<p>First, there is no reason for the <b>send</b> interface to ever return -failure unless the <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag is specified. Messages -without that flag do not make visible changes to databases, and -therefore the application's <b>send</b> interface can return success -to Berkeley DB for such messages as soon as the message has been sent or even -just copied to local memory. -<p>Further, unless the master's database environment has been configured -to not synchronously flush the log on transaction commit, there is no -reason for the <b>send</b> interface to ever return failure, as any -information critical to database integrity has already been flushed to -the local log before <b>send</b> was called. Again, the <b>send</b> -interface should return success to Berkeley DB as soon as possible. However, -in this case, in order to avoid potential loss of information after the -master database environment fails, the master database environment -should be recovered before holding an election, as only the master -database environment is guaranteed to have the most up-to-date logs. 
-<p>To sum up, the only reason for the <b>send</b> interface to return +faster than local synchronous disk writes, replication becomes a way +for applications to significantly improve their performance as well as +their reliability.</p> +<p>The return status from the application's <b>send</b> function must be +set by the application to ensure the transactional guarantees the +application wants to provide. Whenever the <b>send</b> function +returns failure, the local database environment's log is flushed as +necessary to ensure that any information critical to database integrity +is not lost. Because this flush is an expensive operation in terms of +database performance, applications should avoid returning an error from +the <b>send</b> function, if at all possible.</p> +<p>The only interesting message type for replication transactional +guarantees is when the application's <b>send</b> function was called +with the <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag specified. There is no reason +for the <b>send</b> function to ever return failure unless the +<a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag was specified -- messages without the +<a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag do not make visible changes to databases, +and the <b>send</b> function can return success to Berkeley DB as soon as +the message has been sent to the client(s) or even just copied to local +application memory in preparation for being sent.</p> +<p>When a client receives a <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> message, the client +will flush its log to stable storage before returning (unless the client +environment has been configured with the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> option). 
+If the client is unable to flush a complete transactional record to disk +for any reason (for example, there is a missing log record before the +flagged message), the call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method on the client +will return <a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> and return the LSN of this record +to the application in the <b>ret_lsnp</b> parameter. +The application's client or master +message handling loops should take proper action to ensure the correct +transactional guarantees in this case. When missing records arrive +and allow subsequent processing of previously stored permanent +records, the call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method on the client will +return <a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a> and return the largest LSN of the +permanent records that were flushed to disk. Client applications +can use these LSNs to know definitively if any particular LSN is +permanently stored or not.</p> +<p>An application relying on a client's ability to become a master and +guarantee that no data has been lost will need to write the <b>send</b> +function to return an error whenever it cannot guarantee the site that +will win the next election has the record. 
Applications not requiring +this level of transactional guarantees need not have the <b>send</b> +function return failure (unless the master's database environment has +been configured with <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a>), as any information critical +to database integrity has already been flushed to the local log before +<b>send</b> was called.</p> +<p>To sum up, the only reason for the <b>send</b> function to return failure is when the master database environment has been configured to -not synchronously flush the log on transaction commit, the +not synchronously flush the log on transaction commit (that is, +<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> was configured on the master), the <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag is specified for the message, and the -<b>send</b> interface was unable to determine that some number of +<b>send</b> function was unable to determine that some number of clients have received the current message (and all messages preceding -the current message). How many clients should receive the message -before the <b>send</b> interface can return success is an application +the current message). How many clients need to receive the message +before the <b>send</b> function can return success is an application choice (and may not depend as much on a specific number of clients -reporting success as one or more geographically distributed clients). +reporting success as one or more geographically distributed clients).</p> +<p>If, however, the application does require on-disk durability on the master, +the master should be configured to synchronously flush the log on commit. 
+If clients are not configured to synchronously flush the log, +that is, if a client is running with <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> configured, +then it is up to the application to reconfigure that client +appropriately when it becomes a master. That is, the +application must explicitly call <a href="../../api_c/env_set_flags.html">DB_ENV->set_flags</a> to +disable asynchronous log flushing as part of re-configuring +the client as the new master.</p> <p>Of course, it is important to ensure that the replicated master and client environments are truly independent of each other. For example, it does not help matters that a client has acknowledged receipt of a message if both master and clients are on the same power supply, as the -failure of the power supply will still potentially lose information. -<p>Finally, the Berkeley DB replication implementation has one other additional -feature to increase application reliability. Replication in Berkeley DB is -implemented to perform database updates using a different code path than -the standard ones. This means operations which manage to crash the -replication master due to a software bug will not necessarily also crash -replication clients. -<table width="100%"><tr><td><br></td><td align=right><a href="../../ref/rep/logonly.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../../reftoc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../../ref/rep/partition.html"><img src="../../images/next.gif" alt="Next"></a> +failure of the power supply will still potentially lose information.</p> +<p>Configuring your replication-based application to achieve the proper +mix of performance and transactional guarantees can be complex. 
In +brief, there are a few controls an application can set to configure the +guarantees it makes: specification of <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> for the +master environment, specification of <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> for the +client environment, the priorities of different sites participating in +an election, and the behavior of the application's <b>send</b> +function.</p> +<p>First, it is rarely useful to write and synchronously flush the log when +a transaction commits on a replication client. It may be useful where +systems share resources and multiple systems commonly fail at the same +time. By default, all Berkeley DB database environments, whether master or +client, synchronously flush the log on transaction commit or prepare. +Generally, replication masters and clients turn log flush off for +transaction commit using the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> flag.</p> +<p>Consider two systems connected by a network interface. One acts as the +master, the other as a read-only client. The client takes over as +master if the master crashes and the master rejoins the replication +group after such a failure. Both master and client are configured to +not synchronously flush the log on transaction commit (that is, +<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> was configured on both systems). The +application's <b>send</b> function never returns failure to the Berkeley DB +library, simply forwarding messages to the client (perhaps over a +broadcast mechanism), and always returning success. On the client, any +<a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> returns from the client's <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method +are ignored, as well. 
This system configuration has excellent +performance, but may lose data in some failure modes.</p> +<p>If both the master and the client crash at once, it is possible to lose +committed transactions, that is, transactional durability is not being +maintained. Reliability can be increased by providing separate power +supplies for the systems and placing them in separate physical locations.</p> +<p>If the connection between the two machines fails (or just some number +of messages are lost), and subsequently the master crashes, it is +possible to lose committed transactions. Again, because transactional +durability is not being maintained. Reliability can be improved in a +couple of ways:</p> +<ol> +<p><li>Use a reliable network protocol (for example, TCP/IP instead of UDP). +<p><li>Increase the number of clients and network paths to make it less likely +that a message will be lost. In this case, it is important to also make +sure a client that did receive the message wins any subsequent election. +If a client that did not receive the message wins a subsequent election, +data can still be lost. +</ol> +<p>Further, systems may want to guarantee message delivery to the client(s) +(for example, to prevent a network connection from simply discarding +messages). Some systems may want to ensure clients never return +out-of-date information, that is, once a transaction commit returns +success on the master, no client will return old information to a +read-only query. Some of the following changes may be used to address +these issues:</p> +<ol> +<p><li>Write the application's <b>send</b> function to not return to Berkeley DB +until one or more clients have acknowledged receipt of the message. +The number of clients chosen will be dependent on the application: you +will want to consider likely network partitions (ensure that a client +at each physical site receives the message) and geographical diversity +(ensure that a client on each coast receives the message). 
+<p><li>Write the client's message processing loop to not acknowledge receipt +of the message until a call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method has returned +success. Messages resulting in a return of <a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> from +the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method mean the message could not be flushed to the +client's disk. If the client does not acknowledge receipt of such +messages to the master until a subsequent call to the +<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns <a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a> and the LSN +returned is at least as large as this message's LSN, then the master's +<b>send</b> function will not return success to the Berkeley DB library. +This means the thread committing the transaction on the master will not +be allowed to proceed based on the transaction having committed until +the selected set of clients have received the message and consider it +complete. +<p>Alternatively, the client's message processing loop could acknowledge +the message to the master, but with an error code indicating that the +application's <b>send</b> function should not return to the Berkeley DB +library until a subsequent acknowledgement from the same client +indicates success.</p> +<p>The application send callback function invoked by Berkeley DB contains +an LSN of the record being sent (if appropriate for that record). +When <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns indicators that a permanent +record has been written then it also returns the maximum LSN of the +permanent record written.</p> +</ol> +<p>There is one final pair of failure scenarios to consider. 
First, it is +not possible to abort transactions after the application's <b>send</b> +function has been called, as the master may have already written the +commit log records to disk, and so abort is no longer an option. +Second, a related problem is that even though the master will attempt +to flush the local log if the <b>send</b> function returns failure, +that flush may fail (for example, when the local disk is full). Again, +the transaction cannot be aborted as one or more clients may have +committed the transaction even if <b>send</b> returns failure. Rare +applications may not be able to tolerate these unlikely failure modes. +In that case the application may want to:</p> +<ol> +<p><li>Configure the master to do always local synchronous commits (turning +off the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> configuration). This will decrease +performance significantly, of course (one of the reasons to use +replication is to avoid local disk writes.) In this configuration, +failure to write the local log will cause the transaction to abort in +all cases. +<p><li>Do not return from the application's <b>send</b> function under any +conditions, until the selected set of clients has acknowledged the +message. Until the <b>send</b> function returns to the Berkeley DB library, +the thread committing the transaction on the master will wait, and so +no application will be able to act on the knowledge that the transaction +has committed. +</ol> +<p>The final alternative for applications concerned about these types of +failure is to use distributed transactions as an alternative means of +replication, guaranteeing full consistency at the cost of implementing +a Global Transaction Manager and performing two-phase commit across +multiple Berkeley DB database environments. 
More information on this topic +can be found in the <a href="../../ref/xa/intro.html">Distributed +Transactions</a> chapter.</p> +<table width="100%"><tr><td><br></td><td align=right><a href="../rep/logonly.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/partition.html"><img src="../../images/next.gif" alt="Next"></a> </td></tr></table> -<p><font size=1><a href="http://www.sleepycat.com">Copyright Sleepycat Software</a></font> +<p><font size=1><a href="../../sleepycat/legal.html">Copyright (c) 1996-2003</a> <a href="http://www.sleepycat.com">Sleepycat Software, Inc.</a> - All rights reserved.</font> </body> </html> |
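The added trans.html text sums up when the application's send function should return failure, and when a client that saw DB_REP_NOTPERM may later acknowledge a message. The sketch below models only those two rules in C, as a minimal illustration under stated assumptions. It does not call Berkeley DB; `send_policy`, `lsn_model`, `send_status`, `lsn_model_cmp`, and `can_ack` are hypothetical names introduced here, and the two-field LSN ordering (file, then offset) is an assumption of the model. Applications that must guarantee the next election winner has every record would need a stricter policy than this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types: none of these names are Berkeley DB API. */
struct send_policy {
    bool master_nosync;  /* master environment runs with DB_TXN_NOSYNC */
    int  acks_required;  /* application-chosen number of client acks */
};

/* Modeled LSN (assumption): ordered by log file number, then offset. */
struct lsn_model {
    unsigned file;
    unsigned offset;
};

/*
 * Status the application's send function would hand back to the library:
 * 0 for success, -1 for failure. Failure is returned only when all three
 * conditions from the summary paragraph hold: the master does not flush
 * synchronously, the message is flagged permanent, and too few clients
 * have acknowledged it.
 */
static int send_status(const struct send_policy *p, bool permanent,
                       int acks_received)
{
    assert(p != NULL);
    if (!permanent)          /* no DB_REP_PERMANENT flag: never fail */
        return 0;
    if (!p->master_nosync)   /* master already flushed its local log */
        return 0;
    return acks_received >= p->acks_required ? 0 : -1;
}

/* Three-way LSN comparison: negative, zero, or positive. */
static int lsn_model_cmp(struct lsn_model a, struct lsn_model b)
{
    if (a.file != b.file)
        return a.file < b.file ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * A client that held back its acknowledgement after DB_REP_NOTPERM may
 * acknowledge once a later DB_REP_ISPERM return reports an LSN at least
 * as large as the held message's LSN.
 */
static bool can_ack(struct lsn_model held, struct lsn_model isperm_max)
{
    return lsn_model_cmp(isperm_max, held) >= 0;
}
```

In a real application the acknowledgement count would come from the transport layer, and the LSNs from the send callback and from the `ret_lsnp` parameter of `DB_ENV->rep_process_message`; both are left out of this model.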