From b6f8dd49dbdbfa60a33bba3d4b766fe341109b4b Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Tue, 13 Apr 2010 15:06:44 +1000
Subject: xfs: ensure that sync updates the log tail correctly
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Updates to the VFS layer removed an extra ->sync_fs call into the
filesystem during the sync process (from the quota code).
Unfortunately the sync code was unknowingly relying on this call to
make sure metadata buffers were flushed via a xfs_buftarg_flush()
call to move the tail of the log forward in memory before the final
transactions of the sync process were issued.

As a result, the old code would write a very recent log tail value
to the log by the end of the sync process, and so a subsequent crash
would leave nothing for log recovery to do. Hence in qa test 182,
log recovery only replayed a small handle for inode fsync
transactions in this case.

However, with the removal of the extra ->sync_fs call, the log tail
was now not moved forward with the inode fsync transactions near the
end of the sync procese the first (and only) buftarg flush occurred
after these transactions went to disk. The result is that log
recovery now sees a large number of transactions for metadata that
is already on disk.

This usually isn't a problem, but when the transactions include
inode chunk allocation, the inode create transactions and all
subsequent changes are replayed as we cannt rely on what is on disk
is valid. As a result, if the inode was written and contains
unlogged changes, the unlogged changes are lost, thereby violating
sync semantics.

The fix is to always issue a transaction after the buftarg flush
occurs is the log iѕ not idle or covered. This results in a dummy
transaction being written that contains the up-to-date log tail
value, which will be very recent. Indeed, it will be at least as
recent as the old code would have left on disk, so log recovery
will behave exactly as it used to in this situation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/xfs_log.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

(limited to 'fs')

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e8fba92d7cd..2be01913628 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -745,9 +745,16 @@ xfs_log_move_tail(xfs_mount_t	*mp,
 
 /*
  * Determine if we have a transaction that has gone to disk
- * that needs to be covered. Log activity needs to be idle (no AIL and
- * nothing in the iclogs). And, we need to be in the right state indicating
- * something has gone out.
+ * that needs to be covered. To begin the transition to the idle state
+ * firstly the log needs to be idle (no AIL and nothing in the iclogs).
+ * If we are then in a state where covering is needed, the caller is informed
+ * that dummy transactions are required to move the log into the idle state.
+ *
+ * Because this is called as part of the sync process, we should also indicate
+ * that dummy transactions should be issued in anything but the covered or
+ * idle states. This ensures that the log tail is accurately reflected in
+ * the log at the end of the sync, hence if a crash occurrs avoids replay
+ * of transactions where the metadata is already on disk.
  */
 int
 xfs_log_need_covered(xfs_mount_t *mp)
@@ -759,17 +766,24 @@ xfs_log_need_covered(xfs_mount_t *mp)
 		return 0;
 
 	spin_lock(&log->l_icloglock);
-	if (((log->l_covered_state == XLOG_STATE_COVER_NEED) ||
-		(log->l_covered_state == XLOG_STATE_COVER_NEED2))
-			&& !xfs_trans_ail_tail(log->l_ailp)
-			&& xlog_iclogs_empty(log)) {
-		if (log->l_covered_state == XLOG_STATE_COVER_NEED)
-			log->l_covered_state = XLOG_STATE_COVER_DONE;
-		else {
-			ASSERT(log->l_covered_state == XLOG_STATE_COVER_NEED2);
-			log->l_covered_state = XLOG_STATE_COVER_DONE2;
+	switch (log->l_covered_state) {
+	case XLOG_STATE_COVER_DONE:
+	case XLOG_STATE_COVER_DONE2:
+	case XLOG_STATE_COVER_IDLE:
+		break;
+	case XLOG_STATE_COVER_NEED:
+	case XLOG_STATE_COVER_NEED2:
+		if (!xfs_trans_ail_tail(log->l_ailp) &&
+		    xlog_iclogs_empty(log)) {
+			if (log->l_covered_state == XLOG_STATE_COVER_NEED)
+				log->l_covered_state = XLOG_STATE_COVER_DONE;
+			else
+				log->l_covered_state = XLOG_STATE_COVER_DONE2;
 		}
+		/* FALLTHRU */
+	default:
 		needed = 1;
+		break;
 	}
 	spin_unlock(&log->l_icloglock);
 	return needed;
-- 
cgit v1.2.3