author Tim Pepper <timothy.c.pepper@linux.intel.com> 2012-10-03 11:40:02 -0700
committer Tim Pepper <timothy.c.pepper@linux.intel.com> 2012-10-03 11:50:47 -0700
commit 3886bcf7edcb344f517f163764d0f2b608365d9b
tree 24d617a4f441d21bf1c98a33d69b3399dfeed1bc
parent 285c61073dbfb39f25f013ede0da33a7c1f1bcec
Add README skipped in prior commit
We finally have a general README describing the software at a high level, how to configure, build, and run, and giving some basic internal design information.

Signed-off-by: Tim Pepper <timothy.c.pepper@linux.intel.com>
-rw-r--r-- Makefile.am 1
-rw-r--r-- README 107
2 files changed, 108 insertions, 0 deletions
diff --git a/Makefile.am b/Makefile.am
index 4531c66..e0ffeb3 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -14,4 +14,5 @@ dist_systemdunit_DATA = src/corewatcher.service
EXTRA_DIST = \
COPYING \
+ README \
$(man_MANS)
diff --git a/README b/README
new file mode 100644
index 0000000..01fcb1f
--- /dev/null
+++ b/README
@@ -0,0 +1,107 @@
+ README for corewatcher
+
+
+The corewatcher package provides a daemon for monitoring a system for
+crashes. Crashes are analyzed and summary crash report information is
+sent to a crashdb server.
+
+The daemon is managed by a systemd unit file, corewatcher.service.
+
+Configuration is stored in /etc/corewatcher.
+
+Corefiles are assumed to be written to /var/lib/corewatcher by the
+kernel with /proc/sys/kernel settings of:
+ core_pattern=/var/lib/corewatcher/core_%e_%t
+ core_uses_pid=1
+
+To build and install, use the standard autotools workflow, for example:
+ ./configure
+ make
+ sudo make install
+
+
+===========================================================================
+
+
+The corewatcher daemon can be considered to be a state machine with the
+following 5 possible states and the listed major functions called for
+state transitions:
+
+S1: core_folder has no core_*
+ |
+ | a crash happens, triggering an inotify event
+ |
+S2: core_folder has core_* present
+ |
+ | scan_core_folder()
+ | get_appfile()
+ | move_core(fullpath, "to-process")
+ |
+S3: processed_folder has some core_*.to-process
+ |
+ | scan_processed_folders()
+ | process_corefile()
+ | process_new()
+ | (calls gdb, creates report summary *.txt)
+ | queue_backtrace()
+ |
+S4: processed_folder has some core_*.processed
+ |
+ | scan_processed_folder()
+ | reprocess_corefile()
+ | process_old()
+ | queue_backtrace()
+ .
+ .
+ .
+unqueueing
+ |
+ | submit_loop(): a sleepy thread whose work condition is set in
+ | queue_backtrace() and in the periodic timer
+ | "cleanup" thread
+ |
+S5: processed_folder has only core_*.submitted and *.txt
+
+
+NOTES:
+o at daemon start, files for any of these states could already exist in
+ the filesystem, so we need to run all of get_appfile()/move_core(),
+ process_new(), process_old() and submit_loop()
+o during submission, crash reports are removed from the in-memory pending
+ work list; if the curl POST then fails, the associated cores stay in
+ the filesystem as "processed" files and are re-added to the in-memory
+ work list
+ - if client network is down and comes back up, an event notifier
+ could trigger resubmit via reprocess_corefile() and submit_loop()
+ - if server or intermediate connectivity was the problem, only a
+ periodic timer can attempt to resubmit via reprocess_corefile() and
+ setting the work condition for submit_loop()
+ - failed submissions should hang out at the end of the work queue in
+ case there is something truly wrong with them so new reports have a
+ better chance of getting through
+
+
+===========================================================================
+
+
+Internals: locking & global state
+
+ o core_status is a global struct
+ o ordering:
+ "processing_mtx -> gdb_mtx -> processing_queue_mtx"
+ o core_status.processing_mtx:
+ - protects: core_status.processing_oops GHashTable
+ o processing_queue_mtx: (coredump.c)
+ - protects: processing_queue array of corefile fullpath strings
+ o gdb_mtx: (coredump.c)
+ - intent was to ensure gdb doesn't run concurrently, under an assumption
+ that simultaneously processing multiple cores is too resource intensive
+ and system-unfriendly
+ o bt_mtx: (submit.c)
+ - protects:
+ o bt_list struct oops linked list
+ o bt_work GCond condition variable
+ o bt_hash GHashTable of core file names
+ o A struct oops may be off of bt_list and still be referenced by name
+ in bt_hash. Such a struct oops must exist if the core name
+ is in the bt_hash.
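
The lock ordering documented in the README above can be illustrated with a minimal sketch. This is not code from corewatcher: it assumes plain pthread mutexes, a fixed-size array standing in for processing_queue, and a hypothetical helper named enqueue_core(). It only shows the rule that a code path needing more than one of these locks takes them in the documented order (processing_mtx, then gdb_mtx, then processing_queue_mtx) and releases them in reverse.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-ins for the locks named in the README. */
pthread_mutex_t processing_mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t gdb_mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t processing_queue_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Assumed fixed-size queue of corefile fullpath strings. */
#define QUEUE_MAX 8
const char *processing_queue[QUEUE_MAX];
int queue_len;

/*
 * Hypothetical helper: append a corefile path to the queue while
 * holding all three locks. Locks are acquired in the documented
 * order and released in reverse, so two threads can never each
 * hold one lock while waiting on the other.
 * Returns 0 on success, -1 if the queue is full.
 */
int enqueue_core(const char *fullpath)
{
    int ret = -1;

    pthread_mutex_lock(&processing_mtx);
    pthread_mutex_lock(&gdb_mtx);
    pthread_mutex_lock(&processing_queue_mtx);

    if (queue_len < QUEUE_MAX) {
        processing_queue[queue_len++] = fullpath;
        ret = 0;
    }

    pthread_mutex_unlock(&processing_queue_mtx);
    pthread_mutex_unlock(&gdb_mtx);
    pthread_mutex_unlock(&processing_mtx);
    return ret;
}
```

A path that needs only a subset of the locks would follow the same relative order, for example taking gdb_mtx before processing_queue_mtx and never the other way around.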