                          README for corewatcher


The corewatcher package provides a daemon for monitoring a system for
crashes.  Crashes are analyzed and summary crash report information is
sent to a crashdb server.

The daemon is managed by a systemd unit file, corewatcher.service.

Configuration is stored in /etc/corewatcher.

Corefiles are assumed to be written to /var/lib/corewatcher by the
kernel with /proc/sys/kernel settings of:
   core_pattern=/var/lib/corewatcher/core_%e_%t
   core_uses_pid=1
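
With these settings the kernel expands %e to the executable name and %t to
the dump time in seconds since the epoch.  Because core_uses_pid is nonzero
and the pattern has no %p, the kernel also appends ".PID" to the file name.
The sketch below splits such a name back into its parts.  It is an
illustration only: parse_core_name() and struct core_name are hypothetical
names, not functions from corewatcher, and the daemon may treat file names
differently.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of corewatcher: the parsed pieces of a
 * file name produced by core_pattern=.../core_%e_%t with core_uses_pid=1. */
struct core_name {
	char exe[64];   /* %e: executable name (may itself contain '_') */
	long timestamp; /* %t: seconds since the epoch */
	long pid;       /* ".PID" suffix appended by the kernel */
};

/* Returns 0 on success, -1 if the name does not match the layout. */
int parse_core_name(const char *name, struct core_name *out)
{
	const char *prefix = "core_";
	const char *dot, *us;
	char *end;
	size_t len;

	if (strncmp(name, prefix, strlen(prefix)) != 0)
		return -1;
	name += strlen(prefix);

	/* Work from the right: the PID follows the last '.', and the
	 * timestamp follows the last '_' before that dot. */
	dot = strrchr(name, '.');
	if (!dot || dot[1] == '\0')
		return -1;
	out->pid = strtol(dot + 1, &end, 10);
	if (*end != '\0')
		return -1;

	for (us = dot; us > name && *us != '_'; us--)
		;
	if (us == name)
		return -1;	/* no '_' found, or empty executable name */

	out->timestamp = strtol(us + 1, &end, 10);
	if (end != dot || end == us + 1)
		return -1;	/* timestamp missing or not purely numeric */

	len = (size_t)(us - name);
	if (len >= sizeof(out->exe))
		return -1;
	memcpy(out->exe, name, len);
	out->exe[len] = '\0';
	return 0;
}
```

Parsing from the right keeps the split unambiguous even when the executable
name contains underscores or dots.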

To build and install, use the standard autotools workflow:
   ./configure
   make
   sudo make install


===========================================================================


The corewatcher daemon can be viewed as a state machine with the
following five states.  The major functions called on each state
transition are listed between the states:

S1: core_folder has no core_*
 |
 |	crash happens, leading to an inotify notification
 |
S2: core_folder has core_* present
 |
 |	scan_core_folder()
 |	move_core(fullpath, "to-process")
 |
S3: processed_folder has some core_*.to-process
      or
    processed_folder has some core_*.processed, but no associated *.txt
 |
 |	scan_processed_folder()
 |	create_report()
 |		(calls gdb, creates report summary *.txt)
 |
S4: processed_folder has some core_*.processed, and associated *.txt
 |
 |	queue_backtrace()
 .
 .
 .
unqueueing
 |
 |	submit_loop(): a sleepy thread whose work condition is set in
 |	               queue_backtrace() and in the periodic timer
 |	               "cleanup" thread.  It submits each *.txt and, where
 |	               successful, moves the associated core_*.processed
 |	               to core_*.submitted
 |
S5: processed_folder has only core_*.submitted and *.txt
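
Since the state lives in the file names, the next step for any entry in
processed_folder can be derived from its suffix.  The following is a
minimal sketch of that idea, not code from the daemon: the enum,
classify_processed_entry() and next_step() are hypothetical names, and the
exact way suffixes are attached is assumed from the glob patterns in the
diagram.

```c
#include <string.h>

/* Hypothetical classification of one directory entry. */
enum core_kind {
	KIND_OTHER,       /* not a corewatcher artifact */
	KIND_TO_PROCESS,  /* core_*.to-process */
	KIND_PROCESSED,   /* core_*.processed */
	KIND_SUBMITTED,   /* core_*.submitted */
	KIND_REPORT       /* *.txt report summary */
};

static int has_suffix(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && strcmp(s + n - m, suffix) == 0;
}

enum core_kind classify_processed_entry(const char *name)
{
	if (has_suffix(name, ".txt"))
		return KIND_REPORT;
	if (strncmp(name, "core_", 5) != 0)
		return KIND_OTHER;
	if (has_suffix(name, ".to-process"))
		return KIND_TO_PROCESS;
	if (has_suffix(name, ".processed"))
		return KIND_PROCESSED;
	if (has_suffix(name, ".submitted"))
		return KIND_SUBMITTED;
	return KIND_OTHER;
}

/* Name of the function the diagram lists for the next transition.
 * has_report says whether the associated *.txt already exists. */
const char *next_step(enum core_kind kind, int has_report)
{
	switch (kind) {
	case KIND_TO_PROCESS:
		return "create_report";            /* S3 -> S4 */
	case KIND_PROCESSED:
		return has_report ? "queue_backtrace"  /* S4 -> submission */
				  : "create_report";   /* S3 -> S4 */
	default:
		return "none";                     /* S5 or unrelated file */
	}
}
```

A startup scan could call these for every entry it finds, which matches the
requirement that the daemon resume from whatever state is on disk.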


NOTES:
o At daemon start any of the states in the filesystem could exist, so we
  need to do all of scan_core_folder(), scan_processed_folder() and
  submit_loop().
o During submission, crash reports are removed from the in-memory pending
  submission work list.  If the curl POST then fails, the associated cores
  stay in the filesystem as "processed" files, and are placed back on the
  in-memory submission work list.
  -  if client network is down and comes back up, an event notifier
     could trigger resubmit by toggling the submit_loop() condition
     variable
  -  if server or intermediate connectivity was the problem, only a
     periodic timer can trigger resubmission by setting the work condition
     for submit_loop()
  -  failed submissions should hang out at the end of the work queue in
     case there is something truly wrong with them so new reports have a
     better chance of getting through
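
The requeue-at-the-tail policy from the last note can be sketched with a
plain singly linked queue.  This is an illustration under assumptions, not
the daemon's code: struct report, struct work_queue, the wq_* functions and
fake_submit() are hypothetical, fake_submit() stands in for the curl POST,
and the locking that the real work list needs is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory work item: one pending report. */
struct report {
	const char *txt_path;
	struct report *next;
};

struct work_queue {
	struct report *head, *tail;
};

void wq_push_tail(struct work_queue *q, struct report *r)
{
	r->next = NULL;
	if (q->tail)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
}

struct report *wq_pop_head(struct work_queue *q)
{
	struct report *r = q->head;

	if (!r)
		return NULL;
	q->head = r->next;
	if (!q->head)
		q->tail = NULL;
	r->next = NULL;
	return r;
}

/* Take the report at the head off the list and try to submit it.
 * Returns 1 if submitted, 0 if it failed and was put back at the
 * tail, -1 if the queue was empty. */
int wq_submit_one(struct work_queue *q, int (*submit)(const struct report *))
{
	struct report *r = wq_pop_head(q);

	if (!r)
		return -1;
	if (submit(r) == 0)
		return 1;
	wq_push_tail(q, r);	/* failed: retry after everything else */
	return 0;
}

/* Stand-in for the network POST: rejects paths starting with "bad". */
int fake_submit(const struct report *r)
{
	return strncmp(r->txt_path, "bad", 3) == 0 ? -1 : 0;
}
```

With a queue of "bad.txt", "new1.txt", "new2.txt", the first call moves
"bad.txt" behind the other two, so a report that can never be accepted does
not block the ones queued after it.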


===========================================================================


Internals: locking & global state

  o  bt_mtx: (submit.c)
     - protects:
        o  bt_work GCond condition variable
        o  bt_list struct oops linked list
        o  bt_hash GHashTable of core file names
        o  A struct oops may exist off of bt_list and still be referenced by
           name in bt_hash.  Such a struct oops must exist if the core
           name is in bt_hash.
  o  pq_mtx: (coredump.c)
     - protects:
        o  pq "processing queue" boolean: the actual queue is represented
           by the presence of files in the filesystem, but this allows
           threads to signal that there are new ones to process
        o  pq_work GCond condition variable
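
The pq_mtx / pq / pq_work trio follows the usual "flag guarded by a mutex,
plus a condition variable" pattern.  The sketch below shows that pattern
using POSIX threads rather than the GLib primitives named above, so that it
has no dependencies.  The names pq_mtx, pq and pq_work are borrowed from
this section; pq_notify(), pq_wait() and pq_pending() are hypothetical and
are not claimed to exist in coredump.c.

```c
#include <pthread.h>
#include <stdbool.h>

/* pthread stand-ins for the GLib mutex and GCond described above. */
static pthread_mutex_t pq_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t pq_work = PTHREAD_COND_INITIALIZER;
static bool pq = false;	/* "there are new files to process" */

/* Producer side: record that new work exists and wake a waiter. */
void pq_notify(void)
{
	pthread_mutex_lock(&pq_mtx);
	pq = true;
	pthread_cond_signal(&pq_work);
	pthread_mutex_unlock(&pq_mtx);
}

/* Consumer side: sleep until the flag is set, then clear it.  The
 * loop re-checks the flag to cope with spurious wakeups.  The caller
 * scans the filesystem afterwards, since the files are the queue. */
void pq_wait(void)
{
	pthread_mutex_lock(&pq_mtx);
	while (!pq)
		pthread_cond_wait(&pq_work, &pq_mtx);
	pq = false;
	pthread_mutex_unlock(&pq_mtx);
}

/* Read the flag under the lock without blocking. */
bool pq_pending(void)
{
	bool v;

	pthread_mutex_lock(&pq_mtx);
	v = pq;
	pthread_mutex_unlock(&pq_mtx);
	return v;
}
```

Because the flag is only a hint and the files are the real queue, a
notification that arrives while the consumer is already scanning is not
lost: the flag stays set and the next pq_wait() returns immediately.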