The idea behind this task is to create "snapshots", or archived Command Log segments, which can be applied to a provisioned server on startup in a way similar to the SQL-based output of the drizzledump program.
Concepts
The Command Log is a tail-write-only log, protected by an atomic<off_t> which is moved by the CommandLog::apply() method along the log file while Command GPB messages are written to the log.
This has a couple of ramifications:
- Once a Command message is written to the "active" command.log file, it can never be updated.
- The atomic<off_t> guarding the tail only ever increases.
Given the above two conditions, it seems feasible to have a lock-free solution which constructs "archives" or "snapshots" that reproduce the exact dataset of a server at a specific point in time.
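To make the two conditions concrete, here is a minimal C++ sketch of a tail-write-only log. It is not Drizzle's CommandLog: the class TailOnlyLog, its methods, and the in-memory buffer standing in for the command.log file are all hypothetical, and a std::size_t offset is used in place of off_t. It shows a writer reserving space with a single atomic fetch-and-add, and a snapshotter picking a boundary simply by loading the tail. The sketch assumes every append below the boundary has finished copying its bytes before the prefix is read; a real concurrent implementation would need to track that separately.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the active command log. A preallocated buffer
// plays the role of the file; none of these names are Drizzle's real API.
class TailOnlyLog {
public:
    explicit TailOnlyLog(std::size_t capacity) : storage_(capacity), tail_(0) {}

    // Reserve space at the tail with one atomic fetch-and-add, then copy the
    // message into the reserved range. Returns the offset the message was
    // written at, or -1 if the log is full.
    long append(const std::string& message) {
        const std::size_t start = tail_.fetch_add(message.size());
        if (start + message.size() > storage_.size()) {
            return -1;  // out of space; the reservation is simply wasted
        }
        std::memcpy(storage_.data() + start, message.data(), message.size());
        return static_cast<long>(start);
    }

    // A snapshot boundary is just the tail value at one instant (clamped to
    // the buffer size). Bytes below it are never rewritten, so later appends
    // cannot change what a snapshotter reads.
    std::size_t snapshot_boundary() const {
        return std::min(tail_.load(), storage_.size());
    }

    // Copy out the immutable prefix [0, boundary).
    std::string read_prefix(std::size_t boundary) const {
        return std::string(storage_.data(), std::min(boundary, storage_.size()));
    }

private:
    std::vector<char> storage_;
    std::atomic<std::size_t> tail_;
};
```

Because the tail only moves forward and written bytes are never updated, a snapshotter that records the boundary once can keep reading the prefix while new messages continue to arrive beyond it.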
Snapshot Process
The following steps need to be done in the archival/snapshot process:
- Determine the global transaction identifier corresponding to the timestamp up to which the snapshot should extend
- Read through the active command log, building an index of the log up to the last transaction to be applied in the snapshot
- Use the index to do the following, in order:
  - Determine the definitions of the tables in each schema at the latest transaction
  - For each schema:
    - For each table in schema:
      - Create an in-memory primary key for the table, containing all key values
      - Purge all keys for deleted records
      - Construct a single Command message with an InsertRecord submessage for each key, with the latest values of all columns for that record
- Determine the last "change record" for records in each schema table
- Add the new change record as a Command message with an InsertRecord submessage to the archive
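The per-table part of the steps above can be sketched as a replay that collapses the log into one row per surviving primary key. This is an illustration under stated assumptions, not the actual implementation: LoggedChange, ArchivedRow, and build_snapshot are hypothetical names, plain structs stand in for the Command and InsertRecord GPB messages, column values are treated as strings, and the table-definition and "change record" steps are left out.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for one logged change. The real log
// stores Command GPB messages; these field names are assumptions.
enum class ChangeType { Insert, Update, Delete };

using Row = std::map<std::string, std::string>;  // column name -> value

struct LoggedChange {
    std::uint64_t transaction_id;
    std::string schema;
    std::string table;
    std::string primary_key;
    ChangeType type;
    Row columns;  // values written by this change (empty for Delete)
};

// One InsertRecord-like archive entry: a key plus its latest column values.
struct ArchivedRow {
    std::string schema;
    std::string table;
    std::string primary_key;
    Row columns;
};

using TableKey = std::pair<std::string, std::string>;  // (schema, table)

// Replay every change with transaction_id <= last_transaction_id, in log
// order, and return one archive entry per key that still exists.
std::vector<ArchivedRow> build_snapshot(const std::vector<LoggedChange>& log,
                                        std::uint64_t last_transaction_id) {
    // In-memory index: (schema, table) -> primary key -> latest values.
    std::map<TableKey, std::map<std::string, Row>> tables;

    for (const LoggedChange& change : log) {
        if (change.transaction_id > last_transaction_id) {
            continue;  // outside the snapshot
        }
        std::map<std::string, Row>& rows =
            tables[TableKey(change.schema, change.table)];
        if (change.type == ChangeType::Delete) {
            rows.erase(change.primary_key);  // purge keys of deleted records
        } else if (change.type == ChangeType::Insert) {
            rows[change.primary_key] = change.columns;  // fresh record
        } else {
            Row& row = rows[change.primary_key];
            for (const auto& column : change.columns) {
                row[column.first] = column.second;  // keep the latest value
            }
        }
    }

    std::vector<ArchivedRow> archive;
    for (const auto& table : tables) {
        for (const auto& row : table.second) {
            archive.push_back(ArchivedRow{table.first.first, table.first.second,
                                          row.first, row.second});
        }
    }
    return archive;
}
```

Since the archive holds only the final state of each surviving key, applying it to a freshly provisioned server needs one insert per record, regardless of how many updates and deletes the original log contained.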