30 January 2019 · 2m42s to read

AFIS, or Anti-Forensic Information Splitter is an algorithm designed to support secure data destruction crucial for secure on-disk key management. The key idea is to bloat information and therefor improving the chance of destroying a single bit of it. The information is bloated in such a way, that a single missing bit causes the original information become unrecoverable. The default diffusion element is based on SHA-1, but different hashing algorithms may be selected by the user.

An implementation in Go is available in maze.io/x/crypto/afis.

Even if you take all known precautions to delete a file from disk, chances are
that (bits of the) data still remains on the disk. If the probability \(p\)
to destroy a certain block of data is \(0 < p < 1\), then the chance of a block
survives is \(1-p\). Given a dataset of size \(l\), the probability to destroy
the whole block becomes \(p^l\). But the probability the entire block survives,
\((1-p)^l\), becomes smaller with increasing values of \(l\). If \((1-p)^l\) is
becoming smaller, then \(1-(1-p)^l\) must become larger, which reflects the
chance that the block will **not** survive.

AFIS addresses the probability of a block to **not survive** by creating an
interdependency for a data set \(S\), \(S=s_1, s_2,\dotsb s_n\), by generating
\(s_1 \dotsb s_n−1\) random data items and computing \(s_n\) so that
\(s_1 \bigoplus s_2 \bigoplus s_3\dotsb\bigoplus s_n = D\). The reconstruction is
done by carring out the left-side of the equation, XOR-ing all data items
together. If one item \(s_i\) is missing, \(D\) can’t be reconstructed, since an
arbitrary \(s_i\) effects the entire \(D\).

This scheme can be enhanced to include diffusion of information, so that the k-th bit of an arbitrary \(s_i\) does not only affect the k-th bit of \(D\) but the entire \(D\). To achieve this diffusion, we insert a diffusion element in the chain of XORs. A crytographic hash function can be used as such element, but since it might not output sufficiently large data, it will be processed a few times with an increasing number, similar to an initial vector, prepended to the complete dataset to produce enough data. As a hash function is usually required to be non-invertible, we can not choose it’s output. Therefor, the last diffusion will be omitted. This will degrade security slightly, so when computing destruction probabilities the last element shall never be taken into account.

As we can destroy a single undeterminated data item quite easily as shown in the previous paragraph, and as a single missing data item makes the base information unrecoverable, data items can be made reliable erasable. As illustration, you can find the overall composition above. \(H\) denotes the diffusion element, which is likely to be a hash and \(Z\) denotes the zero vector. In the splitting phase, \(s_1\) to \(s_n−1\) are random generated and the intermediate result \(I\) is computed. Then \(s_n\) computed as \(s_n = D \bigoplus I\). When recovering the base information, the whole chain is computed as shown resulting in \(D\), the original data item.

You may read the full theory and specification in the TKS1 draft.