The BARF compressor will compress any nonempty file by at least one byte. Thus, by compressing already compressed files over and over again, it is possible to eventually reduce any file to 0 bytes.
BARF has been tested on the Calgary corpus, a well known benchmark. In just one pass, it achieves the best known result of any compressor, compressing all 14 files to 1 byte each. Run time is under 1 second on a 750 MHz PC. Of course these files can be compressed again down to 0 bytes each, just like any other file. (Yes, they can be decompressed correctly).
BARF is free, open source software, released under the GNU GPL.
Download BARF! (Windows executable, about 1 MB).
Source code:
barf.cpp (includes compilation insructions)
mkstring.cpp
barf c file1 file2 ... (Compresses, adds a .x or .x?? extension) barf d file1.x file2.x ... (Decompresses, removes the extension)If there are no filenames on the command line, then BARF will read them standard input. For example, to compress every file in the current directory (in Windows):
dir/b | barf c (Compress all files) dir/b | barf d (Decompress all files)
During compression, BARF adds an extension to the file name and deletes the original file, similar to the way COMPRESS or GZIP adds a .Z or .gz extension. Decompression works in reverse, removing the extension and restoring the file to its original name, size, and contents.
BARFER (BARF for Extracting Random data) meets the challenge. The decompressor is in source form. The data file and two source code files below are compressed with barfer.exe (Windows executable).
random.bin.x1b.x5z.x2l (415,238 bytes)
barf.cpp.x (1 byte) (in IE, right click to download)
mkstring.cpp.x (1 byte)
To decompress, restoring and verifying random.bin (AMillionRandomDigits.bin)
barfer d random.bin.x1b.x5z.x2l barfer d random.bin.x1b.x5z barfer d random.bin.x1b fc random.bin AMillionRandomDigits.binThe two source code files are exactly the same as barf.cpp and mkstring.cpp above, but the compilation steps are different. I created barfer.exe using Borland C++ 5.5.1 and UPX 1.2.4w as follows.
bcc32 mkstring.cpp dir/b *.cpp | mkstring >barf.h bcc32 barf.cpp upx barf.exe rename barf.exe barfer.exeTo verify that barf.cpp.x and mkstring.cpp.x are indeed the compressed source files for barfer.exe, you may create barfer.exe from the original sources using any C++ compiler, then decompress the .x files and compare with the original sources. For example, in UNIX: (I haven't tested this):
g++ mkstring.cpp -o mkstring ls barf.cpp mkstring.cpp | ./mkstring >barf.h g++ barf.cpp -o barfer ./barfer d random.bin.x1b.x5z.x2l ./barfer d random.bin.x1b.x5z ./barfer d random.bin.x1b diff random.bin AMillionRandomDigits.bin (should compare the same) mv barf.cpp.x temp_barf.cpp.x mv mkstring.cpp.x temp_mkstring.cpp.x ./barfer d *.x diff barf.cpp temp_barf.cpp (should compare the same) diff mkstring.cpp temp_mkstring.cpp (should compare the same)Even if you don't buy the argument that you can compress the decompressor, then you could still use barfer.exe (39,936 bytes) to recursively compress random.bin 39,937 times so that the total size is still one byte less than the original.
If you just want to compress the file and not worry about the size of the compressor, then just use the Better Archiver with Recursive Functionality with Extreme Squeezing Technology (BARFEST). It compresses AMillionRandomDigits.bin to 1 byte in 1 pass or 0 bytes in 2 passes. Windows executable: barfest.exe.
Note that we can combine different compression algorithms to achieve results better than either one alone. For example, a program could try both COMPRESS and GZIP and choose the smaller file (either file.Z or file.gz) and never do worse than either program alone. BARF extends this idea by trying 257 different algorithms and choosing the best one, with remarkable results.
Recursive compression has long been thought to be impractical. The pigeonhole principle states that it is impossible for a single algorithm to compress all messages of n bits or more, no matter what n is. This is because there are 2n possible messages of n bits, but only 2n-1 possible encodings of n-1 or fewer bits. This means that at least two messages must code to the same encoding. Such an encoding could not be decompressed unambiguously. To avoid this, some messages must inevitably "compress" to an equal or larger size than the original. This limitation has stymied the development of recursive file compression technology because we eventually reach a point at which the "compressed" file is not any smaller.
However, the pigeonhole principle does not apply to a set of algorithms. BARF solves the recursive compression problem by using multiple algorithms arranged in such a way that every nonempty file can be compressed by at least one of them. By choosing the best algorithm, it is guaranteed that every nonempty file can compressed by at least one byte. By repeated compression, any file can be eventually compressed to 0 bytes.
BARF tries 257 compression algorithms, numbered 1 through 257, then picks the best one. Files compressed by method 1 have a .x extension. This algorithm is:
if the input is the i'th (1-14) file of the Calgary corpus then encode it as a 1 byte file with byte value i-1 else compress using a byte oriented LZ77 code with a 1 byte header of 255: Byte x = 0-31 means that the next x bytes represent themselves Byte x = 32-255 represents a copy of the two bytes from x-32 places backBARF tries this method first. If the result is not smaller than the original file, then it applies one of the remaining methods 2-257. Method n is as follows:
if the first byte is n-2 then remove itMethod n (2-257) is indicated by an extension of the form .x[0-9][a-z] as a base 26 encoding of n-2. For example, method 2 (.x0a) removes a leading 0 byte, method 3 (.x0b) removes a leading 1 byte, up through method 257 (.x9v) which removes a leading 255 byte.
Although BARF does not compress every file to one byte on the first pass, it can be easily modified to do this on any benchmark of up to 255 files. Instructions are found in the source code.
Of course, a properly designed test, such as the Calgary corpus compression challenge is not susceptable to such tricks. The decompression program (about 1 MB for barf.exe) is added to the size of the compressed files, and the file names (which can get very long) also have to be stored by packing everything into an archive.
Last updated Sept. 27, 2003 by Matt Mahoney