On 7/20/2016 1:58 PM, Thomas Richter wrote:
> Am 18.07.2016 um 23:37 schrieb Eli the Bearded:
>> They claim 22% reduction on average and decently fast speeds. I built a
>> copy and threw some files at it and got 21% reduction, with verified
>> lossless decompress.
>> I'm posting here in hopes of soliciting comments from people more
>> knowledgeable than myself.
> This looks like another attempt on JPEG recompression. This has been
> seen before, for example in tools like packJPG, or the stuffIt JPEG
> compression. Up to now, attempts to commercialize such tools, as useful
> as they might seem, have not had much success.
personally, I haven't had too much of an issue with JPEG's file sizes.
> Even within the JPEG standard itself, (smaller) improvements are
> possible. So for example, one can losslessly transcode a Huffman-coded
> JPEG file to an arithmetically coded one, and convert it back. I would
> expect an improvement of about 10% on average for this simple method.
IME, feeding a JPEG through a bitwise arithmetic coder generally gets
gains in this range as well (~ 5-10%).
generally I have considered it not particularly worthwhile, given the
relatively steep cost this has in terms of decoder performance.
my own JPEG encoders tend to use trellis quantization, which tends to
both improve compression and make decoding slightly faster at an
otherwise similar quality level. another trick (for improving decode
speed) is detecting certain special cases (ex: no coefficients past a
certain point), and using alternate versions of the inverse quantization
and IDCT transform which omit these coefficients.
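roughly, the dispatch looks something like the following (a rough sketch
only; the function names and the cutoff index are made up for
illustration, and in a real decoder the EOB position from entropy
decoding gives the last-coefficient index for free):

#include <stdint.h>

extern const uint8_t zigzag[64];   /* zigzag scan order */
extern void idct_dc_only (const int16_t *coef, uint8_t *out);
extern void idct_low4x4  (const int16_t *coef, uint8_t *out);
extern void idct_full8x8 (const int16_t *coef, uint8_t *out);

void block_idct_dispatch(const int16_t *coef, uint8_t *out)
{
    /* find the last nonzero coefficient in zigzag order */
    int last = 63;
    while (last > 0 && coef[zigzag[last]] == 0)
        last--;

    if (last == 0)
        idct_dc_only(coef, out);    /* flat block: only the DC term */
    else if (last < 10)
        idct_low4x4(coef, out);     /* only low-frequency coefficients */
    else
        idct_full8x8(coef, out);    /* general case */
}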
potential improvements to the format I can think of are mostly things
like a slightly tweaked VLC coding, and using a more advanced DC
predictor. maybe a few "special case" features (mostly involving block
coding, such as explicitly subsampled blocks).
in the past, I have noted it is pretty hard to get JPEG-like designs
decoding much faster than about 80-100 megapixels/second (faster may
exist, dunno, this is just the fastest I have gotten them).
like, they are nice/elegant, but this is a limiting factor.
all in all though, it is still faster than what I have typically managed
with PNG-like designs.
big factors I think tend to be the number of entropy-coded symbols, and
the tendency of "scanlines" to have a harder time fitting into cache, as
opposed to the small heavily-reused buffers used for JPEG-style block
decoding. though, yes, PNG-like designs are intuitively pretty simple,
and seem like they should be pretty fast; this hasn't really matched my
experiences.
> The transcoded file can be used to recover the original without loss,
> and surprisingly, it even conforms to the JPEG standard. However, I
> would not expect that many decoders will actually be able to recover an
> image from the AC-coded file, so transcoding it back to Huffman-coded
> data is likely the most compatible option.
generally it mostly matters that the original can be recovered, I think,
and in general JPEG works pretty well as a "general purpose" format.
my own recent codec efforts have mostly been focusing on speed and
specialized use-cases, rather than on getting the best possible
quality/bitrate, or making something which competes directly with JPEG
in the general-purpose use-case (sort of like how both PNG and JPEG have
certain things they do well, but neither can fully address the other's
use-cases).
for example, after resurrecting my previously stalled-out BTIC4B effort,
mostly re-purposing it for texture compression (wanting reasonably fast
decoding for quick load times), I have single-thread RGBA decode speeds
of often around 250-400 megapixels/second (depends some on the image and
settings), which is a fair bit faster than I have been able to get from
a JPEG decoder. though image quality and bitrate are a bit more
hit-and-miss vs JPEG, results are usually "not drastically worse".
currently, the RGBA/BGRA/RGBx/BGRx decode paths are the fastest paths,
with BGRx being the fastest among these (note that the format supports
an alpha channel, and RGBA/BGRA will try to decode alpha information).
in some cases, the BGRx path can reach memcpy speeds.
in a current test with my (still in development) HDR / float16 decode
path, I am getting around 150-200 megapixels/second (to R11_G11_B10). I
also have basically-working BC7 and S3TC decode paths (which can also
generate mipmapped output, at some speed cost), which are currently
still a little slower than the RGB decode paths.
still, TBD, I may also go add a BC6H decode path.
IME, WRT quality/bitrate, it does a little better than JPEG at lower
quality levels (most likely due to Paeth prediction for the block color
vectors, vs using the prior value as a predictor; at lower quality
levels the color-vector or DC coefficients tend to dominate over pixel
bits or AC coeffs), though it typically does worse at higher quality levels.
while initially the cost of the block-level Paeth predictor was pretty
steep, I ended up using the trick of having the Y prediction also select
the U and V predictors. effectively the color vector works as 2 vectors
((Y, U, V), (Dy, Du, Dv)), with Y and Dy working as the primary predictors.
prediction accuracy still seems pretty close to that of selecting each
component independently. in my tests, the 2nd and 3rd place (non-Paeth)
predictors were (3A+3B+2C)/8 and (3A+3B-2C)/4, which curiously both
seemed to do better on average than (A+B)/2. none did quite as well as
Paeth, but all generally did better than always using the last value.
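in code, the Paeth selection plus the shared-selector trick look roughly
like this (a minimal sketch; the names, types, and neighbor layout
(A=left, B=above, C=above-left) are illustrative, not from the actual
codec):

#include <stdlib.h>

/* classic Paeth: pick whichever of A/B/C is closest to A+B-C */
static int paeth_select(int a, int b, int c)
{
    int p = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return 0;   /* choose A (left)       */
    if (pb <= pc)             return 1;   /* choose B (above)      */
    return 2;                             /* choose C (above-left) */
}

/* predict a (Y, U, V) color vector from the neighboring blocks'
   vectors; only the Y values drive the selection, so U and V reuse
   the choice rather than paying for two more selections. */
static void predict_yuv(const int a[3], const int b[3], const int c[3],
                        int pred[3])
{
    int sel = paeth_select(a[0], b[0], c[0]);
    for (int i = 0; i < 3; i++)
        pred[i] = (sel == 0) ? a[i] : (sel == 1) ? b[i] : c[i];
}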
compression is working out a bit better than my original estimates,
though still falling short of clearly "beating" JPEG. there is a
drawback in that the codec is currently a bit more complicated than JPEG.
some common features and differences:
  both use 8x8 blocks and chroma-subsampled YUV;
  it uses fixed-format blocks and interpolation,
    rather than DCT and a VLC coding;
  use of STF2+AdRiceLL rather than static Huffman
    (though, most values use raw AdRiceLL);
  use of a little-endian bitstream
    (this allows for cheaper bitstream operations;
    most bitstream operations are branch-free,
    but this avoids needing to byte-swap;
    see the bit-reader sketch below).
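for example, a minimal little-endian bit-reader can look something like
this (a rough sketch, not the actual code; assumes a little-endian host,
reads of up to 31 bits, and sufficient padding at the end of the buffer):

#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *cs;  /* current input position */
    uint64_t win;       /* bit window, LSB-first */
    int pos;            /* bits already consumed from the window */
} BitStream;

/* peek at the next n bits (n < 32) without consuming them */
static uint32_t bs_peek(BitStream *bs, int n)
{
    return (uint32_t)(bs->win >> bs->pos) & ((1u << n) - 1);
}

/* consume n bits and refill the window; branch-free */
static void bs_skip(BitStream *bs, int n)
{
    bs->pos += n;
    bs->cs  += bs->pos >> 3;      /* advance whole bytes */
    bs->pos &= 7;
    memcpy(&bs->win, bs->cs, 8);  /* on a little-endian host, no byte-swap */
}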
STF2 is similar to my prior SMTF trick, but is a little bit faster
because it eliminates the use of branches. instead, a symbol index I is
swapped with 7I/8. experimentally, this seemed to be the most effective
(vs either 3I/4 or 15I/16). also, unlike SMTF, there is no need for the
table to be able to rotate (my prior SMTF design required several
branches and zero-extended arithmetic to rotate a table).
note that "STF" here stands for "Swap Towards Front". a prior STF
swapped I with I-1, but needed to make sure I wasn't 0, whereas STF2
avoids this check (it just swaps index 0 with itself).
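in code, the STF2 update is something like the following (a rough
sketch; the 256-entry table size and identity initialization are just
for illustration):

static unsigned char stf_tab[256];  /* rank -> symbol, initially identity */

/* decode: map rank index i to a symbol, then swap the entry towards
   the front (with index 7*i/8); index 0 swaps with itself, so no
   check for i==0 is needed. */
static int stf2_lookup(int i)
{
    int j = (i * 7) >> 3;
    int sym = stf_tab[i];
    stf_tab[i] = stf_tab[j];
    stf_tab[j] = sym;
    return sym;
}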
the AdRiceLL coder basically uses length-limited Rice codes, with the K
value updated dynamically based on the unary-coded quotient Q. this
whole process is driven by a lookup table (thus it isn't that much
different from a typical table-driven Huffman coder, except that the
lookup tables tend to be constant).
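logically, ignoring the table-driven part, the decoder works roughly
like this (a sketch only; the length limit of 12, the 16-bit raw escape,
and the exact K-update rule here are assumptions, and bs_peek/bs_skip
are the helpers from the bit-reader sketch above):

static uint32_t adrice_decode(BitStream *bs, int *k)
{
    uint32_t v;
    int q = 0;

    /* unary-coded quotient Q, length-limited */
    while (q < 12 && bs_peek(bs, 1)) {
        bs_skip(bs, 1);
        q++;
    }

    if (q < 12) {
        bs_skip(bs, 1);                         /* terminating 0 bit */
        v = ((uint32_t)q << *k) | bs_peek(bs, *k);
        bs_skip(bs, *k);
    } else {
        v = bs_peek(bs, 16);                    /* escape: raw value */
        bs_skip(bs, 16);
    }

    /* adapt K from Q: long quotients push K up, Q==0 pulls it down */
    if (q == 0 && *k > 0)
        (*k)--;
    else if (q > 1)
        (*k)++;
    return v;
}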
for deltas, AdRiceLL is used by itself, as while AdRice with an
extra-bits suffix can give better compression, it is also slower (and an
exponential length-limiting scheme can serve a roughly similar effect,
even if less optimally).
the STF2+AdRiceLL combination is pretty close in speed and compression
to static Huffman (though typically it does slightly worse on both
fronts, on average), but it is adaptive, and its low per-context cost
tends to make contexts fairly cheap, which tends to be a net win.
well, and also I can understand how it works.
it is possible to feed the entropy-coded data through a bitwise
range-coder for some additional compression, but this tends to come at a
pretty steep cost in terms of speed.
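the back-end in question is the usual sort of binary range coder; the
decode side of one bit looks roughly like this (a generic LZMA-style
sketch, not the codec's actual back-end; *prob is the adaptive
probability of a 0 bit, out of 2048):

#include <stdint.h>

typedef struct {
    uint32_t range, code;
    const uint8_t *src;
} RangeDec;

static int rd_bit(RangeDec *rd, uint16_t *prob)
{
    uint32_t bound = (rd->range >> 11) * (*prob);
    int bit;

    if (rd->code < bound) {
        rd->range = bound;
        *prob += (2048 - *prob) >> 5;   /* 0 became more likely */
        bit = 0;
    } else {
        rd->code  -= bound;
        rd->range -= bound;
        *prob -= *prob >> 5;            /* 1 became more likely */
        bit = 1;
    }

    while (rd->range < (1u << 24)) {    /* renormalize */
        rd->range <<= 8;
        rd->code = (rd->code << 8) | *rd->src++;
    }
    return bit;
}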
note, a version of the format I am working on is here:
and a basic spec for the format: