gzip - How do the compression codecs work in Python?
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
    import codecs

    log_file = codecs.open(archive_file, 'w', 'bz2')
    for id, f1, f2, f3 in cursor:
        log_file.write('%s %s %s %s\n' % (id, f1 or 'null', f2 or 'null', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a "gzip: archive_file: not in gzip format" error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file is opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a separate compressed block. This is horribly inefficient, and a terrible design decision for a library designed to work with streams.
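To get a feel for the size of the effect, here is a rough sketch (Python 3, standard library only) that compresses the same lines two ways: each line as its own complete bz2 stream, which is roughly what per-write encoding amounts to, and all lines through a single incremental compressor. The sample rows are made up for illustration and are not the data from the question.

```python
import bz2

# Made-up sample rows, standing in for the database results.
lines = [("%d alpha beta gamma\n" % i).encode("ascii") for i in range(2000)]
raw_size = sum(len(line) for line in lines)

# Per-write compression: every line becomes a complete bz2 stream of its own.
per_write = b"".join(bz2.compress(line) for line in lines)

# Incremental compression: one stream shared by all lines.
encoder = bz2.BZ2Compressor()
incremental = b"".join(encoder.compress(line) for line in lines) + encoder.flush()

print(raw_size, len(per_write), len(incremental))
```

With short lines like these, the per-line output should come out larger than the raw input, because every tiny stream carries its own header and tables, while the incremental output should be much smaller.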
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
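    import bz2

    class BZ2StreamEncoder(object):
        def __init__(self, filename, mode):
            self.log_file = open(filename, mode)
            # One compressor for the whole file, so all writes share a stream.
            self.encoder = bz2.BZ2Compressor()

        def write(self, data):
            # compress() may return an empty string while it buffers input.
            self.log_file.write(self.encoder.compress(data))

        def flush(self):
            # BZ2Compressor.flush() finishes the stream; the compressor
            # cannot be used for further writes afterwards.
            self.log_file.write(self.encoder.flush())
            self.log_file.flush()

        def close(self):
            self.flush()
            self.log_file.close()

    log_file = BZ2StreamEncoder(archive_file, 'ab')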
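On Python 3, the standard library's `bz2.open` gives much the same single-stream behaviour without a custom class. The following is a minimal sketch, assuming Python 3.3 or later; the rows and the temporary path are placeholders for illustration, not part of the original code.

```python
import bz2
import os
import tempfile

# Hypothetical rows standing in for the database cursor.
rows = [(1, "a", None, "x"), (2, None, "b", "y")]

# Placeholder location for the archive file.
path = os.path.join(tempfile.mkdtemp(), "archive.log.bz2")

# 'wt' writes one compressed stream and accepts str rather than bytes.
with bz2.open(path, "wt", encoding="utf-8") as log_file:
    for id_, f1, f2, f3 in rows:
        log_file.write("%s %s %s %s\n" % (id_, f1 or "null", f2 or "null", f3))

# Read the archive back to confirm that it round-trips.
with bz2.open(path, "rt", encoding="utf-8") as log_file:
    contents = log_file.read()
```

Opening in write mode rather than append mode keeps the file to a single stream.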
A caveat: in this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
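If several streams do end up in one file, one possible workaround is to decode them one at a time. This is a minimal sketch for Python 3, where `decompress_all_streams` is a hypothetical helper name, not something from the standard library. It relies on `bz2.BZ2Decompressor` exposing any bytes that follow the end of a stream through its `unused_data` attribute, and it holds everything in memory, so it only suits files of modest size.

```python
import bz2

def decompress_all_streams(data):
    """Decompress bytes that may hold several concatenated bz2 streams."""
    chunks = []
    while data:
        # Each stream needs a fresh decompressor object.
        decoder = bz2.BZ2Decompressor()
        chunks.append(decoder.decompress(data))
        # unused_data holds whatever followed the end of this stream.
        data = decoder.unused_data
    return b"".join(chunks)

# Two separately compressed streams glued together, as appending would produce.
combined = bz2.compress(b"first run\n") + bz2.compress(b"second run\n")
restored = decompress_all_streams(combined)
print(restored)
```

On Python 3.3 and later, `bz2.decompress` and `bz2.BZ2File` accept multi-stream input directly, so the loop is mainly useful for seeing what happens at the stream boundaries.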