gzip - How do the compression codecs work in Python?


I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to my log files. I'm having some problems with it, though.

My code looks like this:

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'null', f2 or 'null', f3))

However, the output file has a size of 1,409,780 bytes. Running bunzip2 on that file results in a file of 943,634 bytes, and running bzip2 on that results in 217,275 bytes. In other words, the uncompressed file is smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I also tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if that fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
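(One likely explanation for that error: the 'zip' codec is an alias for zlib_codec, so it writes a raw zlib stream rather than an actual gzip file, and the gzip tool checks for the 1f 8b magic bytes at the start of the file. A minimal sketch of the difference, just for illustration:)

import zlib, gzip, io

data = b"example log line\n" * 100

# zlib.compress() produces a bare zlib stream (no gzip header), so gzip rejects it.
print(zlib.compress(data)[:2].hex())        # typically '789c'

# A real gzip member starts with the magic bytes 1f 8b.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(data)
print(buf.getvalue()[:2].hex())             # '1f8b'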


EDIT: I originally had the file opened in append mode, not write mode. While that may or may not be part of the problem, the question still holds if the file is opened in 'w' mode.

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead, it encodes every snippet of data fed to the write method as a separate compressed block. This is horribly inefficient, and a terrible design decision for a library meant to work with streams.
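You can see the effect in isolation with a rough sketch like this (not the codecs machinery itself, just bz2.compress applied per line versus once over the whole buffer, which is effectively the difference between one compressed block per write and one incremental stream):

import bz2

lines = [('%s %s %s %s\n' % (i, 'foo', 'bar', 'null')).encode() for i in range(10000)]

# One compressed block per write(): every tiny chunk pays the bz2 header/overhead.
per_write = sum(len(bz2.compress(line)) for line in lines)

# One stream over all the data: the overhead is paid once and bz2 can actually find redundancy.
one_stream = len(bz2.compress(b''.join(lines)))

print(per_write, one_stream)   # the per-write total comes out far larger, often bigger than the raw data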

The ironic thing is that there is a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class that does the correct thing automatically.

import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        # Feed data through the incremental compressor; write whatever it emits
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Flush the compressor's internal buffer, then the underlying file
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')
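With that class, the write loop from the original question works essentially unchanged (a sketch; cursor and archive_file are assumed from the question, and on Python 3 the formatted line has to be encoded because BZ2Compressor.compress takes bytes):

for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'null', f2 or 'null', f3)
    log_file.write(line.encode())   # bytes in, compressed bytes out to the underlying file
log_file.close()                    # close() flushes the compressor's final block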

A caveat: in this example, I've opened the file in append mode. Appending multiple compressed streams to a single file works fine with bunzip2, but Python can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
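If you do keep a single stream per file, reading it back from Python is straightforward (a minimal sketch using bz2.BZ2File; archive_file is the filename from the question):

import bz2

with bz2.BZ2File(archive_file, 'r') as f:    # assumes one bz2 stream in the file
    for line in f:
        print(line.decode().rstrip())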

