xlutils.filter

This framework is designed to filter and split Excel files using a series of modular readers, filters, and writers all tied together by the process() function. These components are described below, followed by documentation for the currently available readers, filters and writers. Right at the bottom is an example that puts all the various pieces together.

Readers

Readers are components that get data from a source and parse it into a series of xlrd.Book objects before making calls to the appropriate methods on the first filter in the chain.

Because it is usually only the source of the data to be processed that changes, a handy base class is provided for readers.

Here’s an example reader that uses this base class to process the file it is initialised with:

>>> import os
>>> from xlrd import open_workbook
>>> from xlutils.filter import BaseReader
>>> class MyReader(BaseReader):
...
...    def __init__(self, filename):
...       self.filename = filename
...
...    def get_filepaths(self):
...       return (os.path.join(test_files, self.filename),)

If you need to create a more unorthodox reader, such as one that reads its data from a network socket or that needs to pass special parameters to xlrd.open_workbook(), then read the implementation of BaseReader.

Filters

Filters are the important bits of this framework. They are built up in chains to achieve the results required for a particular task. A filter must define certain methods, the full set of these is shown in the example below. The implementation of these methods can do whatever the filter requires, but generally they end up calling the appropriate methods on the next filter.

Here’s an example filter that does nothing but print messages when its methods are called and then call the next filter in the chain:

>>> from __future__ import print_function
>>> class MyFilter:
...
...    def __init__(self,name):
...        self.name = name
...
...    def start(self):
...        print(self.name, 'start')
...        self.next.start()
...
...    def workbook(self,rdbook,wtbook_name):
...        print(self.name, 'workbook', rdbook, wtbook_name)
...        self.next.workbook(rdbook, wtbook_name)
...
...    def sheet(self,rdsheet,wtsheet_name):
...        print(self.name, 'sheet', rdsheet, wtsheet_name)
...        self.next.sheet(rdsheet, wtsheet_name)
...
...    def set_rdsheet(self,rdsheet):
...        print(self.name, 'set_rdsheet', rdsheet)
...        self.next.sheet(rdsheet, wtsheet_name)
...
...    def row(self,rdrowx,wtrowx):
...        print(self.name, 'row', rdrowx,wtrowx)
...        self.next.row(rdrowx, wtrowx)
...
...    def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...        print(self.name, 'cell', rdrowx, rdcolx, wtrowx, wtcolx)
...        self.next.cell(rdrowx, rdcolx, wtrowx, wtcolx)
...
...    def finish(self):
...        print(self.name, 'finish')
...        self.next.finish()

For full details of when each of these methods are called, see the AP documentation for BaseFilterInterface.

Writers

Writers are components that get handle calls from the appropriate methods on the last filter in the chain. It is the writer that actually does the work of copying all the information from the appropriate sources and writing them to the output files.

Because there is a lot of work involved in this and it is only usually the method of writing the binary data to its destination that differs, a handy base class is provided for writers.

Here’s an example writer that just writes the data to a temporary directory:

>>> from xlutils.filter import BaseWriter
>>> our_temp_dir = TempDirectory().path
>>> class MyWriter(BaseWriter):
...
...    def get_stream(self,filename):
...       return open(os.path.join(our_temp_dir, filename), 'wb')

Available Readers

Several reader implementations are included that cover many common use cases:

GlobReader

If you’re processing files that are on disk, then this is probably the reader for you. It returns all files matching the path specification it’s created with. Here’s an example:

>>> from xlutils.filter import GlobReader
>>> r = GlobReader(os.path.join(test_files,'test*.xls'))
>>> sorted([p[len(test_files)+1:] for p in r.get_filepaths()])
['test.xls', 'testall.xls', 'testnoformatting.xls']

All the other functionality is provided by BaseReader:

>>> isinstance(r,BaseReader)
True

XLRDReader

If you want to “save” an xlrd.Book object which you’ve already created in some other way:

>>> from xlrd import open_workbook
>>> wb = open_workbook(os.path.join(test_files, 'testall.xls'))

Then the XLRDReader is likely what you’re after:

>>> from xlutils.filter import XLRDReader
>>> r = XLRDReader(wb, 'foo.xls')

Note

You must supply a filename as shown above as the original filename is not stored in the xlrd.Book object

Most functionality is provided by BaseReader:

>>> isinstance(r, BaseReader)
True

However, its get_workbooks() method just returns the values it was instantiated with:

>>> tuple(r.get_workbooks())
((<xlrd...Book object at ...>, 'foo.xls'),)
>>> tuple(r.get_workbooks())[0][0] is wb
True

To show it working, here we send the book straight to a writer:

>>> from xlutils.filter import DirectoryWriter, process
>>> os.listdir(temp_dir)
[]
>>> process(r,DirectoryWriter(temp_dir))
>>> os.listdir(temp_dir)
['foo.xls']

TestReader

This reader is specifically designed for testing filter implementations with known sets of cells. This example should give a good idea of how to use it:

>>> from mock import Mock
>>> from pprint import pprint
>>> from xlutils.tests.test_filter import TestReader
>>> from xlrd import XL_CELL_NUMBER
>>> r = TestReader(
...         ('Sheet1',(('R0C0','R0C1'),
...                    ('R1C0','R1C1'))),
...         ('Sheet2',(((XL_CELL_NUMBER,0.0),),),)
...     )
>>> c = Mock()
>>> r(c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.cell(0, 1, 0, 1),
 call.row(1, 1),
 call.cell(1, 0, 1, 0),
 call.cell(1, 1, 1, 1),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet2'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

Available Writers

Several writer implementations are included that cover many common use cases:

DirectoryWriter

If you’re processing files that are on disk, then this is probably the writer for you. It stores files in the directory passed to it during creation. Here’s an example:

>>> from xlutils.filter import DirectoryWriter
>>> temp_dir = TempDirectory().path
>>> w = DirectoryWriter(temp_dir)

Most of the functionality is provided by BaseWriter:

>>> isinstance(w, BaseWriter)
True

The get_stream() method makes sure the files end up in the directory specified:

>>> os.listdir(temp_dir)
[]
>>> f = w.get_stream('test.xls')
>>> _ = f.write(b'some \r\n data')
>>> f.close()
>>> os.listdir(temp_dir)
['test.xls']
>>> open(os.path.join(temp_dir,'test.xls'),'rb').read()
b'some \r\n data'

StreamWriter

If you want to write exactly one workbook to a stream, then this is the writer for you:

>>> from tempfile import TemporaryFile
>>> from xlutils.filter import StreamWriter
>>> tf = TemporaryFile()
>>> w = StreamWriter(tf)

Most of the functionality is provided by BaseWriter:

>>> isinstance(w, BaseWriter)
True

The get_stream() method makes sure the excel data is written to the stream provided:

>>> f = w.get_stream('test.xls')
>>> _ = f.write(b'xls data')
>>> _ = tf.seek(0)
>>> tf.read()
b'xls data'

Note

Only one file may be written to a StreamWriter, further attempts will result in an exception being raised:

>>> w.get_stream('test2.xls')
Traceback (most recent call last):
...
Exception: Attempt to write more than one workbook

StreamWriter also doesn’t close any streams passed to it:

>>> tf = TemporaryFile()
>>> process(TestReader(('Sheet1',[['R0C0']])),StreamWriter(tf))
>>> _ = tf.seek(0)
>>> len(tf.read())
5632

XLWTWriter

If you want to change cells after the filtering process is complete then an XLWTWriter can be used to obtain the xlwt.Workbook objects that result:

>>> from xlutils.filter import XLWTWriter
>>> w = XLWTWriter()
>>> process(TestReader(('Sheet1',[['R0C0']])),w)

The objects can then be manipulated and saved as required:

>>> w.output
[('test.xls', <xlwt.Workbook.Workbook object at ...>)]
>>> book = w.output[0][1]
>>> book.get_sheet(0).write(0,1,'R0C1')
>>> temp_dir = TempDirectory()
>>> temp_dir.listdir()
No files or directories found.
>>> book.save(os.path.join(temp_dir.path,w.output[0][0]))
>>> temp_dir.listdir()
test.xls

As with previous writers, most of the functionality is provided by BaseWriter:

>>> isinstance(w, BaseWriter)
True

Available Filters

A selection of filters are included as described below:

BaseFilter

This is a “do nothing” filter that makes a great base class for your own filters. All the required methods are implemented such that they just call the same method on the next filter in the chain.

ColumnTrimmer

This filter will strip columns containing no useful data from the end of sheets. For example:

>>> from xlutils.filter import process, ColumnTrimmer
>>> r = TestReader(
...         ('Sheet1',(('','',''),
...                    ('R1C0','',''),
...                    ('R2C0','R2C1',' \t\r\n'))),
...         ('Sheet2',(('R0C0',),),),
...         ('Sheet3',(('R0C0','',(XL_CELL_NUMBER,0.0)),)),
...     )
>>> c = Mock()
>>> process(r,ColumnTrimmer(),c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.row(1, 1),
 call.row(2, 2),
 call.cell(0, 0, 0, 0),
 call.cell(0, 1, 0, 1),
 call.cell(1, 0, 1, 0),
 call.cell(1, 1, 1, 1),
 call.cell(2, 0, 2, 0),
 call.cell(2, 1, 2, 1),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet2'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet3'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

When sheets are trimmed, a message is also logged to aid debugging:

>>> from testfixtures import LogCapture
>>> l = LogCapture()
>>> process(r, ColumnTrimmer(), c)
>>> print(l)
xlutils.filter DEBUG
  Number of columns trimmed from 3 to 2 for sheet b'Sheet1'
xlutils.filter DEBUG
  Number of columns trimmed from 3 to 1 for sheet b'Sheet3'

The definition of ‘no useful data’ can also be controlled by passing in a function that returns True or False for each value:

>>> def not_useful(cell):
...     if not cell.value or cell.value=='junk': return True
>>> r = TestReader(
...         ('Sheet1',(('R1C0','','junk'),)),
...     )
>>> c = Mock()
>>> process(r, ColumnTrimmer(not_useful), c)
>>> pprint(c.method_calls)
[call.start(),
 call.workbook(<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
 call.sheet(<xlrd.sheet.Sheet...>, 'Sheet1'),
 call.row(0, 0),
 call.cell(0, 0, 0, 0),
 call.finish()]

ErrorFilter

This filter caches all method calls in a file on disk and will only pass them on to filters further down in the chain when its finish method has been called and no error messages have been logged to the python logging framework.

Note

To be effective, this filter must be the last in the chain before the writer!

Here’s an example of how to set one up. We need to be able to see what messages are logged, so we use a LogCapture:

>>> h = LogCapture()

Now, we install the filter:

>>> from xlutils.filter import process,ErrorFilter
>>> f = ErrorFilter()

To show the filter in action, we need a little helper:

>>> import logging
>>> from xlutils.filter import BaseFilter
>>> class Log(BaseFilter):
...    def __init__(self,level):
...        self.level = level
...    def workbook(self,rdbook,wtbook_name):
...        if wtbook_name=='test.xls':
...            logging.getLogger('theLogger').log(self.level,'a message')
...        self.next.workbook(rdbook,wtbook_name)

So, when we have errors logged, no methods other than finish are passed on to the next filter:

>>> c = Mock()
>>> process(MyReader('test.xls'), Log(logging.ERROR), f, c)
>>> len(c.method_calls)
0

As well as the error message logged, we can also see the ErrorFilter logs an error to that that the method calls have not been passed on:

>>> print(h)
theLogger ERROR
  a message
xlutils.filter ERROR
  No output as errors have occurred.
>>> h.clear()

This error message can be controlled when the ErrorFilter is instantiated:

>>> f = ErrorFilter(message='wingnuts! errors have occurred!')
>>> process(MyReader('test.xls'), Log(logging.ERROR), f, c)
>>> print(h)
theLogger ERROR
  a message
xlutils.filter ERROR
  wingnuts! errors have occurred!

However, when no errors are logged, all method calls are passed:

>>> c = Mock()
>>> process(MyReader('test.xls'), Log(logging.WARNING), f, c)
>>> len(c.method_calls)
17

In addition to the logging of error messages, error cells will also cause all methods to be filtered:

>>> from xlrd import XL_CELL_ERROR
>>> r = TestReader(
...         (u'Price(\xa3)',(((XL_CELL_ERROR,0),),)),
...         )
>>> c = Mock()
>>> h.clear()
>>> process(r,ErrorFilter(),c)
>>> len(c.method_calls)
0
>>> print(h)
xlutils.filter ERROR
  Cell A1 of sheet b'Price(?)' contains a bad value: error (#NULL!)
xlutils.filter ERROR
  No output as errors have occurred.

You can also configure the log level at which messages prevent the ErrorFilter from passing its method calls on to the next filter in the chain:

>>> f = ErrorFilter(logging.WARNING)

Now, warnings will cause methods to not be passed on:

>>> c = Mock()
>>> process(MyReader('test.xls'),Log(logging.WARNING),f,c)
>>> len(c.method_calls)
0

But if only debug messages are logged, the method calls will still be passed on:

>>> c = Mock()
>>> process(MyReader('test.xls'),Log(logging.DEBUG),f,c)
>>> len(c.method_calls)
17

An example which may prove useful is how to set up a filter such that if any errors are logged while processing one workbook, that workbook is filtered out but other subsequent workbooks are not filtered out.

This is done by inserting a filter such as the following earlier in the chain:

>>> class BatchByWorkbook(BaseFilter):
...    started = False
...    def start(self): pass
...    def workbook(self,rdbook,wtbook_name):
...       if self.started:
...           self.next.finish()
...       self.next.start()
...       self.next.workbook(rdbook,wtbook_name)
...       self.started = True

Here it is at work, starting with an empty output directory:

>>> temp_dir = TempDirectory().path
>>> os.listdir(temp_dir)
[]

Now test.xls, testall.xls and testnoformatting.xls are processed, but errors are only logged while processing test.xls:

>>> process(
...     GlobReader(os.path.join(test_files,'test*.xls')),
...     BatchByWorkbook(),
...     Log(logging.ERROR),
...     ErrorFilter(),
...     DirectoryWriter(temp_dir)
... )

So, the output directory contains testall.xls, but no test.xls:

>>> sorted(os.listdir(temp_dir))
['testall.xls', 'testnoformatting.xls']

Echo

This filter will print calls to the methods configured when the filter is created along with the arguments passed.

>>> from xlutils.filter import Echo, process
>>> r = TestReader(
...         ('Sheet1',(('R0C0','R0C1'),
...                    ('R1C0','R1C1'))),
...     )
>>> process(r, Echo(methods=('workbook',)), Mock())
workbook:(<...DummyBook...>, 'test.xls')

If True is passed instead of a list of method names, then all methods called will be printed:

>>> process(r, Echo(methods=True), Mock())
start:()
workbook:(<...DummyBook...>, 'test.xls')
sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
row:(0, 0)
cell:(0, 0, 0, 0)
cell:(0, 1, 0, 1)
row:(1, 1)
cell:(1, 0, 1, 0)
cell:(1, 1, 1, 1)
finish:()

If you need to see what’s happening at various points in a chain, you can also give an Echo a name:

>>> process(r,Echo('first'), Echo('second'), Mock())
'first' start:()
'second' start:()
'first' workbook:(<...DummyBook...>, 'test.xls')
'second' workbook:(<...DummyBook...>, 'test.xls')
'first' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
'second' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
'first' row:(0, 0)
'second' row:(0, 0)
'first' cell:(0, 0, 0, 0)
'second' cell:(0, 0, 0, 0)
'first' cell:(0, 1, 0, 1)
'second' cell:(0, 1, 0, 1)
'first' row:(1, 1)
'second' row:(1, 1)
'first' cell:(1, 0, 1, 0)
'second' cell:(1, 0, 1, 0)
'first' cell:(1, 1, 1, 1)
'second' cell:(1, 1, 1, 1)
'first' finish:()
'second' finish:()

MemoryLogger

This filter will dump stats to the path it was configured with using the heapy package if it is available. If it is not available, no operations are performed.

For example, with a MemoryLogger configured as follows:

>>> from xlutils.filter import MemoryLogger
>>> m = MemoryLogger('/some/path', methods=('sheet','cell'))

The equivalent of the following call:

from guppy import hpy; hpy().heap().stat.dump('/some/path')

will be performed whenever the MemoryLogger’s sheet() and cell() methods are called.

A MemoryLogger configured as followed will log memory usage whenever any of the MemoryLogger’s methods are called:

>>> m = MemoryLogger('/some/path',True)

For more information on heapy, please see:

http://guppy-pe.sourceforge.net/#Heapy

Example Usage

Here’s an example that makes use of all the types of components described above to filter out odd numbered rows from an original workbook’s sheets. To do this we need one more filter:

>>> from xlutils.filter import BaseFilter
>>> class EvenFilter(BaseFilter):
...
...     def row(self,rdrowx,wtrowx):
...         if not rdrowx%2:
...             self.next.row(rdrowx,wtrowx)
...
...     def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...         if not rdrowx%2:
...             self.next.cell(rdrowx,rdcolx,wtrowx,wtcolx)

Now we can put it all together with a call to the process() function:

>>> from xlutils.filter import process
>>> process(
...     MyReader('test.xls'),
...     MyFilter('before'),
...     EvenFilter(),
...     MyFilter('after'),
...     MyWriter()
...     )
before start
after start
before workbook <xlrd...Book object at ...> test.xls
after workbook <xlrd...Book object at ...> test.xls
before sheet <xlrd.sheet.Sheet object at ...> Sheet1
after sheet <xlrd.sheet.Sheet object at ...> Sheet1
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before sheet <xlrd.sheet.Sheet object at ...> Sheet2
after sheet <xlrd.sheet.Sheet object at ...> Sheet2
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before finish
after finish

As you can see if you’ve read this far, there’s quite a lot of output, but it’s certainly informative! However, just to be on the safe side, we can see that the output file was actually written:

>>> os.listdir(our_temp_dir)
['test.xls']