Entropy Reduction

"the essential thing in metabolism is that the organism succeeds in freeing itself from all the entropy it cannot help producing while alive..."
Erwin Schrödinger, 1944

Entropy reduction is the mechanism whereby life consumes energy to create order out of its surroundings. The design philosophy of Entropy Reduction Algorithmics follows a similar ethos: we endeavor to develop mathematical solutions expressed in orderly, well-designed, and energy-efficient (read: fast, parallelized) code. Please visit the blog for our thoughts on performant scientific computing, new developments in science, and whatever else strikes our interest. ERA is the sole proprietorship of Robert A. McLeod, registered in the province of British Columbia, Canada.



NumExpr 3 Virtual Machine

Python is a popular language for scientific computing and data analysis, among other applications, but it suffers in performance. NumPy is the de facto standard for vector-based computation in Python. However, NumPy evaluates expressions one operation at a time, allocating a full-size temporary array for each intermediate result. With NumExpr, a series of instructions, or a program, can be executed in parallel on a C-based virtual machine which supports a subset of the Python language. NumExpr also takes advantage of SIMD vectorized instructions on modern CPUs to further improve speed. Calculations can be performed on NumPy arrays with a thread pool, with broadcasting. The virtual machine, written largely in C, uses L1 cache-sized blocks to store intermediate temporary results. The use of small temporary arrays reduces memory consumption and relieves memory-bandwidth limitations. As the principal developer for the major 3.0 refactor, I added a number of major new features. A summary may be found at Introduction to NumExpr-3 (Alpha).
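The cache-blocking idea can be sketched in pure NumPy. This is only an illustration of the strategy, not NumExpr's actual C implementation; with NumExpr itself the whole expression would be a single `ne.evaluate("2*a + 3*b")` call:

```python
import numpy as np

def blocked_eval(a, b, block=4096):
    # Evaluate 2*a + 3*b block by block so each temporary fits in L1
    # cache, instead of allocating full-size intermediate arrays the
    # way a plain NumPy expression does.
    out = np.empty_like(a)
    for start in range(0, a.size, block):
        sl = slice(start, start + block)
        tmp = a[sl] * 2.0        # small temporary, one block long
        tmp += b[sl] * 3.0       # accumulate in place, no second full array
        out[sl] = tmp
    return out

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)
result = blocked_eval(a, b)
assert np.allclose(result, 2 * a + 3 * b)
```

The blocking keeps the working set inside the fastest cache level, which is where most of NumExpr's speedup over naive NumPy temporaries comes from.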

NumExpr @ Github
NumExpr @ ReadTheDocs

Work on NumExpr 3.0 was supported by NumFOCUS.

MRCZ Fast Compression

MRCZ is an extension of the venerable MRC image file format, which is widely used as a portable and open-source data container in microscopy. MRC does not feature any support for data compression, which has become a challenge as microscopy is one of the biggest data generators on most university campuses. Techniques such as light-sheet microscopy and cryo-electron microscopy can generate terabytes of data every day. With spinning storage typically costing in the range of US$10/TB/month, the cost can easily overwhelm budgets. Furthermore, the sheer amount of file and network IO can slow data acquisition.

To mitigate these problems we developed an extension using the blosc meta-compression library. blosc supports a range of compression codecs and lossless filtering. Using a combination of zStandard compression and a bit-shuffling filter, we found we could obtain compression ratios of 10:1. Furthermore, blosc is threaded and blocked, so it was able to compress and write a 4.5 GB data file to disk in 0.8 s. In comparison, writing the uncompressed data to disk takes ~15 s. We also added asynchronous read and write operations to the Python version of the library, so that files can be saved in a background thread with the GIL released during compression and file operations. As automated data collection on a microscope can take an image every 60 s, the savings in IO time significantly improves the image stack collection rate.
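To see why the bit-shuffle filter helps, here is a minimal pure-NumPy sketch of it. The real MRCZ path does this in threaded C inside blosc and pairs it with zStandard; the stdlib's zlib stands in here purely for illustration:

```python
import zlib
import numpy as np

def bitshuffle(arr):
    # Transpose the bit matrix: collect bit-plane 0 of every element,
    # then bit-plane 1, and so on. Slowly-varying detector data then
    # yields long runs of identical bits in the high planes, which
    # generic codecs compress far better than the interleaved original.
    itembits = 8 * arr.dtype.itemsize
    bits = np.unpackbits(arr.view(np.uint8)).reshape(-1, itembits)
    return np.packbits(bits.T)

def bitunshuffle(packed, dtype, count):
    # Exact inverse of bitshuffle, showing the filter is lossless.
    itembits = 8 * np.dtype(dtype).itemsize
    bits = np.unpackbits(packed)[: itembits * count]
    return np.packbits(bits.reshape(itembits, count).T).view(dtype)

rng = np.random.default_rng(42)
# Smooth signal plus mild noise, loosely imitating detector counts
data = (1000.0 + 50.0 * np.sin(np.linspace(0.0, 20.0, 1 << 16))
        + rng.normal(0.0, 2.0, 1 << 16)).astype(np.uint16)

plain = len(zlib.compress(data.tobytes(), 6))
shuffled = len(zlib.compress(bitshuffle(data).tobytes(), 6))
print(f"plain: {plain} B, bit-shuffled: {shuffled} B")

restored = bitunshuffle(bitshuffle(data), np.uint16, data.size)
assert np.array_equal(restored, data)   # round trip is lossless
```

Because the noisy low bit-planes are isolated from the nearly constant high ones, the codec only "pays" for the bits that actually carry entropy.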

MRCZ @ Github
MRCZ manuscript

Zorro Image Stacking

In cryo-electron microscopy, proteins are embedded in a thin film of vitreous ice so they can be imaged in the transmission electron microscope. Proteins are extremely radiation sensitive, such that only ~20 electrons are recorded per detector pixel. Direct electron detectors are CMOS-based systems that can read out at 400 Hz, counting individual electrons as they arrive.

However, the proteins drift substantially over the twenty-second exposure. Therefore one fractionates the exposure into a movie, and Zorro aims to register the drift in frames with an average of one electron per pixel. To do this it uses a combination of multiple-reference cross-correlation with weighted basin-hopping minimization of the resulting matrix of drift guesses. It also dynamically filters images to maximize the cross-correlation of each image pair in the stack.
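The core ingredient, cross-correlation registration of a frame pair, can be illustrated with a minimal phase-correlation routine in NumPy. This is a sketch of the general technique only: Zorro itself uses multiple references, per-pair filtering, sub-pixel peak fitting, and basin-hopping over the full shift matrix:

```python
import numpy as np

def register_shift(ref, img):
    """Estimate the integer (dy, dx) translation of `img` relative to
    `ref` from the phase of their cross-power spectrum."""
    R = np.fft.fft2(img) * np.conj(np.fft.fft2(ref))
    R /= np.abs(R) + 1e-12            # whiten: keep phase, drop amplitude
    corr = np.fft.ifft2(R).real       # correlation peak sits at the shift
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Fold peak coordinates into signed shifts (circular wrap-around)
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(7)
ref = rng.normal(size=(64, 64))
img = np.roll(ref, shift=(5, -3), axis=(0, 1))   # apply a known drift
print(register_shift(ref, img))  # → (5, -3)
```

In real low-dose frames the correlation peak is buried in shot noise, which is why averaging over multiple references and filtering each image pair, as described above, is needed before the drift estimates become reliable.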

The package also includes a PySide GUI for controlling pipeline processing during automated data collection. When run in this mode, it can enable a single workstation to align terabytes of microscopy data every 24 hours.

Zorro @ Github

A description of the algorithms used in Zorro is published in the Journal of Structural Biology; an author's preprint can be downloaded here:
Zorro pre-print


Robert A. McLeod earned a Ph.D. in condensed matter physics from the University of Alberta and has worked professionally as an electron microscopist for the CEA (Commissariat à l'énergie atomique et aux énergies alternatives, Grenoble, FR), where he worked with scientists on semiconductors and energy systems, and at Universität Basel (Basel, CH), where, in collaboration with ETH Zürich, he worked on structure determination of proteins by cryogenic electron microscopy. Robert has a strong interest in algorithm design, optimization problems, and high-performance computing. In his spare time Robert enjoys whitewater kayaking and backcountry skiing. He also volunteers his time to teach whitewater kayaking. He is not so adept at skiing.