Comparing various methods for saving and loading NumPy arrays

NumPy is a Python library that can efficiently manipulate large multi-dimensional arrays. This recommends it in all sorts of scenarios where the core of an algorithm can be expressed as a sequence of operations on arrays and matrices, in particular in mathematics, science and engineering. One common problem when handling such large data is how to quickly save and load the arrays to and from disk. This article compares several ways of doing that for large two-dimensional arrays.

The data and benchmarking

The data was taken from Kaggle’s Digit Recognizer competition and consists of a 73 MB CSV file. This file is read and stored in memory as a 42,000 by 785 two-dimensional NumPy array of integers.

The code that ran the tests can be found here.

The test was run on an Amazon Web Services m1.medium instance (3.75 GB RAM) running Ubuntu 12.04 Server 32-bit.

cProfile was used to measure the time taken to write and load the data. cProfile is a profiler: a program that describes the run-time performance of another program, providing a variety of statistics. Note that the profiler adds a small overhead to the execution time of the code being measured.

import cProfile
import pstats

def profile_code(cmd, no_iter=7, pfile=None):
    '''Profiles the code (cmd) no_iter times and returns the average run time.'''
    if pfile is None:
        pfile = 'profiles/profile_' + cmd[:cmd.find('(')]    # dirty hack: name the stats file after the function
    results = []
    for i in range(no_iter):
        cProfile.run(cmd, pfile)
        p = pstats.Stats(pfile)
        results.append(p.total_tt)
        print p.total_tt
    # discard min and max as outliers, then average the rest
    avg_time = (sum(results) - min(results) - max(results)) / (no_iter - 2)
    print pfile, avg_time
    return avg_time
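As a quick usage sketch (the command string and paths below are placeholders, not from the original post): cProfile.run evaluates the command in the __main__ namespace, so the profiled function must be defined there, and the profiles/ directory must already exist:

import os

if not os.path.exists('profiles'):
    os.makedirs('profiles')    # profile_code writes its stats files under profiles/

# 'train.csv' is a placeholder path; read_csv is defined in the next section
avg = profile_code("read_csv('train.csv')")
print avg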

Approaches

csv files

This is the “brute force” approach – no external libraries, just write the data to a file and then read that file later. Here’s the code for that:

import collections
import numpy

def read_csv(file_path, has_header=True, verbose=True):
    '''Reads a csv file with all fields numerical (int).'''
    if verbose: print "reading %s" % file_path
    with open(file_path, 'r') as f:
        if has_header: f.readline()
        data = []
        for line in f:
            line = line.strip().split(',')
            data.append(map(int, line))
        return numpy.array(data)

def write_csv(file_path, data, header=None, verbose=True):
    '''Writes the data to file.'''
    if verbose: print "writing %s" % file_path
    with open(file_path, 'w') as f:
        if header:
            f.write(str(header) + '\n')
        if isinstance(data[0], collections.Iterable):
            for line in data:
                f.write(','.join(map(str,line)) + '\n')
        else:
            f.write('\n'.join(map(str, data)))
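For reference, here is what a minimal round trip with these helpers might look like ('example.csv' is a placeholder file name used only for this sketch):

import numpy

data = numpy.arange(12).reshape(3, 4)    # small stand-in for the 42,000 by 785 array
write_csv('example.csv', data, header='c0,c1,c2,c3')
restored = read_csv('example.csv')       # the header line is skipped by default
assert (restored == data).all()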

pickle

pickle is the standard Python serialization module. To serialize means to convert an object or data structure into a form that can be stored (in a file, in memory, transmitted over the network, etc.) and reconstructed later in exactly the same form. In our case, we can serialize the numpy array, save it to file, then load it back directly as a numpy array.

import pickle

def save_pickle(file_path, obj):
    '''Saves an object using pickle.'''
    with open(file_path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(file_path):
    '''Loads an object using pickle.'''
    with open(file_path, 'rb') as f:
        return pickle.load(f)

All the magic happens in pickle.dump(obj, f), which serializes the object and writes it to the open file f. The object can be loaded back later with pickle.load(f).
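A minimal round trip might look like the sketch below ('digits.pkl' is a placeholder name). As a commenter points out at the end of the post, dump also accepts a protocol argument; the default protocol is kept for backward compatibility, and a binary protocol is typically faster and more compact:

save_pickle('digits.pkl', data)
restored = load_pickle('digits.pkl')

# variant using the most efficient protocol available:
with open('digits.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)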

joblib from scikit-learn

scikit-learn is a machine learning library for Python that ships its own version of pickle, joblib, optimized for objects holding large NumPy arrays. The syntax is similar to that of pickle:

from sklearn.externals import joblib

def save_joblib(file_path, obj):
    '''Saves an object using joblib.'''
    joblib.dump(obj, file_path)

def load_joblib(file_path):
    '''Loads a model that was saved with joblib.'''
    return joblib.load(file_path)
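Usage mirrors the pickle helpers ('digits.joblib' is a placeholder name). Note that for large arrays, joblib may write the array buffers into separate companion .npy files next to the main file:

joblib.dump(data, 'digits.joblib')      # may create companion .npy files for large arrays
restored = joblib.load('digits.joblib')
assert (restored == data).all()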

NumPy’s save and load functions

NumPy itself has functions for saving and loading arrays in its binary .npy format.

import numpy

def save_numpy(file_path, obj):
    '''Saves an object using numpy.save.'''
    numpy.save(file_path, obj)

def load_numpy(file_path):
    '''Loads an object that was saved with numpy.save.'''
    return numpy.load(file_path)
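One small gotcha worth illustrating: numpy.save appends the .npy extension when the given path lacks one, so the matching load call must use the full file name ('digits' is a placeholder):

numpy.save('digits', data)              # actually writes 'digits.npy'
restored = numpy.load('digits.npy')     # load expects the real file name
assert (restored == data).all()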

SciPy savemat and loadmat functions

SciPy is a library for mathematics, science, and engineering built on top of NumPy. It has several functions available for reading data from and writing data to disk. The code below uses the savemat and loadmat functions, which save and load data in MATLAB’s .mat format. savemat expects a dictionary mapping variable names to arrays, which is why the array is wrapped as {'x': obj}.

import scipy.io

def save_scipy_matlab(file_path, obj):
    '''Saves an object using scipy.io.savemat.'''
    scipy.io.savemat(file_path, {'x':obj})

def load_scipy_matlab(file_path):
    '''Loads an object that was saved with scipy.io.savemat.'''
    return scipy.io.loadmat(file_path)['x']
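A short usage sketch ('digits.mat' is a placeholder name). loadmat returns a dictionary, which is why the helper above indexes it with 'x'; besides the stored variables, the dictionary also carries metadata entries such as '__header__':

save_scipy_matlab('digits.mat', data)      # stored under the MATLAB variable name 'x'
restored = load_scipy_matlab('digits.mat')
assert (restored == data).all()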

Using HDF5

HDF5 was developed at the National Center for Supercomputing Applications as a data model, library, and file format for storing and managing data efficiently. h5py is a Python interface for HDF5.

import h5py
import numpy

def save_hdf5(file_path, obj):
    '''Saves an object using HDF5.'''
    f = h5py.File(file_path, 'w')
    f.create_dataset('MyDataset', data=obj)
    f.close()

def load_hdf5(file_path):
    '''Loads an object that was saved using HDF5.'''
    f = h5py.File(file_path, 'r')
    x = numpy.array(f['MyDataset'].value)
    f.close()
    return x
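A round-trip sketch ('digits.h5' is a placeholder name). One HDF5 advantage not exercised by this benchmark is partial reads: a dataset can be sliced on disk without loading the whole array into memory:

save_hdf5('digits.h5', data)
restored = load_hdf5('digits.h5')
assert (restored == data).all()

# read just a slice without loading everything:
f = h5py.File('digits.h5', 'r')
first_rows = f['MyDataset'][:10]    # only the first 10 rows are read from disk
f.close()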

Results

Method    Reading time (sec)    Writing time (sec)    Space on disk (MB)
csv             27.60 *              420.34                  74
pickle          11.01                 66.50                 500
joblib           0.60                  4.58                 126
numpy            0.57                  4.43                 126
scipy            0.64                  5.75                 126
hdf5             1.14                  4.15                 126

* using numpy.loadtxt(): 22.08 sec

Conclusions

The manual CSV read/write approach is by far the slowest, but it is the most efficient in terms of disk space. pickle has a huge space overhead. NumPy, SciPy, joblib and HDF5 are the recommended options for quickly saving and loading data, all with comparable times and disk usage. The 126 MB figure is no coincidence: it is simply the raw binary size of the array, 42,000 × 785 four-byte integers ≈ 126 MB, so these formats store the data essentially without overhead.

One last point: the results may differ slightly depending on the system the test is run on, but the general conclusions should still hold.

2 Replies to “Comparing various methods for saving and loading NumPy arrays”

  1. Hi, nice comparison. But in the case of pickle you should specify a more efficient protocol, because the default one is there just for backward compatibility. So: pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL). I expect the results to change a lot. Note that you do not need to specify the protocol parameter in pickle.load().
