Data Feeding Mechanism

General Approach

As briefly mentioned in Quick Start, this library adopts a particular data feeding mechanism: the user provides a function that returns a data generator together with the number of steps, i.e., the number of mini-batches in an epoch.

A reference on Python generators can be found here. Generally, a Python generator is defined in much the same way as a function, except that it uses the yield statement to produce values one at a time. An example of a generator that takes two numpy arrays X, Y and produces mini-batches of size batch_size is given below:

import numpy as np

def example_generator(X, Y, batch_size, no_mb):
    """An example generator that takes 2 numpy arrays X, Y, the batch size and the number of mini-batches

    """

    # assume 1st dimension of X and Y is the number of samples
    N = X.shape[0]

    # this while statement allows this generator to generate data infinitely
    while True:
        for step in range(no_mb):
            start_idx = step*batch_size
            stop_idx = min(N, (step+1)*batch_size) # don't allow index out of range

            x = X[start_idx:stop_idx]
            y = Y[start_idx:stop_idx]

            """
            Potentially some processing steps here
            """

            yield x, y # produce pair (x,y)

Note that after generating data for no_mb steps, the for loop finishes and the while loop starts a new iteration. The sequence of no_mb mini-batches is exactly the same in every iteration of the while loop (i.e., every epoch). This behavior, however, should be avoided when using stochastic gradient descent methods: the data should be generated with some randomness in each epoch. Below is a slight modification of example_generator that shuffles the samples at the start of every epoch:

import numpy as np
import random

def example_generator(X, Y, batch_size, no_mb):
    """An example generator that takes 2 numpy arrays X, Y, the batch size and the number of mini-batches

    """

    # assume 1st dimension of X and Y is the number of samples
    N = X.shape[0]

    # this while statement allows this generator to generate data infinitely
    while True:

        # generate the list of indices and shuffle
        indices = list(range(N))  # list() is needed so random.shuffle can shuffle in place
        random.shuffle(indices)

        for step in range(no_mb):
            start_idx = step*batch_size
            stop_idx = min(N, (step+1)*batch_size) # don't allow index out of range

            x = X[indices[start_idx:stop_idx]]
            y = Y[indices[start_idx:stop_idx]]

            """
            Potentially some processing steps here
            """

            yield x, y # produce pair (x,y)
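
To see the generator in action, one can instantiate it and draw a few mini-batches with next(). Below is a quick sanity check using randomly generated toy arrays in place of real data:

import numpy as np

X = np.random.rand(1000, 20)   # 1000 samples with 20 features
Y = np.random.rand(1000, 1)

batch_size = 128
no_mb = int(np.ceil(X.shape[0] / float(batch_size)))

gen = example_generator(X, Y, batch_size, no_mb)

x, y = next(gen)               # draw the first mini-batch
print(x.shape, y.shape)        # (128, 20) (128, 1)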

Using this definition of the data generator, the user also needs to define a function that returns the data generator and the number of mini-batches. Let us assume that the data is stored on disk in pickled format. We can write a simple data_func as follows:

import pickle
import numpy as np

def data_func(filename):
    """An example of data_func that returns example_generator and the number of mini-batches
    """

    with open(filename, 'rb') as fid:  # pickle files should be opened in binary mode
        data = pickle.load(fid)

    # assume that X, Y is stored as elements in dictionary data
    X, Y = data['X'], data['Y']

    N = X.shape[0] # number of samples
    batch_size = 128 # size of mini-batch
    no_mb = int(np.ceil(N/float(batch_size))) # calculate number of mini-batches

    # get an instance of example_generator
    gen = example_generator(X, Y, batch_size, no_mb)

    # return generator and number of mini-batches
    return gen, no_mb

The above example of data_func takes the path to the data file, performs the data loading, calculates the number of mini-batches and returns an instance of example_generator together with the number of mini-batches.
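
Assuming a file data.pickle exists on disk (the filename here is purely illustrative), the returned pair can then be consumed by calling next() on the generator no_mb times per epoch, which is essentially how the training loop uses it:

gen, no_mb = data_func('data.pickle')

for epoch in range(2):
    for step in range(no_mb):
        x, y = next(gen)  # one mini-batch per step
        # feed (x, y) to the model here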

Since data_func and data_argument will be serialized and written to disk during computation, it is recommended to pass only small parameters, such as a filename, through data_argument. Although it is possible to pass the actual data as data_argument, doing so would incur unnecessary computational overhead.
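
The difference in serialization cost can be illustrated directly with pickle. The numbers below are only indicative and depend on the actual data size:

import pickle
import numpy as np

X = np.random.rand(10000, 784)
Y = np.random.rand(10000, 10)

small_argument = 'data.pickle'            # just a path: a few dozen bytes when pickled
large_argument = (X, Y)                   # the actual arrays: tens of megabytes

print(len(pickle.dumps(small_argument)))  # tiny
print(len(pickle.dumps(large_argument)))  # roughly X.nbytes + Y.nbytes, about 63 MB here

Passing only the filename keeps this serialization step negligible.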

Working Examples

Here we demonstrate how to write data_func and its argument data_argument using the MNIST dataset available in keras.datasets.mnist:

import random

import numpy as np
from keras.datasets import mnist
from keras.utils import to_categorical


def data_func(data_argument):
    """ Data function of mnist for PyGOP models which should produce a generator and the number
    of steps per epoch

    Args:
        data_argument: a tuple of batch_size and split ('train' or 'test')

    Return:
        generator, steps_per_epoch

    """

    batch_size, split = data_argument

    # load dataset from keras datasets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    if split == 'train':
        X = x_train
        Y = y_train
    else:
        X = x_test
        Y = y_test

    # reshape image to vector
    X = np.reshape(X, (-1, 28 * 28))
    # convert to one-hot vector of classes
    Y = to_categorical(Y, 10)
    N = X.shape[0]

    steps_per_epoch = int(np.ceil(N / float(batch_size)))

    def gen():
        while True:
            indices = list(range(N))
            # if train set, shuffle data in each epoch
            if split == 'train':
                random.shuffle(indices)

            for step in range(steps_per_epoch):
                start_idx = step * batch_size
                stop_idx = min(N, (step + 1) * batch_size)
                idx = indices[start_idx:stop_idx]
                yield X[idx], Y[idx]

    # it is important to return the generator object, i.e. gen() with parentheses, not the function gen itself
    return gen(), steps_per_epoch

This code excerpt is taken from our Hand-written Digits Recognition with Mnist dataset example. Here data_argument is a tuple of the hyperparameters needed to generate mini-batches of data, namely the batch_size and the data split. Computation-wise, this is an efficient way to pass data to PyGOP's models, since only these small parameters need to be serialized.
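
As a quick check (assuming Keras can download the MNIST data), the data_func above can be called directly; the shapes follow from reshaping the images to 784-dimensional vectors and one-hot encoding the 10 classes:

train_gen, train_steps = data_func((128, 'train'))

x, y = next(train_gen)
print(x.shape, y.shape)   # (128, 784) (128, 10)
print(train_steps)        # ceil(60000 / 128) = 469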

Below, we also give another example of data_func that takes only the paths to the data files and the generator hyper-parameters. All data loading and processing activities reside within data_func:

import random

import numpy as np


def load_miniCelebA(arguments):
    """
    Data loading function of miniCelebA to be used with PyGOP's algorithms

    Args:
        arguments (list): A list of arguments including:
                            - x_file (string): path to X (.npy file)
                            - y_file (string): path to Y (.npy file)
                            - batch_size (int): size of mini batch
                            - shuffle (bool): whether to shuffle minibatches

    Returns:
        gen (generator): python generator that generates mini batches of (x,y)
        steps (int): number of mini batches in the whole data

    """

    x_file, y_file, batch_size, shuffle = arguments
    X = np.load(x_file)
    Y = np.load(y_file)

    N = X.shape[0]
    steps = int(np.ceil(float(N) / batch_size))

    def gen():
        indices = list(range(N))
        while True:
            if shuffle:
                random.shuffle(indices)

            for step in range(steps):
                start_idx = step * batch_size
                stop_idx = min(N, (step + 1) * batch_size)
                batch_indices = indices[start_idx:stop_idx]

                yield X[batch_indices], Y[batch_indices]

    return gen(), steps

The complete example that uses load_miniCelebA can also be found in Face Recognition with CelebA dataset.
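
Assuming the arrays have been saved beforehand with numpy.save (the file names below are placeholders), load_miniCelebA is used in the same way:

train_arguments = ['x_train.npy', 'y_train.npy', 64, True]   # shuffle the training set
val_arguments = ['x_val.npy', 'y_val.npy', 64, False]        # keep validation order fixed

train_gen, train_steps = load_miniCelebA(train_arguments)
val_gen, val_steps = load_miniCelebA(val_arguments)

x, y = next(train_gen)   # first mini-batch of at most 64 samples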