How to Load a Large Dataset Efficiently During Training in TensorFlow

Aditya Mangal
5 min read · Mar 11, 2023

Deep learning has revolutionized the world of machine learning, and TensorFlow has emerged as a popular platform for creating and training deep learning models. However, the size of the datasets used for training can pose a significant challenge. Limited resources make it difficult to load an entire dataset into memory during training, forcing us to use batch processing to achieve optimal results. In this blog post, we'll delve into the techniques you can use to load large datasets efficiently during training in TensorFlow. So, if you're struggling with large-dataset processing, read on to find out how you can optimize your training process and achieve your desired results.

Below, I will discuss the methods by which we can train a model on a large dataset, along with the pros and cons of each.

1. Load data from a directory
2. Load data from a NumPy array
3. Load data from ImageDataGenerator
4. Load data in batches

First, hats off to the Google researchers who built TensorFlow. You can check out its official website to read more about TensorFlow and its functionality.

Let’s move to the practical part and execute the training of the model on different methods.


Dataset

In this blog post, I will walk through the process of training a simple image classification model using a Convolutional Neural Network (CNN), monitoring the resources used by each method. I will use a dataset of 10,000 images (each 150×150) from each of the two categories.

Defining the Model Architecture

Next, I will define the architecture of the CNN model. For this simple image classification task, I will use a relatively small model with two convolutional layers and two fully connected layers. Each convolutional layer will have 32 filters, use a ReLU activation function, and be followed by max pooling to reduce the spatial dimensions of the output feature maps. The fully connected layers will have 128 neurons and 1 neuron, respectively, with a sigmoid activation at the output (the appropriate choice for a single-unit binary classifier).
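The architecture described above can be sketched in Keras as follows. Exact hyperparameters such as kernel size and optimizer are not specified in the text, so the 3×3 kernels and Adam optimizer here are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model():
    # Small CNN for binary classification of 150x150 RGB images:
    # two conv layers (32 filters each) with ReLU + max pooling,
    # then dense layers of 128 and 1 units.
    model = models.Sequential([
        layers.Input(shape=(150, 150, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # single-unit binary output
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same compiled model is reused for every loading method below, so any timing differences come from the data pipeline rather than the network.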

Now that I have discussed the theory and methodology of training an image classification model with CNNs, let’s move on to the practical implementation by picking the methods mentioned above.

1. Load Data from a Directory

I will load the dataset from a directory containing one folder per category, and observe the time and space consumed while loading. The training model architecture is the same for all methods.
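A minimal sketch of this approach using Keras's built-in directory loader is shown below; the directory layout (one subfolder per class) is assumed from the description above:

```python
import tensorflow as tf

def load_from_directory(data_dir, image_size=(150, 150), batch_size=32):
    # data_dir holds one subfolder per category, e.g. data_dir/cats and
    # data_dir/dogs; labels are inferred from the folder names.
    return tf.keras.utils.image_dataset_from_directory(
        data_dir,
        labels="inferred",
        label_mode="binary",    # two categories -> single 0/1 label
        image_size=image_size,  # images are resized as they are read
        batch_size=batch_size,
        shuffle=True,
        seed=42,
    )

# train_ds = load_from_directory("data")
# model.fit(train_ds, epochs=10)
```

Because this returns a `tf.data.Dataset`, images are read from disk batch by batch rather than all at once.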

Let’s take a look at the time and space consumed during data loading and training of the model.

Left for Data Loading and Right for Model Training

2. Load Data from a NumPy Array

This method takes a good amount of memory, as we load each image and then convert it into a NumPy array. However, training the model this way takes less time than the other methods.
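A sketch of this approach is below. The helper name and the 0–255 rescaling are my own assumptions; the key point is that every image is decoded up front into one in-memory array:

```python
import numpy as np
import tensorflow as tf

def load_as_numpy(file_paths, labels, image_size=(150, 150)):
    # Decode every image eagerly and stack them into a single array.
    # Fast to train from, but the whole dataset must fit in RAM.
    images = []
    for path in file_paths:
        img = tf.keras.utils.load_img(path, target_size=image_size)
        images.append(tf.keras.utils.img_to_array(img) / 255.0)
    return np.stack(images), np.asarray(labels, dtype="float32")

# x, y = load_as_numpy(paths, labels)
# model.fit(x, y, batch_size=32, epochs=10)
```

Training is fast here because `model.fit` slices batches directly out of RAM, with no per-step disk I/O or decoding.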

Let’s take a look at the time and space consumed during data loading and training of the model.

Left for Data Loading and Right for Model Training

3. Load Data from ImageDataGenerator

This method takes a seriously large amount of memory, as we load the images, convert them into NumPy arrays, and then build the generator. Most of the time, we use it when we need image augmentation.
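A sketch of this approach is below; the specific augmentations (rotation, flip, zoom) are illustrative choices, not ones named in the text:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generator that rescales pixels and applies random augmentation;
# augmented batches are produced on the fly during training.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    horizontal_flip=True,
    zoom_range=0.2,
)

# Typically paired with flow_from_directory so labels come from folders:
# train_gen = datagen.flow_from_directory(
#     "data", target_size=(150, 150), batch_size=32, class_mode="binary")
# model.fit(train_gen, epochs=10)
```

Note that in recent TensorFlow releases `ImageDataGenerator` is deprecated in favor of `tf.data` pipelines with preprocessing layers, but it remains the go-to when augmentation is the main requirement, as described above.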

Let’s take a look at the time and space consumed during data loading and training of the model.

Left for Data Loading and Right for Model Training

4. Load Data in Batches

This method works well for my use case (1 million images and a GAN model). Because it feeds the data to the model in batches, it never loads the entire dataset into GPU memory.
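The exact batching code is not shown in the post, so the following is one possible sketch using a `tf.data` pipeline: only the file paths live in memory, and images are decoded lazily one batch at a time:

```python
import tensorflow as tf

def make_batched_dataset(file_paths, labels, batch_size=32):
    # Only the list of paths is held in memory; each image is read,
    # decoded, and resized lazily when its batch is requested.
    def _load(path, label):
        img = tf.io.decode_image(tf.io.read_file(path),
                                 channels=3, expand_animations=False)
        img = tf.image.resize(img, (150, 150)) / 255.0
        return img, label

    ds = tf.data.Dataset.from_tensor_slices((file_paths, labels))
    ds = ds.shuffle(buffer_size=1024)
    ds = ds.map(_load, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# model.fit(make_batched_dataset(paths, labels), epochs=10)
```

`prefetch` overlaps batch preparation with training, which is what keeps this approach practical even at the million-image scale mentioned above.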

Let’s take a look at the time and space consumed during data loading and training of the model.

Left for Data Loading and Right for Model Training

Conclusion

In conclusion, we have explored several methods for loading data into our models, each with its own advantages and use cases. Loading data from a directory proved the most efficient in terms of time and space consumption, but the right choice ultimately depends on the specific requirements and constraints of your project. We will explore data parallelism with the tf.distribute.Strategy API in the next blog post. I encourage you to try out each method and share your results in the comments below. Thank you for reading!

Aditya Mangal

My personal quote for overcoming problems and removing dependencies: "It's not the car, it's the driver who wins the race."