how to read batches in one hdf5 data file for training?


I have an HDF5 training dataset with shape (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network with mini-batch training data of size 128.





I want to ask:



How can I feed 128 mini-batch training samples from the whole dataset with TensorFlow each time?






3 Answers



You can read the HDF5 dataset into a NumPy array and feed slices of that array to the TensorFlow model. Pseudocode like the following would work:


import numpy, h5py

# Load the whole dataset into memory as a NumPy array
f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)

# Feed one 128-sample slice to the model per training step
for i in range(0, 21760, 128):
    sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})





Thank you. But when the number of training iterations i is large, e.g. 100000, how do I feed it?
– karl_TUM
Jul 6 '16 at 14:45







If you only have 21760 training samples, you only have 21760/128 distinct mini-batches. You have to write an outer loop around the i loop and run many epochs over the training dataset.
– keveman
Jul 6 '16 at 14:47
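A minimal sketch of that outer epoch loop, assuming sess, train_op and input are set up as in the answer above (num_epochs is a hypothetical value you would choose yourself):

import numpy, h5py

f = h5py.File('somefile.h5', 'r')
data_as_array = numpy.array(f.get('path/to/my/dataset'))

num_epochs = 10  # hypothetical; pick however many passes over the data you need

for epoch in range(num_epochs):
    # one full pass over all 21760 samples in steps of 128
    for i in range(0, 21760, 128):
        sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})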








One point confuses me. When the original data is shuffled and mini-batches are then extracted, does that mean the number of mini-batches is more than 21760/128?
– karl_TUM
Jul 6 '16 at 16:42





If your dataset is so large that it can't be imported into memory the way keveman suggested, you can use the h5py object directly:


import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128

sess = tf.Session()
train_op = ...  # tf.something_useful()
input = ...     # tf.placeholder or something

for i in range(0, data_size, batch_size):
    # slice the batch directly from the HDF5 dataset without loading it all
    current_data = data['data_set'][i:i + batch_size]
    sess.run(train_op, feed_dict={input: current_data})



You can also run through a huge number of iterations and randomly select a batch if you want to:


import random

for i in range(iterations):
    # pick a random batch start position aligned to batch_size
    pos = random.randint(0, int(data_size / batch_size) - 1) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})



Or sequentially:


for i in range(iterations):
    # cycle through the dataset in order, wrapping around after each full pass
    pos = (i % int(data_size / batch_size)) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})



You probably want to write some more sophisticated code that goes through all the data randomly, but keeps track of which batches have been used, so you don't use any batch more often than the others. Once you've done a full pass through the training set, you enable all batches again and repeat.
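A minimal sketch of that idea, assuming data, data_size, batch_size, sess, train_op and input are set up as in the snippets above (num_epochs is a hypothetical value): shuffle the list of batch start positions once per epoch, so every batch is used exactly once before any batch is repeated.

import random

num_epochs = 10  # hypothetical number of passes over the data

# all batch start positions: 0, 128, 256, ...
positions = list(range(0, data_size - batch_size + 1, batch_size))

for epoch in range(num_epochs):
    random.shuffle(positions)  # new random batch order each epoch
    for pos in positions:      # each batch is used exactly once per epoch
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})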





This approach seems logically right but I have not gotten any positive results using it. My best guess is this: Using code sample 1 above, In every iteration, the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching at 30 samples or batches per iteration, at every loop/iteration, only 30 data samples are being used, then at the next loop, everything is overwritten.
– rocksyne
Jun 30 at 14:18



alkamen's approach seems logically right but I have not gotten any positive results using it. My best guess is this: Using code sample 1 above, in every iteration, the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching at 30 samples or batches per iteration, at every loop/iteration, only 30 data samples are being used, then at the next loop, everything is overwritten.



Find below a screenshot of this approach.

[Screenshot: training always starting afresh]



As can be seen, the loss and accuracy always start afresh. I would be happy if anyone could share a possible way around this.





You tagged in some other user; my name is spelled with an 'n', not an 'm' =)
– alkanen
Jun 30 at 16:59





Your accuracy isn't reset; it does improve with each iteration and doesn't go back to zero. Are you sure that you get an entirely new batch every time you fetch a batch, and that the batches aren't highly overlapping? That would explain why your accuracy improves so much initially, because you basically re-use the same training data for each iteration. Then, when you reset the data and get new batches, you possibly randomise things again and get a new set of overlapping batches with data your net hasn't seen before.
– alkanen
Jun 30 at 17:02
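One quick, hedged way to check whether the fetched batches overlap (batch_positions and batch_size are hypothetical names for your loader's batch start offsets and batch size): count how often each sample index is used within one epoch.

from collections import Counter

seen = Counter()
for pos in batch_positions:  # hypothetical: the batch start offsets your loader used
    for idx in range(pos, pos + batch_size):
        seen[idx] += 1

repeats = {idx: n for idx, n in seen.items() if n > 1}
print('%d of %d sample indices were used more than once' % (len(repeats), len(seen)))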





Thanks for your comment. Yes, I fetch new batches every time per my algorithm, and yes, the data is shuffled, but that is what I end up with, and (I may be wrong) I have a feeling that what I said in my previous answer is what is happening. I will keep looking around; if I do find anything, I will be glad to share. And... I am sorry I didn't get your name right. Thanks for your time. Cheers!
– rocksyne
2 days ago





Okay. If it does reset and you're sure your batches don't overlap, it's probably not the data fetching that is wrong, but the model weight handling. I hope you find the problem, best of luck.
– alkanen
yesterday





Thanks a lot for your input. Very much appreciated.
– rocksyne
yesterday






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

ii28R4c
U3bnLAdxq3x Y1OyUAe0R9xEzke3TBzra,uz3CsN8JE,ElmhP,K3bo24,CBKXEi l,CN09uH,HsgLXrBJGkDF7lI

Popular posts from this blog

PySpark - SparkContext: Error initializing SparkContext File does not exist

django NoReverseMatch Exception

List of Kim Possible characters