Updated: reshape each row of data into an (x, 1) array

Recently I was reading Neural Networks and Deep Learning by Michael Nielsen (link) and wanted to test the neural network on loan default data. However, after quite a few tries I still have not managed to transform my CSV data into the matrix format required by the script.

The CSV file contains 769 variables and 1 boolean default entry. It looks like this:

    .   v1  v2  v3  ...  v770
    1.  1   2   3   ...  0
    2.  2   1   2   ...  1
    ...

This is how I do my import:

    import numpy as np
    tr_input = [np.reshape(genfromtxt('training.csv', delimiter=','), (769,10000))]
    tr_res = np.reshape(genfromtxt('training2.csv', delimiter=','),(1, 10000))
    tr_test = [np.reshape(genfromtxt('testing.csv', delimiter=','), (769,2000))]
    tr_test2 = np.reshape(genfromtxt('testing2.csv', delimiter=','), (1, 2000))
    test_data = list(zip(tr_test, tr_test2))
    training_data = list(zip(tr_input, tr_res))

However, it returns:

    Traceback (most recent call last):
      File "<ipython-input-8-de046f78e8ed>", line 3, in <module>
        net.SGD(training_data, 30, 300, 3.0, test_data = test_data)
      File "/Users/Neal/Documents/Sources/network.py", line 71, in SGD
        self.update_mini_batch(mini_batch, eta)
      File "/Users/Neal/Documents/Sources/network.py", line 85, in update_mini_batch
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
      File "/Users/Neal/Documents/Sources/network.py", line 105, in backprop
        z = np.dot(w, activation)+b
    ValueError: shapes (30,769) and (10000,) not aligned: 769 (dim 1) != 10000 (dim 0)

Updated:

I don't know why genfromtxt returns NaN for the first entry, while pandas works just fine.

After studying the original tutorial data, I think I might need to reshape each row of data into a (769, 1) array, but I don't know how to.

Attached are the links to download the neural network and my data:

Neural network: https://github.com/MichalDanielDobrzanski/DeepLearningPython35

Data: https://drive.google.com/drive/folders/1bQEqgb1o9kKNyv8_IBPlNRci5cfSYwFL?usp=sharing
(testing and training are variables and testing2 and training2 are default information booleans, 0 for no default and 1 for default).

Where is the genfromtxt() function coming from? And what is it returning?
– AChampion
Jun 29 at 14:05

Hi AChampion! Sorry for being ambiguous, genfromtxt() is from numpy to import csv to arrays!
– N_R
Jun 29 at 14:08

@MaxU Hi MaxU thanks a million for letting me know this!!! Now it's public and no sign in required. =)
– N_R
Jun 29 at 14:37

You are not supposed to update the question with another problem. You need to create a new question, because reading the data is not related to your matrix multiplications.
– 0709_
Jul 1 at 14:12

For the dot product to work, the inner dimensions have to match, for example (30,10000) with (10000,), but you passed (30,769), so the dot product won't work (this is simple matrix-to-vector or matrix-to-matrix multiplication in linear algebra). Create a new question and post the code so we can see what you are trying to accomplish.
– 0709_
2 days ago
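
To make the shape rule concrete, here is a minimal sketch that reproduces the mismatch from the traceback and shows one input shape that satisfies it. The names (w, b, bad_input, x) and the zero-filled values are assumptions for illustration only; just the shapes are taken from the error message.

```python
import numpy as np

# Shapes taken from the traceback: 30 hidden neurons, 769 input features.
w = np.zeros((30, 769))   # hypothetical first-layer weight matrix
b = np.zeros((30, 1))     # hypothetical first-layer bias vector

# A flat array of 10000 values, like the one the network received.
bad_input = np.zeros(10000)
try:
    np.dot(w, bad_input)
except ValueError as err:
    # Inner dimensions differ (769 vs 10000), so NumPy refuses.
    error_message = str(err)

# A single sample as a column vector: inner dimensions are both 769.
x = np.zeros((769, 1))
z = np.dot(w, x) + b      # result has shape (30, 1)
```

The point is that np.dot(w, x) needs the second dimension of w to equal the first dimension of x, so each sample has to arrive as 769 values, not 10000.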

2 Answers

You can use either NumPy or pandas for reading such CSV files. I prefer pandas:

    import pandas as pd
    X_train = pd.read_csv(r'D:\download\training.csv', header=None, dtype='float64')

Result:

    In [18]: X_train.shape
    Out[18]: (10000, 769)

If you want to transpose it so that it has the shape 769 x 10000:

    In [19]: X_train = X_train.T

    In [20]: X_train.shape
    Out[20]: (769, 10000)
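
As far as I can tell from the tutorial code, network.py does not take one big matrix in either orientation; it iterates over a list of (x, y) tuples where each x is a column vector of shape (n_features, 1). The sketch below is one way to build that list from an untransposed (n_samples, n_features) array. The helper name to_network_format, the two-class one-hot encoding for training labels, and the small random arrays are my assumptions, not part of the tutorial; a pandas DataFrame would first need to be converted with .to_numpy().

```python
import numpy as np

def to_network_format(features, labels, n_classes=2, one_hot=True):
    """Hypothetical helper: build a list of (x, y) tuples, one per sample.

    features: array of shape (n_samples, n_features)
    labels:   array of shape (n_samples,) holding integer class labels
    """
    pairs = []
    for row, label in zip(features, labels):
        x = row.reshape(-1, 1)               # (n_features,) -> (n_features, 1)
        if one_hot:
            y = np.zeros((n_classes, 1))     # column vector target
            y[int(label)] = 1.0
        else:
            y = int(label)                   # plain integer label
        pairs.append((x, y))
    return pairs

# Small synthetic stand-in for the CSV contents (5 samples, 4 features).
rng = np.random.default_rng(0)
features = rng.random((5, 4))
labels = np.array([0, 1, 1, 0, 1])

training_data = to_network_format(features, labels)
test_data = to_network_format(features, labels, one_hot=False)
```

With the real files, features would have shape (10000, 769) and each x would come out as (769, 1). The one-hot targets assume a network with two output neurons, for example sizes [769, 30, 2]; that choice is a guess on my part and should be checked against how network.py evaluates test data.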

Thanks for taking the time to go through my problem! It is actually better to use pandas. I don't know why NumPy gives me NaN for the first entry, so I had to manually edit it to 0. But I think the form is not exactly what np.dot() wants, and it says 'shapes (1,769) and (30,769) not aligned: 769 (dim 1) != 30 (dim 0)'. It would be much appreciated if you could help me with this issue, and thanks so much already!
– N_R
Jun 29 at 17:28

I checked the original data provided by the tutorial and it should be something like tuples in a list and inside the tuple each number should be in its own list. Do you think there is somewhere I should be looking at? Thanks!
– N_R
Jun 29 at 17:31

In your code you have some errors:

One, spotted in the comments, is genfromtxt: based on your import statement (import numpy as np) you weren't actually referring to the module, so it has to be called as np.genfromtxt. The function already returns a NumPy array object, so you can reshape the result directly. Here is the NumPy documentation for reading text files and for reshape: numpy.genfromtxt and numpy.reshape

Here is a more "readable" way of reading the data and reshaping it (I hate getting lost in so many parentheses; it is not easy to read and you need to be careful about the order), so as you can see this is the proper way of reading data, as with genfromtxt you need ...

So, to "translate" the steps:

for the tr_input variable: read the raw data, which based on the file has shape (10000, 769), and reshape it so the data has shape (769, 10000)

for the tr_res variable: read the raw data, which based on the file has shape (10000,), and reshape it so the data has shape (1, 10000)

for the tr_test variable: read the raw data, which based on the file has shape (2000, 769), and reshape it so the data has shape (769, 2000)

for the tr_test2 variable: read the raw data, which based on the file has shape (2000,), and reshape it so the data has shape (1, 2000)

Note: As you can see, the errors are solved. One was the way the package module was used, and the error pointed out in the question comes from the shape of the data ... you can't reshape an array of (2000,) into (769,10000) ... your testing.csv is of the format (2000,)

    # readable format preferred by me, not a standard.
    # https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
    # https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html
    import numpy as np
    tr_input = np.genfromtxt('/Users/sb0709/Desktop/dqn-pytorch/data_loan/training.csv', delimiter=',').reshape(769,10000)
    tr_res = np.genfromtxt('/Users/sb0709/Desktop/dqn-pytorch/data_loan/training2.csv', delimiter=',').reshape(1, 10000)
    tr_test = np.genfromtxt('/Users/sb0709/Desktop/dqn-pytorch/data_loan/testing.csv', delimiter=',').reshape(769,2000)
    tr_test2 = np.genfromtxt('/Users/sb0709/Desktop/dqn-pytorch/data_loan/testing2.csv', delimiter=',').reshape(1, 2000)
    # creating the lists
    test_data = list(zip(tr_test, tr_test2))
    training_data = list(zip(tr_input, tr_res))
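
One thing that may be worth checking with the reshape calls above: reshape keeps the elements in their original row-major order and only regroups them, so turning (10000, 769) into (769, 10000) this way is not the same as transposing. The tiny array below is a made-up stand-in (2 samples, 3 features) meant only to show the difference.

```python
import numpy as np

# Hypothetical toy data: 2 samples (rows) with 3 features (columns).
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# reshape regroups the flat sequence 1..6 into rows of two values,
# so features from different samples end up mixed in one row.
reshaped = a.reshape(3, 2)

# The transpose swaps axes: row i now holds feature i of every sample.
transposed = a.T

same_layout = np.array_equal(reshaped, transposed)
```

Here reshaped is [[1, 2], [3, 4], [5, 6]] while transposed is [[1, 4], [2, 5], [3, 6]], so if a feature-by-sample layout is what you want, .T (or np.transpose) is probably the safer call.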

The output is listed here as well:

[Screenshot: tr_test data format and output]

[Screenshot: test_data and training_data, printing what is inside]

Note: if the data has a header row with variable names, then pass skip_header=1 to skip the first row when reading it; the number of rows in the data will then also be smaller by 1.
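
A small sketch of the skip_header behaviour, using an in-memory string instead of a file so it can be run anywhere. The column names v1 and v2 and the values are invented for the example.

```python
import io
import numpy as np

# Hypothetical CSV content with one header row and two data rows.
csv_text = "v1,v2\n1,2\n3,4\n"

# Without skip_header the header row is parsed as floats and becomes NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# With skip_header=1 the first line is dropped before parsing.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
```

In this sketch with_header has three rows, the first of which is all NaN, and without_header has only the two numeric rows.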

Thank you @0709_ for the explicit answer! I think it returns the same error: 'ValueError: shapes (30,769) and (10000,) not aligned: 769 (dim 1) != 10000 (dim 0)'. My guess is that NumPy needs (x,1) rather than (x,), but I don't know how to turn each row of the table into a new array with shape (x,1). I also noticed that with genfromtxt the first entry is NaN, which doesn't happen with pandas.
– N_R
Jul 1 at 9:49

If you have a header, then NumPy will give you NaN (I don't see that being the case here, because once the data is read everything is clear). For the other problem you need to open a new question and provide the code. You can't just bring new data into that tutorial code without modifying it; it simply won't work, because you are building the NN from scratch and it seems you are missing the dot product part. I suggest checking matrix/vector multiplication, because there are some rules and you are trying to multiply matrices of incompatible sizes: docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html
– 0709_
Jul 1 at 17:45
