I am working on a dataset comprising almost 17,000 data points. Since it is a financial dataset whose components are many different companies, I necessarily have to split it by date. Supposing I have 10 years of data, I train on the first 8 years and test on the remaining 2. I am fairly sure this approach is consistent with the classification problem I need to solve.
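A chronological split like this can be sketched as follows (the column names and the cutoff date are assumptions for illustration, not from my actual dataset):

```python
import pandas as pd

# Toy panel: one row per (company, date) observation
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-06-30", "2015-06-30", "2017-06-30", "2019-06-30"]),
    "company": ["A", "A", "B", "B"],
    "ret": [0.02, -0.01, 0.05, 0.03],
})

# Train on the first 8 years, test on the last 2:
# the cutoff is chosen by date, never by random shuffling
cutoff = pd.Timestamp("2018-01-01")
df_train = df[df["date"] < cutoff]
df_test = df[df["date"] >= cutoff]
```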

I am using an LSTM network to predict the direction of financial returns from a set of features derived from companies' financial statements. Given that I obtain training accuracy greater than test accuracy with almost any architecture and hyperparameter configuration, I suspect there is something wrong in the way I have manipulated the dataset.

Here are my concerns. I labelled my dataset by looking at the median return, assigning 1 if the return for a single data point (a company's value at a specific date) is above that median, 0 otherwise. Am I correct to compute two different medians, i.e. labelling the training set using its own median return and, in the same way, the test set using its own? Or should I compute the median over the entire dataset, label it, and then split?

Moreover, I scaled the training data to the range (0, 1). Should I apply the same kind of normalization to my test set? I did, but I wasn't sure about it.

This is more or less my first application of neural networks, and I need these clarifications about how to treat the dataset without biasing the results.

If I understand correctly, I should compute the median only on the train set and then label both the train and test sets according to that median. This is reasonable to me. The scaling I did simply bounds each feature of the train set to the range (0, 1) using

`MinMaxScaler`

from scikit-learn. I suppose, therefore, that I have to do the same kind of scaling on the test set. – Alexbrini – 2019-01-15T11:16:54.737

Yes. For the scaler, you call the

`fit`

method with the training data so it learns the bounds, and then use the

`transform`

method to transform both your training and test data. – gunes – 2019-01-15T11:26:40.290

I have corrected my code to do the following:

```python
scaler = MinMaxScaler()
minmax_scale = scaler.fit(df_train)
x_train = minmax_scale.transform(df_train)
x_test = minmax_scale.transform(df_test)
```

It doesn't change my out-of-sample accuracy much, but I think that is also due to the nature of my data. Anyway, I am now handling the test set correctly. – Alexbrini – 2019-01-15T13:41:37.177
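Putting the labelling advice and the corrected scaling together, a minimal end-to-end sketch might look like this (the feature matrices and return vectors are synthetic placeholders, not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical features and returns for the train/test periods
df_train = rng.normal(size=(100, 4))
df_test = rng.normal(size=(40, 4))
ret_train = rng.normal(size=100)
ret_test = rng.normal(size=40)

# Compute the median on the TRAINING returns only, then label both sets
# with it, so no information from the test period leaks into the labels.
median_train = np.median(ret_train)
y_train = (ret_train > median_train).astype(int)
y_test = (ret_test > median_train).astype(int)

# Fit the scaler on the training features only; transform both sets with it.
# Test features may fall slightly outside (0, 1), which is expected.
scaler = MinMaxScaler().fit(df_train)
x_train = scaler.transform(df_train)
x_test = scaler.transform(df_test)
```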