The ultimate guide to time series

Henrique Peixoto Machado
5 min read · Aug 10, 2020

It’s normal at the beginning of the Data Science journey: you start studying some really basic stuff like housing price trends, then you study a little bit of time series, and after that you go straight into computer vision and other really hard stuff.

So far, as a Data Analyst, I have never faced a problem in the workplace that I needed computer vision to solve, but every week there is a Time Series problem.

So let’s take a deep dive into time series:

The life of all data beginners

First of all, this was always a question in my mind: how much data do you need for a time series analysis to be worth it?

Sometimes in corporations people want to build an ML project without having enough data, and in time series this is a real issue, because it is not always easy to recover data for things that happened some time ago. With this in mind, the recommendation is that to start a time series analysis you need at least 50 points of interest, but it’s better to have at least 100. Some specialists also recommend that if you are going to analyse something that happens every month, you need at least 2 years of data.

That is because for your model to learn, it needs to see every period more than once to start picking up the seasonality your data may have.

So if you don’t have this much data, it is highly recommended that you just fit a simple linear regression.
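
As a minimal sketch (with a made-up series, and the time index as the only feature), that fallback could be as simple as:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical short series: too few points for a proper time series model.
y = np.array([10.0, 12.0, 11.5, 13.0, 14.2, 15.1])
t = np.arange(len(y)).reshape(-1, 1)  # time index as the only feature

trend = LinearRegression().fit(t, y)
next_value = trend.predict([[len(y)]])  # extrapolate one step ahead
```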

Another thing you need to pay attention to is how to split the data into train and test: instead of splitting randomly, you need to make a cut on the timeline, and this cut should always respect the seasonality of your data. To make it clearer, let’s look at the example below:

Let’s say you need to analyse some data whose pattern repeats every 3 months (seasonality); you would split it like this:
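
Here is a toy version of that cut, assuming a made-up quarterly sales series:

```python
import pandas as pd

# Hypothetical quarterly series: 3 full years (12 quarters) of sales.
dates = pd.date_range("2017-01-01", periods=12, freq="QS")
sales = pd.Series(range(12), index=dates)

# Cut on the timeline at a season boundary instead of splitting randomly.
train = sales[:"2018-12-31"]  # the first 2 full years
test = sales["2019-01-01":]   # the last full year
```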

This way you have the seasons as a whole in both train and test, and the cut is not random.

Another tip that helps your predictions get more accurate: after you train the model and make all the adjustments, you should use all your data as train before making the final predictions. On the example above, it would look like this:
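
In sketch form, reusing the toy quarterly idea from above (with a simple trend model standing in for whatever model you actually tuned):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Once the model is validated on the held-out year, refit it on the
# FULL timeline (train + test) before forecasting the next quarter.
y = np.arange(12, dtype=float)        # all 12 quarters, made-up values
t = np.arange(len(y)).reshape(-1, 1)

final_model = LinearRegression().fit(t, y)  # trained on ALL the data this time
q1_2020 = final_model.predict([[len(y)]])   # first quarter after the data ends
```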

That is because, unlike other ML models, in time series the data right before the prediction impacts the result right after it. Think of it like this: if every Black Friday sales increase around 10%, then to make a good prediction of how much you’re going to sell, it would be good to know how much you sold in October.

With these simple steps I believe it is possible to put a good time series model into action, so now I’ll be doing a Kaggle competition using an LSTM so you can see these things in action:

If you just want to check the code, go straight to the end. :)

First we will start by importing the packages that we will use:
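
The full notebook is linked at the end; based on what the rest of this walkthrough uses (TensorFlow, a non-shuffled split, normalization and plots), the imports would look something like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
```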

And then we will prepare the data to run our model:
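
A sketch of that preparation, assuming a DataFrame `df` with a single `visits` column holding one page’s daily traffic (the column name and the 30-day window are assumptions, not the competition’s exact layout):

```python
values = df["visits"].values.astype("float32").reshape(-1, 1)

# Chronological split: shuffle=False keeps the timeline intact.
train, test = train_test_split(values, test_size=0.2, shuffle=False)

# Normalize with statistics learned on train only, to avoid leakage.
scaler = MinMaxScaler()
train = scaler.fit_transform(train)
test = scaler.transform(test)

# Turn the series into (window of past values, next value) pairs.
def make_windows(series, window=30):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

X_train, y_train = make_windows(train)
X_test, y_test = make_windows(test)
```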

Please notice that when separating the data into train and test I used the option shuffle=False, so the data won’t be randomized. Then we normalized the data, because machine learning models tend to work better with normalized data.

Now let’s create the model:
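
In sketch form (matching the description below; the relu activations are an assumption):

```python
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True,
                         input_shape=(X_train.shape[1], 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # single neuron: the predicted value
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.Huber(),
              metrics=["mse"])

history = model.fit(X_train, y_train, epochs=200,
                    validation_data=(X_test, y_test))
```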

For the model I used TensorFlow (❤). The model created is 2 LSTM layers with 64 neurons each and 3 dense layers with 32, 16 and 1 neurons. There are some details about this model I would like to point out:

First, whenever you stack more than one LSTM layer, you need to set return_sequences=True on every LSTM except the last, or it won’t pass the sequence of outputs on to the next one.

The last dense layer has only 1 neuron because it is the layer that outputs the prediction.

And for the model we use Adam as the optimizer, Huber as the loss function and mean squared error as the evaluation metric.

Now let’s check the performance:
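
A sketch of how that comparison can be plotted (colors set explicitly to match the description below):

```python
preds = model.predict(X_test)

# Undo the normalization so the plot is on the original scale.
actual = scaler.inverse_transform(y_test)
predicted = scaler.inverse_transform(preds)

plt.plot(actual, color="orange", label="actual")
plt.plot(predicted, color="blue", label="predicted")
plt.legend()
plt.show()
```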

Since I’m a very visual person, I like to plot everything. As you can see, our predicted values (blue) were pretty close; we were even able to predict that outlier value.

And here is the learning curve of the model, i.e. how the loss evolved over the epochs:
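
Plotting the loss per epoch from the history object returned by fit:

```python
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("Huber loss")
plt.legend()
plt.show()
```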

As you can see in the graph above, after 25 epochs our model just stops learning, so it would be recommended to set the number of epochs to somewhere around 25 instead of 200 and rerun the model. This is proof that bigger is not always better! 👊

What are the next steps?

What I did here was just enough to show the concepts above; to improve the results we would need to do feature engineering and also adjust the number of epochs.

Just to point out: as with everything in machine learning, feature engineering is the most important step of all.

This code is on Kaggle and on GitHub; here are the links:

https://github.com/henriquepeixoto/Data-science-cool-stuff/blob/master/Kaggle%20stuff/httpswww.kaggle.comcweb-traffic-time-series-forecasting%20-%20Kaggle/Tensorflow%20-%20LSTM.ipynb

https://www.kaggle.com/henriquepeixoto/tensorflow-lstm

And if you wanna read more about data stuff, stay tuned to my Medium page https://medium.com/@h.peixoto.m and my LinkedIn is always open too: https://www.linkedin.com/in/peixoto-henrique/
