Let’s dive in. A standard dense layer assumes no temporal order. It doesn't know that the word following "I ate" is likely food-related, or that yesterday's stock price influences today's. RNNs solve this with a hidden state — a vector that gets passed from one time step to the next. The Simple RNN (Vanilla RNN) The simplest form has a loop. At each time step t , it takes the current input x_t and the previous hidden state h_t-1 , and produces a new hidden state h_t .
They can remember information for hundreds of steps, making them ideal for text generation, speech recognition, and complex time series. GRU (Gated Recurrent Unit) GRUs are a simpler, faster alternative to LSTMs. They merge the forget and input gates into a single "update gate" and combine the cell state with the hidden state. GRUs perform similarly to LSTMs on many tasks but with fewer parameters. Let’s dive in
h_t = T.tanh(T.dot(x_t, W_xh) + T.dot(h_prev, W_hh) + b_h) RNNs solve this with a hidden state —
import numpy as np from keras.models import Sequential from keras.layers import GRU, Dense def generate_sine_wave(seq_length, num_samples): X, y = [], [] for _ in range(num_samples): start = np.random.uniform(0, 4*np.pi) seq = np.sin(np.linspace(start, start + seq_length, seq_length + 1)) X.append(seq[:-1].reshape(-1, 1)) y.append(seq[-1]) return np.array(X), np.array(y) They can remember information for hundreds of steps,
In Python (with Theano-style tensors), a naive implementation looks like:
Vanilla RNNs suffer from the vanishing/exploding gradient problem — they can't learn long-range dependencies (e.g., information from 50 steps ago). This is where LSTM and GRU come in. LSTM (Long Short-Term Memory) LSTMs introduce a cell state (a conveyor belt of information) and three gates: forget, input, and output. These gates learn what to remember, what to write, and what to output.
h_t = tanh(W_x * x_t + W_h * h_t-1 + b)