Learn time embeddings?!

  • Main idea: provide a model-agnostic representation of time, specifically a learned vector embedding of time
  • Related work:
    • model time directly with bins, point processes and flows
    • time decomposition techniques that encode a temporal signal into a set of frequencies, e.g. Fourier transforms (but here the frequencies can be learned)
    • concatenate time with input
    • new neural architectures that take time into account (e.g. TimeLSTM)
  • LSTM + T: time is treated as just another feature, concatenated with the input and fed to a standard LSTM (sketched below)
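A minimal sketch of this baseline (my reconstruction, not the paper's code; batch-first tensors and one scalar timestamp per step assumed):

```python
import torch
import torch.nn as nn

class LSTMPlusT(nn.Module):
    """LSTM + T baseline: the timestamp is just one more input feature."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # +1 input dimension for the scalar time feature
        self.lstm = nn.LSTM(input_size + 1, hidden_size, batch_first=True)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, input_size), t: (batch, seq, 1)
        out, _ = self.lstm(torch.cat([x, t], dim=-1))
        return out
```

Swapping t for its t2v embedding (defined below) gives the LSTM + t2v variant the paper compares against.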
  • TimeLSTM: adds time gates to the architecture of an LSTM with peephole connections
  • Desirable properties:
    1. Periodicity
    2. Invariance to rescaling
    3. Simplicity
  • time2vec formulation: \(\mathbf{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \varphi_i, & \text{if } i = 0 \\ \mathcal{F}(\omega_i \tau + \varphi_i), & \text{if } 1 \leq i \leq k \end{cases}\)
    • $\mathcal F$ is a periodic activation function (sine in the paper's experiments), and $\omega_i, \varphi_i$ are the frequencies and phase shifts, respectively (both learnable)
    • the linear term ($i = 0$) captures non-periodic patterns that progress with time (see the sketch below)
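A minimal PyTorch sketch of the layer (my reconstruction from the formula above, not the authors' code), using sine as \(\mathcal F\):

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """t2v(tau): one linear component plus k periodic (sine) components."""
    def __init__(self, k: int):
        super().__init__()
        # learnable frequencies and phase shifts for all k + 1 components
        self.omega = nn.Parameter(torch.randn(k + 1))
        self.phi = nn.Parameter(torch.randn(k + 1))

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (..., 1) scalar time(s) -> (..., k + 1) embedding
        v = self.omega * tau + self.phi
        # component i = 0 stays linear; components 1..k are periodic
        return torch.cat([v[..., :1], torch.sin(v[..., 1:])], dim=-1)
```

For example, Time2Vec(63) maps each scalar timestamp to a 64-dimensional vector that can replace the raw time feature in any of the models above.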
  • different datasets used:
    • synthetic dataset (classification): integers representing the day of the year, labeled positive if divisible by 7
    • event mnist (classification): pixel intensities fed as a time series
    • tidigits (classification): input is the times and channel indices (out of 64) that are activated
    • stack overflow: sequence of badges, predict next badge (recommendation)
    • last.fm: listening history (recommendation)
    • citeulike: what users posted and when (recommendation)
  • metrics: accuracy and recall@q
    • recall@q: for each test case, generate a recommendation list of k items (k-1 random items plus the correct one) and let the model rank them; recall@q is the percentage of lists whose correct item appears in the top q (sketched below)
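A small sketch of this protocol (model_score and the sampling details are my hypothetical stand-ins, not the paper's code):

```python
import random

def recall_at_q(model_score, test_events, all_items, k: int, q: int) -> float:
    """Fraction of candidate lists where the true item lands in the top q.

    model_score(history, item) -> float: higher means more likely next item.
    test_events: iterable of (history, true_next_item) pairs.
    """
    hits = 0
    for history, target in test_events:
        # k - 1 random negatives plus the ground-truth item
        negatives = random.sample([i for i in all_items if i != target], k - 1)
        ranked = sorted(negatives + [target],
                        key=lambda item: model_score(history, item),
                        reverse=True)
        hits += target in ranked[:q]
    return hits / len(test_events)
```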
  • Main questions:
    • is it effective?
      • on all datasets, t2v does no harm, and on most it improves results
      • can be used with long sequences and time horizons
    • used with other architectures?
      • yes, e.g. t2v can be plugged into TimeLSTM in place of its raw time input
    • what do the sine functions learn?
      • learns correct period and phase-shifts
    • can non-periodic activation functions be learned?
      • non-periodic activations cannot capture periodic behavior; they also saturate as time becomes large, and with ReLU there are exploding/vanishing gradients
    • should the frequencies even be learned?
      • not always clearly better, but for asynchronous time series it may help to learn them
      • the parameter overhead is small, and learning them does not hurt (cf. the fixed sinusoidal positional encodings of “Attention Is All You Need”)
  • Conclusion: a learned vector representation of time that captures both periodic and non-periodic patterns, is invariant to time rescaling, and is portable across models