Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage

🚀 Technical Novelty

  • Mechanism: Introduces “HD-Gist tokens” (Gist_arg for arguments, Gist_value for categorical values) paired with dynamic attention masking that unmasks value-level tokens only when they are needed, plus a reconstruction loss that preserves the compressed information.
  • Nuance: Differs from prior static gist-token compression by using hierarchical granularity and dynamic zoom-in/out capabilities, avoiding the severe accuracy drop of fixed-compression methods while maintaining KV-cache efficiency without external encoders.
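
A minimal sketch of how such a hierarchical, dynamic mask could be built, to make the zoom-in idea concrete. Everything here is an illustrative assumption rather than the paper's implementation: the `Token` type, the four token kinds, the `zoomed_args` set, and the rule that compression-phase tokens use plain causal attention are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    kind: str                  # "doc", "gist_arg", "gist_value", or "query" (hypothetical labels)
    arg: Optional[str] = None  # argument this token belongs to, if any


def build_mask(tokens: List[Token], zoomed_args: Set[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, q in enumerate(tokens):
        for j in range(i + 1):  # causal: only positions up to and including i
            k = tokens[j]
            if q.kind != "query":
                # Assumption: documentation and gist tokens use plain causal
                # attention so the gists can read the text they compress.
                mask[i][j] = True
            elif k.kind in ("query", "gist_arg"):
                # Later tokens always see argument-level gists and each other.
                mask[i][j] = True
            elif k.kind == "gist_value":
                # Value-level gists are unmasked only for zoomed-in arguments.
                mask[i][j] = k.arg in zoomed_args
            # Raw documentation tokens stay hidden from query tokens.
    return mask


def visible(mask: List[List[bool]], i: int) -> List[int]:
    """Positions that row i is allowed to attend to."""
    return [j for j, ok in enumerate(mask[i]) if ok]


# Two arguments, A and B, each with one doc token, one value gist, one argument gist.
tokens = [
    Token("doc", "A"), Token("gist_value", "A"), Token("gist_arg", "A"),
    Token("doc", "B"), Token("gist_value", "B"), Token("gist_arg", "B"),
    Token("query"), Token("query"),
]

zoomed_out = visible(build_mask(tokens, set()), 6)  # argument gists only
zoomed_in = visible(build_mask(tokens, {"B"}), 6)   # also B's value gist
```

In this toy layout, zooming in on argument B adds exactly one position (B's value gist) to what the first query token can see, while the raw documentation tokens stay masked in both cases.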

💡 Yield

  • Achieves 56.68% joint-goal accuracy on SGD (vs ~41% for static gist) and outperforms the uncompressed LLaMA baseline by 4.5% on the paraphrased SGD-X test set, demonstrating that compression acts as a regularizing bottleneck that improves out-of-domain generalization.
  • Reduces inference memory by ~32.5% and compute by ~30%, and cuts the average number of attended documentation tokens from ~109 to ~5, while keeping CUDA decoding time comparable to static gist caching.

⚠️ Limitations

  • Speedup is inherently limited because decoding FLOPs in a Transformer are dominated by the feed-forward layers rather than by attention over cached keys/values.
  • Requires structured hierarchical formatting and ground-truth labels during training to supervise the reconstruction objective and dynamic masking scheme.
  • Generalization to highly unstructured free-text compression is theoretically proposed but not empirically validated in this work.
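
The first limitation can be illustrated with a rough per-layer FLOP count for decoding one token. This is a back-of-the-envelope sketch, not the paper's accounting: `decode_flops_per_layer` is a hypothetical helper, the hidden size of 4096 is an assumption (the summary does not state the model size), a multiply-add is counted as 2 FLOPs, the feed-forward block is assumed to hold about 8·d² weights, and norms, softmax, and embeddings are ignored. It is not meant to reproduce the figures reported above.

```python
def decode_flops_per_layer(d_model: int, n_context: int) -> dict:
    """Approximate FLOPs to decode one token in one Transformer layer."""
    proj = 8 * d_model ** 2         # Q, K, V, and output projections: 4 d-by-d matrices
    ffn = 16 * d_model ** 2         # feed-forward block, assuming ~8 * d^2 weights
    attn = 4 * n_context * d_model  # QK^T scores plus weighted sum over cached values
    return {"proj": proj, "ffn": ffn, "attn": attn, "total": proj + ffn + attn}


D_MODEL = 4096  # assumed hidden size, for illustration only

full = decode_flops_per_layer(D_MODEL, n_context=109)      # ~109 attended doc tokens
compressed = decode_flops_per_layer(D_MODEL, n_context=5)  # ~5 attended doc tokens

# Fraction of per-layer decoding FLOPs removed by attending to fewer tokens.
saving = (full["total"] - compressed["total"]) / full["total"]
print(f"attention share before compression: {full['attn'] / full['total']:.4f}")
print(f"per-layer decoding FLOPs saved:      {saving:.4f}")
```

Under these assumptions the context-dependent attention term is well under 1% of the per-layer total, so shrinking the attended context from ~109 to ~5 tokens barely changes the cost of a single decoding step; the projection and feed-forward terms do not depend on context length at all.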