Preprocessing for PPM: Compressing UTF-8 Encoded Natural Language Text | Teahan | AIRCC's International Journal of Computer Science and Information Technology