A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM

Ebru Celikel

2005

Abstract

The problem of language discrimination arises when many texts in different source languages are at hand but it is not known which language each belongs to. This is often the case during information retrieval over the Internet. We propose a cryptographic solution to the language identification problem: employing the Prediction by Partial Matching (PPM) model, we generate a language model and then use this model to discriminate between languages. PPM is a lossless compression tool based on an adaptive statistical model. It achieves compression rates (measured in bits per character, bpc) far better than those of many other conventional lossless compression tools. We report language identification results obtained on sample texts from corpora in five languages: English, French, Turkish, German and Spanish. The success rates show that the performance of the system is highly dependent on the diversity, as well as on the target text and training text file sizes. The results also indicate that the PPM model is highly sensitive to the input language. From a cryptographic standpoint, if the training text itself is kept secret, our language identification system would provide a promising degree of security.
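The idea of discriminating languages by compression rate can be illustrated with a minimal sketch. It is not the paper's PPM implementation: a fixed-order character model with add-one smoothing stands in for PPM's adaptive, escape-based model, and the names `train`, `bits_per_char`, `identify`, `ORDER` and `ALPHABET` are all hypothetical. Each language gets a model built from training text, and a target text is assigned to the language whose model codes it in the fewest bits per character.

```python
import math
from collections import Counter, defaultdict

ORDER = 2       # context length in characters (assumed, not taken from the paper)
ALPHABET = 256  # assumed symbol-space size used by the add-one smoothing


def train(text, order=ORDER):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts


def bits_per_char(model, text, order=ORDER):
    """Cross-entropy of `text` under the trained model, in bits per character."""
    total_bits = 0.0
    n = 0
    for i in range(order, len(text)):
        followers = model.get(text[i - order:i], Counter())
        # Add-one smoothing is a simple stand-in for PPM's escape mechanism.
        p = (followers[text[i]] + 1) / (sum(followers.values()) + ALPHABET)
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n if n else float("inf")


def identify(models, text):
    """Return the language whose model gives the lowest bpc for `text`."""
    return min(models, key=lambda lang: bits_per_char(models[lang], text))
```

In this sketch, `models` would be a dictionary mapping a language name to the result of `train` on a training text in that language; longer training and target texts should make the bpc gap between languages more reliable, in line with the file-size dependence noted in the abstract.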

Paper Citation


in Harvard Style

Celikel E. (2005). A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 972-8865-19-8, pages 213-219. DOI: 10.5220/0002556102130219

in Bibtex Style

@conference{iceis05,
author={Ebru Celikel},
title={A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2005},
pages={213-219},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002556102130219},
isbn={972-8865-19-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM
SN - 972-8865-19-8
AU - Celikel E.
PY - 2005
SP - 213
EP - 219
DO - 10.5220/0002556102130219