Document representations for classification of short web-page descriptions

Authors

  • Miloš Radovanović Faculty of Science, Department of Mathematics and Informatics, Novi Sad
  • Mirjana Ivanović Faculty of Science, Department of Mathematics and Informatics, Novi Sad

DOI:

https://doi.org/10.2298/YJOR0801123R

Keywords:

text categorization, document representation, machine learning

Abstract

Motivated by the application of Text Categorization to the classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.
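
The transformations and metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the common textbook definitions logtf = log(1 + tf), idf = log(N / df) and Euclidean (L2) normalization, which may differ in detail from the variants used in the study. Stemming and dimensionality reduction are left out, and all function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Raw term frequencies, one Counter per tokenized document."""
    return [Counter(doc) for doc in docs]


def logtf(tf):
    """Dampen raw counts; assumes the form log(1 + tf)."""
    return {term: math.log(1 + freq) for term, freq in tf.items()}


def idf_weights(tfs):
    """Inverse document frequency; assumes the form log(N / df)."""
    n_docs = len(tfs)
    doc_freq = Counter(term for tf in tfs for term in tf)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def apply_idf(vec, idf):
    """Scale each term weight by the idf of that term."""
    return {term: weight * idf[term] for term, weight in vec.items()}


def normalize(vec):
    """Scale a document vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta=1 gives F1, beta=2 gives F2 (recall weighted higher)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Two toy "Web-page descriptions", already tokenized (illustrative data).
    docs = [["web", "page", "web"], ["page", "topic"]]
    tfs = bag_of_words(docs)
    idf = idf_weights(tfs)
    for tf in tfs:
        print(normalize(apply_idf(logtf(tf), idf)))
    print(f_beta(0.5, 1.0), f_beta(0.5, 1.0, beta=2))
```

Keeping each transformation as a separate function mirrors the way the study treats them: as independent switches that can be combined, so that the effect of each one can be examined in isolation.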

Published

2008-03-01

Section

Research Articles