An automatic indexing technique for Thai texts using frequent max substring

T. Chumwatana; K.W. Wong; H. Xie

Conference paper

Open access

IEEE

8th International Symposium on Natural Language Processing, SNLP '09 (Bangkok Thailand, 20/10/2009–22/10/2009)

2009

Published (Version of Record) Open Access

Abstract

Thai is a non-segmented language: words are written as strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. This makes indexing a central problem in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment the texts into index terms. Although index terms can be specified manually by experts, this process is very time-consuming and labor-intensive. Word segmentation is one of many techniques used to extract index terms from Thai texts automatically. However, most word segmentation techniques require linguistic knowledge, and preparing these approaches is time-consuming. The n-gram based approach is another automatic index term extraction method that is often used as an indexing technique for Asian languages, including Thai. It is language-independent and does not require any linguistic knowledge or dictionary. Although the n-gram approach outperforms many indexing techniques for Asian languages in terms of retrieval effectiveness, it suffers from large storage requirements and long retrieval times. In this paper we present frequent max substring mining as a way to extract index terms from Thai texts. Our method is language-independent and does not rely on any dictionary or grammatical knowledge. Frequent max substring mining is based on text mining, the process of discovering useful information or knowledge from unstructured texts. The approach analyses frequent max substring sets to extract all long, frequently occurring substrings. We aim to employ the frequent max substring mining algorithm to address the drawbacks of the n-gram based approach: keeping only frequent max substrings reduces the disk space required to store index terms and reduces retrieval time, which helps cope with the rapid growth of Thai texts.
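
The contrast the abstract draws between n-gram index terms and frequent max substrings can be illustrated with a minimal Python sketch. It is not the authors' algorithm. The abstract does not define a frequent max substring precisely, so the sketch assumes one reading: a substring that occurs at least `min_freq` times and cannot be extended by one character without occurring less often. The function names, the `min_freq` parameter, and the brute-force enumeration are all hypothetical choices made for illustration, and a Latin toy string stands in for Thai text.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def substring_counts(text: str) -> Counter:
    """Count occurrences (overlaps allowed) of every substring.

    Brute force, quadratic in the text length; for illustration only.
    """
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            counts[text[i:j]] += 1
    return counts


def frequent_max_substrings(text: str, min_freq: int = 2) -> dict:
    """Return substrings that are frequent and maximal.

    ASSUMED definition (not taken from the paper): a substring is kept
    if it occurs at least min_freq times and no substring one character
    longer that contains it occurs equally often.
    """
    counts = substring_counts(text)
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    result = {}
    for s, c in frequent.items():
        # Checking one-character extensions is enough: if a longer
        # superstring has the same count, so does every string between.
        extended = any(
            len(t) == len(s) + 1 and ct == c and s in t
            for t, ct in frequent.items()
        )
        if not extended:
            result[s] = c
    return result


if __name__ == "__main__":
    text = "abcabcab"
    print(len(char_ngrams(text, 3)))          # distinct 3-gram terms: 3
    print(frequent_max_substrings(text, 2))   # {'ab': 3, 'abcab': 2}
```

On this toy input the twelve frequent substrings collapse to two retained terms, which is the kind of reduction in stored index terms the abstract aims for; how the saving scales on real Thai corpora is not something this sketch can show.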

Details

Title: An automatic indexing technique for Thai texts using frequent max substring
Authors/Creators: T. Chumwatana (Author/Creator)
K.W. Wong (Author/Creator)
H. Xie (Author/Creator)
Conference: 8th International Symposium on Natural Language Processing, SNLP '09 (Bangkok Thailand, 20/10/2009–22/10/2009)
Publisher: IEEE
Identifiers: 991005543151107891
Murdoch Affiliation: School of Information Technology
Language: English
Resource Type: Conference paper
Note: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
