Proposing a tokenizer for Farsi words using regular expressions (the paper was originally written in Persian)
In the 5th International Conference on Electrical Engineering and Computer with emphasis on indigenous knowledge, 2017
This abstract is translated from the original abstract of the paper, written in Persian: This paper presents a word tokenizer that uses regular expressions to split a given text into words. The tokenizer is built on the concept of replaceability in regular expressions. The proposed method accurately recognizes and processes Farsi words, English words, symbols, and other special expressions. The algorithm identifies and isolates words while keeping track of how often each one occurs. The output of the system therefore includes the processed text, the total word count including repetitions (Words), the number of distinct words (Vocabulary), and a list of each word alongside its frequency of occurrence, sorted both alphabetically and by frequency to give a concise summary of the processed text. Tokenization is a crucial step in natural language processing applications, and the proposed tokenizer offers an effective and adaptable solution.
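For illustration only, the following minimal Python sketch shows how a regex-based tokenizer with the described outputs (Words, Vocabulary, and a frequency list) could be put together. The character ranges, token classes, and the summarize helper are assumptions made for this sketch; they do not reproduce the paper's actual patterns or its replaceability-based algorithm.

```python
import re
from collections import Counter

# Assumed token pattern (not from the paper): Farsi words are runs of
# characters in the Arabic Unicode block plus the zero-width non-joiner,
# English words are ASCII letter runs, numbers are digit runs, and any
# remaining non-space character is treated as a standalone symbol.
TOKEN_PATTERN = re.compile(
    r"[\u0600-\u06FF\u200C]+"   # Farsi word (includes ZWNJ)
    r"|[A-Za-z]+"               # English word
    r"|\d+"                     # number
    r"|\S"                      # any other single symbol
)

def tokenize(text):
    """Split the text into Farsi words, English words, numbers, and symbols."""
    return TOKEN_PATTERN.findall(text)

def summarize(text):
    """Return the token count (Words), distinct-token count (Vocabulary),
    and a frequency list sorted by descending frequency, then alphabetically."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    freq_list = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"Words": len(tokens), "Vocabulary": len(freq), "Frequencies": freq_list}

if __name__ == "__main__":
    sample = "زبان فارسی زیباست. Persian is beautiful!"
    print(summarize(sample))
```

Running the sketch on the sample sentence prints the total and distinct token counts together with the per-token frequencies, mirroring the kind of summary the paper describes as its output.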