Wednesday, April 3, 2019
Automatic Encoding Detection And Unicode Conversion Engine Computer Science Essay
In computers, documents are represented using numbers. Initially, encoding schemes were designed to support the English alphabet, which has a limited number of symbols. Later, the requirement for a universal character encoding scheme to support multilingual computing was identified. The solution was a 16-bit encoding scheme to represent a character, so that a much larger character set could be supported. The current Unicode version contains 107,000 characters covering 90 scripts. In the current context, operating systems such as Windows 7 and UNIX-based operating systems, applications such as word processors, and data exchange technologies support this standard, enabling internationalisation in the IT industry. Even though Unicode has become the de facto standard, certain applications can still be seen using proprietary encoding schemes to represent their data. As an example, well-known Sinhalese news sites still do not use Unicode-based fonts to represent their content. This causes issues such as the requirement of downloading proprietary fonts and browser dependencies, making the efforts of the Unicode standard in vain. In addition to the web site content itself, there are collections of data included in documents such as PDFs in non-Unicode fonts, making them difficult to search through search engines unless the search term is entered in that particular font encoding. This has given rise to the requirement of automatically detecting the encoding and converting the text into the Unicode encoding of the corresponding language, so that the problems mentioned are avoided.
In the case of web sites, a browser plug-in implementing automatic non-Unicode to Unicode conversion would eliminate the requirement of downloading legacy fonts, which use proprietary character encodings. Although some web sites provide the source font information, there are certain web applications which do not, making the auto-detection process more difficult. Hence it is required to detect the encoding first, before the content is fed to the conversion process. This has given rise to a research area: auto-detecting the language encoding of a given text based on language characteristics. This problem will be addressed with a statistical language encoding detection mechanism. The technique will be demonstrated with support for all the Sinhalese non-Unicode encodings. The implementation for the demonstration will make sure that it is an extensible solution, so that any other language can be supported based on a future requirement.

Since the beginning of the computer age, many encoding schemes have been created to represent various writing scripts and characters in computerized data. With the advent of globalisation and the development of the Internet, information exchanges crossing both language and regional boundaries are becoming ever more important. However, the existence of multiple coding schemes presents a significant barrier. Unicode has provided a universal coding scheme, but it has not so far replaced existing regional coding schemes, for a variety of reasons. Thus, today's global software applications are required to handle multiple encodings in addition to supporting Unicode.

In computers, characters are encoded as numbers. A typeface is the scheme of letterforms, and the font is the computer file or program which physically embodies the typeface. Legacy fonts use different encoding systems for assigning numbers to characters.
This leads to the fact that two legacy font encodings may define different numbers for the same character. This can conflict with how characters are encoded in other systems and requires maintaining multiple encoding fonts. The requirement of a standard for unique character assignment was satisfied with the introduction of Unicode. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.

Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The latest Unicode version has more than 107,000 characters covering 90 scripts, defined in a set of code charts. The Unicode Consortium co-ordinates Unicode's development, and its goal is to ultimately replace existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes. The standard is supported in many recent technologies, including programming languages and modern operating systems. All W3C recommendations have used Unicode as their document character set since HTML 4.0, and web browsers have supported Unicode, particularly UTF-8, for many years [4, 5].

Sinhala Legacy Font Conversion Requirement for Web Content

Sinhala language usage in computer technology has been present since the 1980s, but the lack of standards in character representation resulted in proprietary fonts. Sinhala was added to Unicode in 1998 with the intent of overcoming the limitations of proprietary character encodings. Dinamina, DinaminaUniWeb, Iskoola Pota, KandyUnicode, KaputaUnicode, Malithi Web and Potha are some Sinhala Unicode fonts which were developed so that the numbers assigned to the characters are the same.
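Because the Unicode assignment is uniform, a Sinhala character is identified by its code point alone, independent of which Unicode font renders it. A small Python check (the Sinhala block occupies U+0D80–U+0DFF):

```python
import unicodedata

# The Sinhala letter "ka" is always U+0D9A in Unicode, regardless of font.
ka = "\u0D9A"
assert unicodedata.name(ka) == "SINHALA LETTER ALPAPRAANA KAYANNA"

# Any Unicode font (Iskoola Pota, KaputaUnicode, ...) renders this same
# code point as "ka"; the number does not vary from font to font, which
# is exactly what legacy fonts fail to guarantee.
print(hex(ord(ka)))  # 0xd9a
```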
Still, some major news sites which display Sinhala content have not adopted the Unicode standards. Legacy font encoding schemes are used instead, causing conflicts in content representation. In order to minimize the problems, font families were created where the shapes of the characters differ but the encoding remains the same. The FM Font Family and DL Font Family are examples where the font family concept is used to group Sinhala fonts with similar encodings [1, 2].

Adoption of non-Unicode encodings causes many compatibility issues when content is viewed in different browsers and operating systems. Operating systems such as Windows Vista and Windows 7 come with Sinhala Unicode support and do not require external fonts to be installed to read Sinhalese script. GNU/Linux distributions such as Debian or Ubuntu also provide Sinhala Unicode support. Enabling non-Unicode applications, especially web content, with support for Unicode fonts will allow users to view content without installing the legacy fonts.

Non Unicode PDF Documents

In addition to the content on the web, there exists a large number of government documents which are in PDF format but whose contents are encoded with legacy fonts. Those documents are not searchable through search engines by entering search terms in Unicode. In order to overcome the problem it is important to convert such documents into a Unicode font so that they are searchable and their data can be used by different applications consistently, irrespective of the font. As another part of the project, this problem will be addressed through a converter tool which creates a Unicode version of an existing PDF document that is currently in a legacy font.

The Problem

Sections 1.3 and 1.4 describe two domains in which non-Unicode to Unicode conversion is required.
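A minimal sketch of such a conversion is shown below. The mapping table here is hypothetical: a real table for a font such as one of the FM or DL family fonts would have to be built from that font's actual encoding.

```python
# Hypothetical legacy-to-Unicode mapping table (illustration only).
# In a real converter, each entry maps a code point used by a legacy
# Sinhala font to the corresponding Unicode Sinhala code point.
LEGACY_TO_UNICODE = {
    "l": "\u0D9A",   # hypothetical: legacy code "l" -> Sinhala letter ka
    "s": "\u0DBD",   # hypothetical: legacy code "s" -> Sinhala letter la
}

def convert(text: str, table: dict[str, str]) -> str:
    """Replace each legacy code with its Unicode counterpart.

    Characters with no mapping are passed through unchanged, so that
    embedded English text and punctuation survive the conversion.
    """
    return "".join(table.get(ch, ch) for ch in text)
```

A real mapping engine additionally needs segmentation and reordering rules, since legacy encodings often store vowel signs in visual order rather than the logical order Unicode requires.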
The conversion involves identifying non-Unicode content and replacing it with the corresponding Unicode content. The content replacement requires a mapping engine, which performs suitable segmentation of the input text and maps it to the corresponding Unicode code points. The mapping engine can perform the mapping task only if it knows the source text encoding. In general, the encoding is specified along with the content so that the mapping engine can read it directly. However, in certain cases the encoding is not specified along with the content. Hence detecting the encoding through an encoding detection engine provides a research area, especially for non-Unicode content. In addition, incorporating the detection engine along with a conversion engine is another part of the problem, to serve the application areas in sections 1.3 and 1.4.

Project Scope

The system will initially be targeted at Sinhala fonts used by local sites. Later the same mechanism will be extended to support other languages and scripts (Tamil, Devanagari).

Deliverables and Outcomes

Web service/plug-in for local-language web site font conversion, which automatically converts website contents from legacy fonts to Unicode.

PDF document conversion tool to convert legacy fonts to Unicode.

In both implementations, the language encoding detection would use the proposed encoding detection mechanism. It can be considered the core of the implementations, in addition to the translation engine which performs the non-Unicode to Unicode mapping.

Literature Review

Character Encodings

Character Encoding Schemes

Encoding refers to the process of representing information in some form. Human language is an encoding system by which information is represented in terms of sequences of lexical units, and those in terms of sound or gesture sequences.
Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up a writing system.

A character encoding is an algorithm for presenting characters in digital form as sequences of octets. There are hundreds of encodings, and many of them have multiple names. There is a standardized procedure for registering an encoding: a primary name is assigned to the encoding, and possibly some alias names. For example, ASCII, US-ASCII, ANSI_X3.4-1986, and ISO646-US are different names for the same encoding. There are also many unregistered encodings and names that are used widely. Character encoding names are not case sensitive; hence ASCII and Ascii are equivalent [25].

Figure 2.1 Character Encoding Example

Single Octet Encodings

When a character repertoire contains at most 256 characters, assigning a number in the range 0–255 to each character and using an octet with that value to represent the character is the simplest and most obvious way. Such encodings, called single-octet or 8-bit encodings, are widely used and will remain important [22].

Multi-Octet Encodings

In multi-octet encodings, more than one octet is used to represent a single character. A simple two-octet encoding is sufficient for a character repertoire that contains at most 65,536 characters, but two-octet schemes are wasteful if the text mostly consists of characters that could be presented in a single-octet encoding. On the other hand, the objective of supporting a universal character set is not achievable with just 65,536 unique codes. Thus, encodings that use a variable number of octets per character are more common.
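The difference between single-octet and variable-width encodings can be observed directly in Python, using ISO-8859-1 (Latin-1) as a single-octet encoding and UTF-8 as a variable-width one:

```python
# Octet counts per character under different encodings.
# ISO-8859-1 is a single-octet encoding; UTF-8 is variable-width.
assert len("é".encode("iso-8859-1")) == 1  # one octet in a single-octet encoding
assert len("é".encode("utf-8")) == 2       # the same character: two octets in UTF-8
assert len("A".encode("utf-8")) == 1       # ASCII characters stay one octet
assert len("ක".encode("utf-8")) == 3       # Sinhala letter ka (U+0D9A): three octets
assert len("𐍈".encode("utf-8")) == 4       # U+10348, outside the BMP: four octets
```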
The most widely used among such encodings is UTF-8 (UTF stands for Unicode Transformation Format), which uses one to four octets per character.

Principles of the Unicode Standard

Unicode is used as the universal encoding standard to encode characters in all living languages. To that end, it follows a set of fundamental principles. The Unicode standard is simple and consistent. It does not depend on states or modes for encoding special characters.

The Unicode standard incorporates the character sets of many existing standards: for example, it includes the Latin-1 character set as its first 256 characters. It includes the repertoires of numerous other corporate, national and international standards as well.

Modern businesses need to handle characters from a wide variety of languages at the same time. With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. Data corruption problems do not occur, since Unicode has a single definition for each character. Because it handles the characters for all the world markets in a uniform way, it avoids the complexities of different character code architectures. All of the modern operating systems, from PCs to mainframes, support Unicode now or are actively developing support for it. The same is true of databases as well.

There are 10 design principles associated with Unicode.

Universality

Unicode is designed to be universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange. Unicode needs to encompass a variety of essentially different collections of characters and writing systems. For example, it cannot postulate that all text is written left to right, or that all letters have uppercase and lowercase forms, or that text can be divided into words separated by spaces or other whitespace.
Efficiency

Software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text. But Unicode's efficiency comes with certain tradeoffs, especially the storage requirement of up to four octets for each character. Certain representation forms, such as the UTF-8 format, require linear processing of the data stream in order to identify characters. Unicode also contains a large number of characters and features that have been included only for compatibility with other standards. This may require preprocessing that deals with compatibility characters and with different Unicode representations of the same character (e.g., a letter as a single character or as two characters).

Characters, Not Glyphs

Unicode assigns code points to characters as abstractions, not to visual appearances. A character in Unicode represents an abstract concept rather than its manifestation as a particular form or glyph. As shown in Figure 2.2, the glyphs of many fonts that render the Latin character "a" all correspond to the same abstract character "a".

Figure 2.2 Abstract Latin Letter a and Style Variants

Another example is the Arabic presentation forms. An Arabic character may be written in up to four different forms. Figure 2.3 shows an Arabic character written in its isolated form, and at the beginning, in the middle, and at the end of a word. According to the design principle of encoding abstract characters, these presentation variants are all represented by one Unicode character.

Figure 2.3 Arabic character with four representations

The relationship between characters and glyphs is rather simple for languages like English: mostly, each character is presented by one glyph, taken from a font that has been chosen.
For other languages, the relationship can be much more complex, routinely combining several characters into one glyph.

Semantics

Characters have well-defined meanings. When the Unicode standard refers to semantics, it often means the properties of characters, such as spacing, combinability, and directionality, rather than what the character really means.

Plain Text

Unicode deals with plain text, i.e., strings of characters without formatting or structuring information (except for things like line breaks).

Logical Order

The default representation of Unicode data uses the logical order of the data, as opposed to approaches that handle writing direction by changing the order of characters.

Unification

The principle of uniqueness was also applied in deciding that certain characters should not be encoded separately. Unicode encodes duplicates of a character as a single code point if they belong to the same script but different languages. For example, the letter denoting a particular vowel sound in German is treated as the same as the corresponding letter in Spanish.

The Unicode standard uses Han unification to consolidate Chinese, Korean, and Japanese ideographs. Han unification is the process of assigning the same code point to characters historically perceived as being the same character but represented as unique in more than one East Asian ideographic character standard. This results in a group of ideographs shared by several cultures and significantly reduces the number of code points needed to encode them. The Unicode Consortium chose to represent shared ideographs only once because the goal of the Unicode standard was to encode characters independently of the languages that use them. Unicode makes no distinctions based on pronunciation or meaning; higher-level operating systems and applications must take that responsibility.
Through Han unification, Unicode assigned about 21,000 code points to ideographic characters instead of the 120,000 that would be required if the Asian languages were treated separately. It is true that the same character might look slightly different in Chinese than in Japanese, but that difference in appearance is a font issue, not a uniqueness issue.

Figure 2.4 Han Unification example

The Unicode standard allows for character composition in creating marked characters. It encodes each character and diacritic or vowel mark separately, and allows the characters to be combined to create a marked character. It also provides single codes for marked characters when necessary to comply with preexisting character standards.

Dynamic Composition

Characters with diacritic marks can be composed dynamically, using characters designated as combining marks.

Equivalent Sequences

Unicode has a large number of characters that are precomposed forms. They have decompositions that are declared to be equivalent to the precomposed form. An application may still treat the precomposed form and the decomposition differently, since as strings of encoded characters they are distinct.

Convertibility

Character data can be accurately converted between Unicode and other character standards and specifications.

South Asian Scripts

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas, in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is often marked with a special sign [17]. Another convention is preferred in some languages.
As an example, in Hindi the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with innumerable changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature.

The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati languages, though it was also the source of scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha.

The South Indian scripts are also derived from Brahmi and, therefore, share many structural characteristics. These scripts were first used to write Pali and Sanskrit but were later adapted for writing non-Indo-European languages, including the Dravidian family of southern India and Sri Lanka.

Sinhala Language

Characteristics of Sinhala

The Sinhala script, also known as Sinhalese, is used to write the Sinhala language, the majority language of Sri Lanka.
It is also used to write the Pali and Sanskrit languages. The script is a descendant of Brahmi and resembles the scripts of South India in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently [23].

Figure 2.5 Example for prenasalized stop in Sinhala

In addition, Sinhala has separate distinct signs for both a short and a long low front vowel sounding similar to the initial vowel of the English word "apple", usually represented in IPA as U+00E6 latin small letter ae (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88.

Because of these extra letters, the encoding for Sinhala does not exactly follow the pattern established for the other Indic scripts (for example, Devanagari). It does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA sinhala sign al-lakuna) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures.

Sinhala Writing System

The Sinhala writing system can be called an abugida, as each consonant has an inherent vowel (/a/), which can be changed with the different vowel signs. Thus, for example, the basic form of the letter k is ka. For ki, a small arch is placed over the letter; this replaces the inherent /a/ by /i/. It is also possible to have no vowel following a consonant. In order to produce such a pure consonant, a special marker, the hal kirima, has to be added. This marker suppresses the inherent vowel.

Figure 2.6 Character Associative Symbols in Sinhala

Historical Symbols: Neither U+0DF4 sinhala punctuation kunddaliya nor the Sinhala numerals are in general use today, having been replaced by Western-style punctuation and Western digits. The kunddaliya was formerly used as a full stop or period.
It is included for scholarly use. The Sinhala numerals are not presently encoded.

Sinhala and Unicode

In 1997, Sri Lanka submitted a proposal for the Sinhala character code at the Unicode working group meeting in Crete, Greece. This proposal competed with proposals from the UK, Ireland and the USA. The Sri Lankan draft was in the end accepted with slight modifications. This was ratified at the 1998 meeting of the working group held at Seattle, USA, and the Sinhala code chart was included in Unicode Version 3.0 [2].

It has been suggested by the Unicode Consortium that ZWJ and ZWNJ should be introduced in orthographic languages like Sinhala to achieve the following:

1. ZWJ joins two or more consonants to form a single unit (conjunct consonants).
2. ZWJ can also alter the shape of preceding consonants (cursiveness of the consonant).
3. ZWNJ can be used to disjoin a single ligature into two or more units.

Encoding Auto Detection

Browser and Auto-Detection

In designing auto-detection algorithms to detect encodings in web pages, the following assumptions about the input data are needed [24]:

Input text is composed of words/sentences readable to readers of a particular language.

Input text is from typical web pages on the Internet and is not in an ancient dead language.

The input text may contain extraneous noise which has no relation to its encoding, e.g. HTML tags, non-native words (e.g. English words in Chinese documents), space and other format/control characters.

Methods of Auto Detection

The paper [24] discusses 3 different methods for detecting the encoding of text data.

Coding Scheme Method

In any of the multi-byte coding schemes, not all possible code points are used. If an illegal byte or byte sequence (i.e. an unused code point) is encountered when verifying a certain encoding, it is possible to immediately conclude that this is not the right guess.
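That rejection test can be sketched as a small verifier for one encoding. The following is a simplified UTF-8 checker (it does not reject every overlong or surrogate sequence, so it is a sketch rather than a full validator); a parallel detector would run one such machine per candidate encoding:

```python
def utf8_state_machine(data: bytes) -> bool:
    """Return False as soon as a byte sequence illegal in UTF-8 is seen.

    Simplified state machine: tracks only how many continuation bytes
    are still expected after a lead byte.
    """
    pending = 0  # continuation bytes still expected
    for b in data:
        if pending:
            if 0x80 <= b <= 0xBF:      # valid continuation byte
                pending -= 1
            else:
                return False           # illegal sequence: reject encoding
        elif b <= 0x7F:                # ASCII: always legal
            continue
        elif 0xC2 <= b <= 0xDF:        # lead byte of a 2-octet sequence
            pending = 1
        elif 0xE0 <= b <= 0xEF:        # lead byte of a 3-octet sequence
            pending = 2
        elif 0xF0 <= b <= 0xF4:        # lead byte of a 4-octet sequence
            pending = 3
        else:                          # 0x80-0xC1, 0xF5-0xFF: never legal leads
            return False
    return pending == 0                # reject truncated sequences too
```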
An efficient algorithm for detecting the character set using the coding scheme through a parallel state machine is discussed in the paper [24]. For each coding scheme, a state machine is implemented to verify a byte sequence for that particular encoding. For each byte the detector receives, it feeds that byte to every active state machine available, one byte at a time. Each state machine changes its state based on its previous state and the byte it receives. In a typical example, one state machine will eventually provide a positive answer and all others will provide a negative answer.

Character Distribution Method

In any given language, some characters are used more often than others. This fact can be used to devise a data model for each language script. It is particularly useful for languages with a large number of characters, such as Chinese, Japanese and Korean. Tests were carried out with data for simplified Chinese encoded in GB2312, traditional Chinese encoded in Big5, Japanese and Korean. It was observed that a rather small set of code points covers a significant percentage of the characters used. A parameter called the Distribution Ratio was defined and used for the purpose of separating the two encodings:

Distribution Ratio = the number of occurrences of the 512 most frequently used characters divided by the number of occurrences of the rest of the characters.

Two-Char Sequence Distribution Method

In languages that use only a small number of characters, we need to go further than counting the occurrences of each single character. Combinations of characters reveal more language-characteristic information. A 2-char sequence is defined as 2 characters appearing immediately one after another in the input text, and the order is significant in this case.
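Collecting these order-significant pairs is straightforward; a sketch of the counting step that underlies both the Character Distribution and Two-Char Sequence Distribution methods:

```python
from collections import Counter

def two_char_sequences(text: str) -> Counter:
    """Count 2-char sequences: adjacent, order-significant character pairs."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def distribution_ratio(text: str, frequent: set[str]) -> float:
    """Distribution Ratio as defined above: occurrences of the most
    frequent characters divided by occurrences of all other characters.

    `frequent` stands in for the language model (the 512 characters seen
    most often in a training corpus for one encoding).
    """
    counts = Counter(text)
    top = sum(n for ch, n in counts.items() if ch in frequent)
    rest = sum(n for ch, n in counts.items() if ch not in frequent)
    return top / rest if rest else float("inf")
```

A detector would compare the ratio (or the pair counts) computed from the input against per-encoding models and pick the closest match.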
Just as not all characters are used equally frequently in a language, the 2-char sequence distribution also turns out to be extremely language- and encoding-dependent.

Current Approaches to Solving Encoding Problems

Siyabas Script

SiyabasScript is an attempt to develop a browser plugin which solves the problem of legacy fonts being used in Sinhala news sites [6]. It is an add-on to the Mozilla Firefox and Google Chrome web browsers. This solution was specifically designed for a limited number of target web sites which use specific fonts. The solution had the limitation of having to reengineer the plug-in whenever a new version of the browser is released. The solution was also not general, since it did not have the ability to support a new site using a Sinhala legacy font. In order to overcome that, the proposed solution will identify the fonts and encodings based on the content, not on the site. There is also a chance that SiyabasScript will not work if a site decides to adopt another legacy font, as it cannot detect encoding scheme changes. In addition, there is a significant delay in the conversion process: the user notices the content displayed in legacy-font characters before they are converted to Unicode. This performance delay can be identified as another area to improve. Finally, the conversion process does not provide an exact conversion, especially when characters need to be combined in Unicode; several Sinhala words with combining characters can be mentioned as examples of such conversion issues.

The plug-in supports Sinhala Unicode conversion for the sites www.lankadeepa.lk, www.lankaenews.com and www.lankascreen.com, but the other websites mentioned in the paper do not get properly converted to Sinhala with Firefox version 3.5.17.

Aksharamukha Asian Script Converter

Aksharamukha is a South and South-East Asian script converter tool. It supports transliteration between Brahmi-derived Asian scripts.
It also has the functionality to transliterate web pages from Indic scripts to other scripts. The converter scrapes the HTML page, transliterates the Indic script content and displays the resulting HTML. There are certain issues in the tool when it comes to alignment with the original web page: misalignments, missing images and unconverted hyperlinks are some of them.

Figure 2.7 Aksharamukha Asian Script Converter

Corpus-based Sinhala Lexicon

The lexicon of a language is its vocabulary, including higher-order constructs such as words and expressions. It can be used as a supporting tool in detecting the encoding of a given text. The corpus-based Sinhala lexicon has nearly 35,000 entries, based on a corpus consisting of 10 million words from diverse genres such as technical writing, creative writing and news reportage [7, 9]. The text distribution across genres is given in Table 2.1.

Table 2.1 Distribution of Words across Genres [7]

Genre               Number of words    Percentage of words
Creative Writing    2,340,999          23%
Technical Writing   4,357,680          43%
News Reportage      3,433,772          34%

N-gram-based Language, Script, and Encoding Scheme Detection

An N-gram is a sequence of N characters, and N-gram analysis is a well-established technique for classifying the language of text documents. The method detects the language, script, and encoding scheme of a target text document by checking how many byte sequences of the target match the byte sequences that can appear in texts belonging to a given language, script, and encoding scheme. N-grams are extracted from a string, or a document, by a sliding window that shifts one character at a time.

Sinhala Enabled Mobile Browser for J2ME Phones

Mobile phone usage is rapidly increasing throughout the world as well as in Sri Lanka. The mobile phone has become the most ubiquitous communication device. Accessing the internet through the mobile phone has become a common activity, especially for messaging and news items. On J2ME enabled phones, Sinhala Unicode support is yet to be developed.
They do not allow installation of external fonts. Hence those devices will not be able to display Unicode content, especially on the web, until Unicode is supported by the platform. Integrating Unicode viewing support will provide a good opportunity to carry the technology to remote areas, since content can be presented in the native language. If this is facilitated, then in addition to the urban crowd, people from rural areas will be able to subscribe to a daily newspaper on their mobile. One major advantage of such an application is that it provides a phone-model-independent solution which supports any Java enabled phone. Cillion is a mini browser which shows Unicode content on J2ME phones. It is an application developed with the fonts integrated.