
Eight Bit Storage, Screen Resolution and the Emoji

Just within the last few years emojis have become endemic, pandemic some might say, especially on social media services and internet enabled wireless telephones. What is an emoji though? These days (Unicode 9.0) it is a list of 1297 standardized codes for pictographs that depict human faces, human figures and common objects. The word emoji looks like it comes from the English word "emotion," but it is actually Japanese: e (picture) plus moji (character), and the Japanese origin of many of the emojis is quite evident. The pictures themselves are anywhere from 12x12 pixels to 144x144 pixels in size and differ from platform to platform. How the pictures are embedded within text also varies somewhat from platform to platform.

Eight Bit Storage
Character Height
Unicode
The Japanese Emoji
The Modern Emoji
Embedded Symbols



Eight Bit Storage

Understanding where this whole phenomenon of pictures embedded in text comes from requires understanding why eight bit storage is universally used in all computer systems. Six bit storage just can't work: the 64 possible characters get almost entirely used up by the upper case letters, the lower case letters and the ten numerals, leaving just two free positions for punctuation marks. Using six bit storage would mean that the period (.) and the comma (,) would be the only available punctuation marks. The question mark (?), the exclamation mark (!), the colon (:), the semi-colon (;), the apostrophe ('), the quotation marks (") and the parentheses would all be left out.
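
The arithmetic here is easy to check. Here is a quick sketch in Python (the language is chosen just for illustration) tallying what each storage width offers:

    # Characters available at each storage width
    for bits in (6, 7, 8):
        print(bits, "bits:", 2 ** bits, "possible characters")

    # Letters and numerals alone use up 26 + 26 + 10 positions,
    print(26 + 26 + 10)   # 62, leaving just 2 of the 64 six bit positions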

Seven bit storage would work, but instead the standard has been to go up to eight bit storage for several reasons. One is that seven is just such an odd number; eight bit storage seems much more even and uniform because the number of bits adds up in nice recognizable patterns. The other is that seven bit storage means that only part of the Greek alphabet will fit. The 128 characters possible with seven bit storage are plenty for the upper and lower case alphabets, the numerals, the punctuation marks and the standard special characters (@, #, $, %, ^, &, *, (, ), _, -, +, =, [, ], {, }, ~, :, <, >, \, |, /). There is also room in seven bit storage for some special characters from the continental European languages such as the Spanish enye (ñ), the German umlaut vowels (Ä, ä, Ö, ö, Ü, ü), the German eszett (ß) and the French accented vowels (à, â, É, é, è, ê, Î, î, ô, û). The enye and the eszett are each just one additional character and the umlaut vowels are six more. The French accented vowels are ten more characters, but most are so rarely used that the rationality of their inclusion in an international character set might be questioned. This comes to a total of 116 characters with the space and return, leaving only 12 positions for a partial Greek alphabet and the subscript numbers. Just having some of the common Greek characters is useful, and the original IBM "high bit" character set did include only 12 of the 24 Greek letters. The reality though is that there is just too much competition for the 128 characters available with seven bit storage.

One of the French accented vowels, the Î, is so unique that it gets used in just one French word: l'Île. It might then be said that the Î is in fact a symbol for "island", since that character does not get used for anything other than to indicate an island. The Î may be used in only one word, but Île is an important and frequently used word. There is now also the 🏝 emoji, which perhaps rather appropriately is called the "Desert Island" to differentiate it from more substantial islands such as l'Île de Wight, l'Île du Levant et les Îles Britanniques. The û and â are more obscure yet, showing up in only a very few rather infrequently used French words, and there are in fact even more obscure characters that are used in written French. The ligatures (œ and æ) are good examples. These are not considered absolutely necessary for writing in French, but the œ symbol does in fact sometimes get used for certain words such as œufs and cœur. The letter C with a cedilla (ç) also gets used in place of the letter C in a few French words such as façade and façon. The German umlaut vowels and the German eszett on the other hand all see quite frequent use and are also entirely optional. In fact in the Swiss variant of German "Rechtschreibung" the eszett is universally replaced with "ss". Similarly any umlaut can be replaced with an "e" following the vowel, so Äpfel becomes Aepfel. Being able to use the umlaut is however important because words are not otherwise as instantly recognizable. In Spanish the enye (ñ) is considered indispensable.

Going up to eight bit storage allows 256 characters so there is plenty of room for the entire Greek alphabet, and eight bit storage takes up less than 15% more space than seven bit storage. Even with the entire upper and lower case Greek alphabet though that is only 158 characters. What to do with the remaining 98 characters? It is this remaining space in the character set that has been used for additional special symbols and some pictographs. Just what additional special characters are available has varied from font to font and from platform to platform. Common additional symbols are the degree symbol (°), approximately equal to (≈), less than or equal to (≤), greater than or equal to (≥), the square root symbol (√), the copyright symbol (©), the registered trademark symbol (®) and the superscript trademark symbol (™). Both subscript (₀₁₂₃₄₅₆₇₈₉) and superscript numbers (⁰¹²³⁴⁵⁶⁷⁸⁹) are also very useful and are often included in expanded character sets.

The irony is that even though eight bit storage has been universal, a well designed universal eight bit character set has never come into being. The IBM high bit character set included 49 box drawing symbols and more than 20 other entirely useless (ë, Ç, å, ï, Å, ÿ, ƒ, Ñ, ≡, ⌠, ⌡) and marginally useless (æ, Æ, ½, ¼, ∩, ∈, «, ») symbols, so that the rest of the Greek alphabet, the subscript numbers and many other useful standard symbols (Δ, ℄, ≠, ∫ etc.) were omitted. A few of the symbols included in the IBM high bit set are quite obscure: a small lower case letter A with a line under it and a small lower case letter O with a line under it, which turn out to be the feminine and masculine ordinal indicators (ª and º), as well as a mysterious "Pt" ligature, which turns out to be the peseta currency sign (₧). Of the marginally useless symbols in the IBM high bit character set the element symbol (∈) and the intersection symbol (∩) certainly might be of some use in set theory notation. For these symbols to be of any real use though their complements, the not an element symbol (∉) and the union symbol (∪), would also need to be included.

Not surprisingly the original IBM high bit character set has long since fallen out of common use. The newer eight bit character sets are ISO 8859-1 and its more popular variant, Windows-1252. These new eight bit character sets would appear to be an improvement over the IBM high bit character set in that all of the useless box drawing characters have been removed. This is however a dubious improvement since most of the new characters are themselves useless. Instead of the partial Greek alphabet of the IBM high bit set the new character set has no Greek letters save for one lone mu (μ), which is rebadged as the "Micro sign". The subscript numbers are also still missing from the new character set. The new character set does have the copyright symbol (©), the registered trademark symbol (®) and superscript one, two and three numerals (¹²³) which were not included in the IBM high bit set, and the Windows-1252 variant adds the Euro sign (€) and the trademark symbol (™). Missing from the new character set are pi (π), sigma (Σ and σ), gamma (γ), phi (φ), theta (Θ), omega (Ω) and delta (δ) as well as the less than or equal to (≤) and greater than or equal to (≥) symbols, the infinity sign (∞) and the slashed zero (⌀) which were all present in the IBM high bit character set. What then takes up all of those 256 positions in the new eight bit character set? A whole bunch of accented characters that are not used in French. The new character set is an improvement in that it allows all of the European languages written in Latin script to be represented with all of the requisite special characters, but other very common symbols are still missing. All six of the German umlaut vowel characters are present, as are the eszett and the enye. All ten of the French accented vowels are present, as are the upper case versions that are never used in normal writing. What is also present is a bunch of characters that are either totally useless or of use only for certain other European languages written in the Latin alphabet.

Some of the useless symbols in the ISO 8859-1 and Windows-1252 character sets are: Á, Ã, Ë, Ì, Í, Ï, Ð, Ò, Ó, Õ, Ù, Ú, Ý, Þ, á, ã, ë, ì, í, ï, ð, ò, ó, õ, ù, ú, ý, þ, ÿ, Š, š and Ž (the last three from Windows-1252 only).

Character Height

The main reason that the potentially useful additional symbols have not been more universally available across all computer systems has to do with screen resolution and character height. Text is often displayed at a height of just five or seven pixels on lower resolution computer screens and electronic readouts. A height of just five pixels is sufficient to display the upper case letters and the numerals, but a combination of upper case and lower case letters requires a character height of seven pixels. Seven pixel character height also allows some of the symbols to display reasonably well, but a character height of around 10 or 12 pixels is more typical for displaying a wide range of symbols and some pictographs. Displays that operate at character heights of five or seven pixels generally cannot use the expanded character sets at all. Because lower resolution displays were so common for many decades the expanded character sets tended to be undesirable from the perspective of cross platform compatibility. Even though eight bit storage has been nearly universal the standard character sets have not been particularly complete or useful. Still, with the potential for 256 characters imaginations have run wild. And yes, the glyph that the IBM high bit character set assigned to the very first ASCII control code is a smiley face, followed by the symbols for the suits of cards and a few other symbols.

These symbolic characters just have not gotten used much, largely because they do not display well on very low resolution text displays. The first 32 ASCII characters are control codes that have hardly ever actually been used as printed characters, which is why ASCII characters 32 through 126 are referred to as the "printable ASCII characters". It has been a situation of the presumed possibility of larger character sets without much in the way of universal availability of additional symbols or pictographs. A substitute of sorts was ASCII art, or the "emoticon". The word emoticon is a blend of "emotion" and "icon": an emoticon is an icon that expresses emotion. The most well known of these of course is the smiley :-), but dozens of others as bizarre as :'-) for crying and =0:] to represent Bill Clinton have been used over the decades. The popularity of the smiley and its variations certainly indicates some widespread desire for a richer character set.

Unicode

As early as 1987 Apple and Xerox proposed a 16 bit character set to include all known current and historic linguistic symbols. In 1991 the formation of the Unicode Consortium made this a reality, and in 1996 the "surrogate pair" mechanism was added to increase the character set to over a million individual positions. As if the 65,000 individual positions of the original 16 bit Unicode were somehow insufficient. When Unicode characters are used it does not necessarily mean that the entire text is going to be stored with 16 bit storage. Eight bit storage can still be used if the program that is displaying the text can substitute the Unicode character for its unique identifier written out in ASCII text.
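
The surrogate mechanism can be seen directly in any modern scripting language. Here is a minimal sketch in Python, using the weary cat face as the example character:

    # Characters beyond the original 65,536 positions are stored in
    # 16 bit Unicode (UTF-16) as a pair of "surrogate" code units.
    cat = "\U0001F640"                     # WEARY CAT FACE, U+1F640
    print(cat.encode("utf-16-be").hex())   # d83dde40, two surrogate units
    euro = "\u20AC"                        # EURO SIGN, U+20AC
    print(euro.encode("utf-16-be").hex())  # 20ac, a single 16 bit unit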

Not only did this standard expanded character set allow the use of many different languages, it also provided a mechanism to embed many more symbols and pictographs into text. The main use of the additional symbols was mathematical operators and other technical symbols. The Unicode standard allowed cross platform compatibility of symbols embedded in text. At least it was supposed to. The reality was that embedded symbols were very often available only in certain software packages, and moving text from one platform to another while preserving the embedded symbols was not always easy.

The other limitation of the expanded character sets made possible by Unicode was that they were focused on linguistic characters and special symbols used in science and mathematics. This was very useful for many purposes, but speakers of only English with little technical knowledge were left out.

Another major development of Unicode is UTF-8. UTF-8 is an eight bit encoding that is backward compatible with the original 95 printable ASCII characters, but also provides more seamless support for embedded characters. Basically what UTF-8 does is allow ASCII text to be used just as it always has been, while characters from all of Unicode can be embedded without the use of a special character to indicate that a Unicode character follows. One significant thing this does is save quite a bit of file size when lots of Unicode characters are embedded within a text. Not only is the special character to denote that a Unicode character follows not required, but the code for the Unicode character itself can be rather short: just two eight bit bytes for the first 1920 Unicode characters beyond ASCII.
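
This is easy to demonstrate. A short sketch in Python, chosen here just for illustration, shows how many bytes UTF-8 actually uses for some characters mentioned in this article:

    # UTF-8 byte counts: ASCII stays one byte, the next 1920 Unicode
    # characters take two bytes, and so on up to four bytes.
    for ch in ("A", "¢", "Δ", "ᑎ", "🏝"):
        print(ch, len(ch.encode("utf-8")), "bytes")
    # A 1, ¢ 2, Δ 2, ᑎ 3, 🏝 4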

The other big improvement with UTF-8 is that the first few bits of each byte indicate how long the character's code is and whether the byte starts a character or continues one. What this does is allow Unicode characters to be embedded in UTF-8 text without any misinterpretation of the characters that follow. When embedding Unicode character codes into plain ASCII text, by contrast, it is sometimes necessary to follow the code for a Unicode character with the Unicode code for the next ASCII character as well. Façade for example becomes Fa䠚de unless the code for the C-cedilla is followed by the Unicode code for the letter a.
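
The length indication is visible in the raw bits. A minimal sketch, again in Python:

    # The leading bits of each UTF-8 byte identify it: 0xxxxxxx is plain
    # ASCII, 110xxxxx starts a two byte character, 1110xxxx starts a
    # three byte character, and 10xxxxxx only ever continues a character.
    for byte in "Façade".encode("utf-8"):
        print(f"{byte:08b}")
    # The ç comes out as 11000011 10100111, so a decoder can never
    # mistake the middle of one character for the start of the next.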

The limitation of UTF-8 is that it is difficult to type directly. When typing code directly in a programming language generally all you have to work with are the 95 printable ASCII characters. UTF-8 requires other unique byte sequences for the insertion of Unicode characters, and those byte sequences don't show up on the keyboard. The codes for Unicode characters are normally expressed in hexadecimal form, and likewise there is always a hexadecimal equivalent for the UTF-8 byte sequence. The only real problem with using the hexadecimal form of the UTF-8 code is that it ends up being an unnecessarily long string of ASCII characters. The A2 code for the cent symbol (¢) becomes C2A2 in hexadecimal UTF-8 form and the longer 144E code for ᑎ becomes E1918E in hexadecimal UTF-8 form.
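
Both of those conversions can be verified in one line each; a quick Python check:

    # Verify the hexadecimal UTF-8 forms quoted above.
    print("\u00A2".encode("utf-8").hex().upper())  # C2A2, the cent sign
    print("\u144E".encode("utf-8").hex().upper())  # E1918E, the character ᑎ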

One of the interesting consequences of this is that any complete eight bit character set would also be backward compatible with UTF-8. Each code for a Unicode character in UTF-8 would also have an equivalent two, three or four character code in any complete eight bit character set. So the Unicode characters really are universal: they work in most existing computer systems and they would work in any likely future computer system. UTF-8 might not be terribly easy to use directly, but it is a bridge from ASCII based code to any complete eight bit character set that may be created in the future.

The problem with this type of backward compatibility is that all 256 characters in the hypothetical new eight bit character set would have to be directly accessible for ¢ to be represented with just two characters. Just as hexadecimal notation is the solution for typing UTF-8 codes into ASCII text, the hexadecimal form would likely be the easiest way to enter UTF-8 codes using a complete eight bit character set. In any case the streamlined embedded character handling of UTF-8 comes at the expense of the two character A2 code being expanded into either a four character hexadecimal representation (C2A2) or a two character code that requires actually using all 256 characters of an eight bit character set.

Any way it is looked at, a complete eight bit character set would be a significant advantage. The trade off of UTF-8, somewhat more cumbersome keyboard entry of isolated embedded symbols, does however always remain.

The Japanese Emoji

The origin of the modern emoji goes back to late 1990s Japan, when Shigetaka Kurita of the telecom company NTT DoCoMo drew a set of 176 pictographs representing facial expressions and common objects. These original pictographs were 12x12 pixels, so they were able to display fairly well on the mobile phone and pager screens of the day. For many years these additional characters were used mostly only in Japan, but in 2010 the Japanese emoji and some new ones were incorporated into the Unicode standard, and adoption by Apple and other mobile phone and personal computer platforms soon followed.

The Japanese legacy is preserved in the current prevalence of Japanese specific emoji.

The Modern Emoji

As of Unicode 9.0 there are 1297 characters in the standard that are considered to be emoji. Many of them are variations on the smiley face. The official list of emoji from the Unicode Consortium (http://www.unicode.org) starts with 119 faces, all of them really just variations of the original smiley. The rest of the list is made up of images of animals, foods, vehicles, common objects, general symbols and the flags of 255 countries and territories (including Antarctica and the European Union as well as regional flags such as Gibraltar, the Isle of Man and the Canary Islands). The variety of symbols is staggering. There is both a bath tub and a bath tub with a person in it. There is an hour glass and an hour glass with sand falling. There are four types of clocks: a wristwatch, a stopwatch, a timer clock and a mantelpiece clock.

What is missing are mathematical and technical symbols. But math and science are not what emojis are about. They are not symbols so much as pictures that represent an idea. Most of the ideas are emotional in nature, and even the images of objects are often stylized representations that indicate a certain concept about an object or a class of objects as opposed to something concrete. The definitions of what the emojis mean are fluid. The official list from the Unicode Consortium includes some annotations to help clarify what each image depicts, but these are not definitions of what the symbol actually means. Creative use of emojis is a big part of their popularity. The "Water Wave" emoji 🌊 and the "Ferris Wheel" emoji 🎡 put together as 🌊 🎡 might be taken to represent either a water wheel or a wave power generation system.

It is of interest that although the 1297 codes that represent the emojis are standardized, the pictures that those codes call up vary from system to system. Most notably emojis show up as black and white images in many web browsers, while the newer phone operating systems from Apple, Google (Android) and Microsoft as well as some web browsers represent emojis with color images. Some websites even use their own emoji images which are independent of the web browser being used. The differences between these various platform specific images are in some cases quite dramatic.

A good example is the "Loudly Crying Face" 😭. The more or less standard black and white emoji has tear drop shaped tears, whereas the color emojis from Apple and Twitter are depicted with blue streams of water flowing freely from closed eyes. The color emojis from Google and Microsoft use the tear drop shaped individual tears. All of these images may be intended to convey the same idea, but the appearance is dramatically different. Someone not already familiar with these variations would likely think that the images from Apple and Google were two entirely different emojis.

Another example of significant variation from platform to platform is the "Weary Cat Face" emoji 🙀. Older web browsers might use a simple black and white outline of a cat face with an open mouth and slanted closed eyes.

Color representations from other vendors are similar, although they do look considerably more cat like. It is in the color images from Apple and Twitter that this emoji gets confusing. The raised hands (paws) and open eyes look like alarm or surprise, a very different concept. Again, someone not already familiar with these variations would likely think that the image from Twitter was an entirely different emoji from the closed eye weary cat face.

Then there are just some strange things about the emoji. There are emoji for various types of shoes: the man's shoe, the athletic shoe, the high heeled shoe, the woman's sandal and the woman's boots. What is missing though are generic sandals. The same goes for hats. There is a woman's hat and a top hat but no ball cap; the most common type of hat is conspicuously absent. There is a bicycle, a cyclist, a mountain biker and a street motorcycle. The dirt bike is however not represented.

The Japanese influence on the modern emoji is often described in terms of the many Japanese foods depicted. A bento lunch box, rice ball, dango skewer and a fish cake all are very Japanese sounding food items. The Japanese influence continues in other parts of the emoji list as well. Mount Fuji is depicted, and an image looking much like a miniature Eiffel Tower is actually the Tokyo Tower. The tanabata tree, mahjong tile, pagoda, moon viewing ceremony, Japanese dolls, izakaya lantern and even the Japanese Ogre all seem very Japan specific.

As far as the images themselves go it is the Apple images that are generally considered to be of the best quality. The detail and realism of many of the Apple emoji images is far superior to other image sets. Particularly in the foods and animals the Apple images stand out both as more pleasing to look at as well as more effective in conveying the message of the emoji.

In some cases though the Apple emoji images are vague and do not look much like the emoji they are intended to represent. The Apple pig nose looks more like a European electrical outlet than a part of an animal; other images for the pig nose emoji are much more convincing. The high speed train looks much more like the space shuttle than a train. The articulated lorry (semi) looks nothing like a semi-trailer and looks entirely like a three axle box truck. The hour glass image from Apple is a nice rendering, but both variations show sand flowing. The titles of the two hour glass emojis are "hourglass" and "hourglass with flowing sand", and the other platforms' images of the plain "hourglass" show all the sand at the bottom.

The Apple image for the cyclone emoji is also problematic. All of the other images for the cyclone emoji are a double spiral, which is widely considered the universal symbol for a cyclone. Apple on the other hand uses a very mundane single spiral which would be very difficult to identify as the cyclone emoji upon first viewing. For the three button mouse (pointing device) emoji the other images include three buttons and a cord. The Apple image of a three button mouse is just a white obelisk with no buttons and no cord. Pointing devices like this have been available over the years to be sure, but the image itself is not readily identifiable as any type of pointing device. The trackball emoji is similarly well represented by other images, where the Apple image for the trackball looks much more like a small blue crystal ball.

There are separate emoji for an optical disk and a DVD. Other image sets differentiate the two with the letters DVD under the image of the DVD disk, but the Apple images are identical other than a slight color variation. Again it would be necessary to know in advance that Apple represents the DVD emoji with a slightly more golden color tone and the generic optical disk with a slightly more silver color tone. The Apple image for the nut and bolt emoji very confusingly depicts a lag bolt with a nut on it. Anybody knows that lag bolts don't use nuts, and this image is amazingly troubling in its failure to accurately depict reality.

The images from Apple which look so appealing are still only 72x72 pixels, which is actually considerably smaller than the 128x128 and 144x144 pixel images from some other companies. The detailed Apple images are larger files though, taking up around 12 kilobytes to the 4 kilobytes of the less detailed color images. The larger 12KB images are not really any kind of a problem though, as the entire 1297 character font still comes to only about 16 megabytes, extremely small compared to the gigabyte size software packages common in modern computing. As with any font the images themselves are not transferred over an internet connection each time an emoji is used. It is not the 12KB or 4KB image that is transferred, but rather the short character code. The code used to specify the Unicode code space for each emoji is somewhat cumbersome, but it is still just four bytes per emoji in UTF-8, or around nine ASCII characters written out as an HTML character reference. Since embedded symbols normally make up only a small percentage of the total number of characters in a piece of text the total file size goes up only very slightly.
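
The difference between the image size and the transferred code size is easy to see; a small Python sketch using the loudly crying face:

    # It is the character code, not the image, that travels over the wire.
    crying = "\U0001F62D"               # LOUDLY CRYING FACE, U+1F62D
    print(len(crying.encode("utf-8")))  # 4 bytes in UTF-8
    print(len("&#x1F62D;"))             # 9 bytes as an HTML character reference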

Embedded Symbols

In most cases the representation of an emoji can be forced to a small black and white image with an FE0E code following the code for the emoji. This is known as the "text" representation of the emoji, because the smaller black and white images embed into text somewhat less distractingly than larger color images. Here is the weary cat face 🙀︎ forced into a text representation.
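
The same mechanism works in the other direction with an FE0F code, which requests the color "emoji" form. A minimal sketch in Python; whether either request is honored depends on the fonts of the platform doing the displaying:

    # U+FE0E requests the black and white "text" form of the preceding
    # character, and U+FE0F requests the color "emoji" form.
    cat = "\U0001F640"        # WEARY CAT FACE
    print(cat + "\uFE0E")     # text presentation, where supported
    print(cat + "\uFE0F")     # emoji presentation, where supported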

Everything about the Unicode embedded symbols and the emojis works quite well except for the codes themselves. A piece of text with lots of embedded symbols is just difficult to work with. It requires either manual entry of the code for each Unicode symbol or the use of some type of compiler program, and compilers on top of compilers get very cumbersome.

Any type of writing, programming or communication requires the use of at least one level of compiler. Any word processing program is a type of compiler: you type text, but that text is then converted and stored as formatted lines of code, and when you open the same word processing file those lines of formatted code are converted back into the text that you had typed. Because the code that a word processor generates is still mostly just the ASCII characters that were keyed in, the total file size is only very slightly larger than the minimum file size that would be required to store just the unformatted text. Web based services such as Twitter, Facebook and all of the web sites that host forums and discussion threads do the same thing: you type text, and then that text is stored as some type of formatted lines of code.

For all of these types of writing and communication the use of embedded symbols and emojis is controlled by the software being used. The normal means of dealing with embedded symbols is cut and paste; when a symbol is copied and pasted into another location the code for that symbol is transferred to the new location. Menus and on screen keyboards for symbols are also normally available, but displaying even the 1297 emojis in a drop down menu is a bit cumbersome, and the many thousands of other Unicode symbols are even more difficult to format into menus and onscreen keyboards.

Where the codes for embedded symbols get cumbersome is when the actual lines of code are used in the writing and editing process. Each time that a symbol is used (ΔT for example) the code for that symbol has to be keyed in or cut and pasted from a list. In raw HTML what ΔT actually looks like is the code 394 for the delta with a special character sequence in front of it to denote that a Unicode character follows: &#x394;T (see the sketch below). That is simple enough for the occasional use of just one special symbol, but when the density of symbols increases writing becomes difficult. Not only is the keying in of the codes time consuming, but when the writing is read over it does not look like it is supposed to. In place of the symbols there are just a bunch of confusing looking strings of code. To see the writing formatted correctly it has to be viewed with some other program that it is intended to work with. It is easy enough to view finished HTML in a browser, but then the text can only be read and not edited.

There have of course been many HTML compiler programs available over the years that allow writing to be done up a level, so to speak, where the text can be edited more or less as it is going to look in a web browser. These compiler programs are great for avoiding manually keying in codes for symbols and other formatting, but they are sort of dangerous. When something goes wrong, and something always seems to eventually go wrong, the only sure fire way to fix it is to go back to the standardized code. HTML itself is a higher level programming language, but at least the basic features of HTML have been widely standardized for more than two decades. The HTML compiler programs have not been standardized, which means that editing an HTML document in multiple compiler programs generally results in some really ugly looking code that is prone to failure.
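
Coming back to the ΔT example, what the raw codes look like and what a browser does with them can be sketched with Python's standard html module (the example string is hypothetical):

    # A raw HTML string full of embedded symbol codes, and the text a
    # browser would actually display after decoding the references.
    import html
    raw = "The temperature rise &#x394;T was 12&#xB0;C"
    print(html.unescape(raw))  # The temperature rise ΔT was 12°C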

The easy way to use HTML is as it was originally intended: as a markup language, a programming language designed and built for writing and communication. HTML is very powerful in that a document can be typed out in a basic text editing program and the required formatting can be added on easily with a minimum of additional lines of code. HTML programming is not effortless, the syntax of each formatting feature has to be done correctly, but it is just a few lines of code for each feature. Displaying formatted text with FORTRAN or C++ requires many lines of code just to specify where the text will be located, and doing anything fancy like changing colors or fonts requires pages of interface code to take over the operation of the display device.

FORTRAN and C++ were developed to do specific things. These are sophisticated programming languages with logic based syntax. They are all about "IF" statements and loops. The flow of the programs involves checks and iterations: check to see if some condition exists; if it does then the program goes in one direction, and if it does not then the program goes in another direction. There can of course be more than two directions, and the checks involve mathematical operations. Value comparisons and calculations are used to determine what is happening. This all takes place within loops, and loops inside loops. Checks within a loop determine where the flow goes, and normally some combination of conditions will cause the flow to return to the top of the loop. There are also always stops; loops can't just run round and round with nothing happening. In basic computational programs the stops are requests for user input. More sophisticated interfaces may require periodic checking of certain conditions. Where is the pointing device? What is the current level of processor and memory use of the entire system?

FORTRAN was developed for engineers. Its programming environment is similar to C and C++, but FORTRAN is both streamlined and beefed up with more powerful features. FORTRAN works great for crunching numbers, but it is as cumbersome as working directly in C or C++ in terms of formatting and interface. If all inputs are numbers keyed in and all outputs are "printed" to a screen or a printer then FORTRAN is easy to use and very powerful. Something as simple as storing outputs for later use as input though requires pages and pages of sophisticated code to interface with specific devices and other software packages.

The point of all this is that doing anything substantive with computer systems requires using some type of programming language. What we all know as computer systems these days are built upon two very old rival software lineages: Apple's and Microsoft's. The original operating systems were deceptively basic in their text line operator interfaces but very powerful in terms of their ability to operate a wide variety of hardware. Very quickly Apple abandoned the text line interface in favor of a pointing device (the mouse) and menus. This was a much better way to deal with information, but it also brought computer users farther away from anything real. Meanwhile Microsoft developed a competing "Windows" interface, but the text line interface remained the primary mode of use for many years. Eventually both the Apple and Microsoft operating systems got so cumbersome that nobody seems to know what they are doing. Getting anything done still requires the use of some programming language, some standard that can be counted on to continue to function. HTML was intended to meet this requirement. HTML runs on all computer systems. HTML is sophisticated in that there is quite a lot going on "behind the scenes" in server software and web browsers, but HTML is also simple and basic because it is built upon programming syntax that makes sense. The big thing about HTML is that short lines of code actually do something in terms of getting text formatted and displayed.

The limitation of HTML is that it was intended to do only certain very basic things: format and display text, and link text from one location to another. Just those features are very powerful, but trying to do other things with HTML gets quite cumbersome. The problem is that the very simplicity of HTML which allowed it to be so universally adopted also eventually resulted in a situation where programmers have been trying to do much more with HTML than it really can (or should) do. HTML is not an operating system; it is supposed to run within other software packages that directly operate the hardware. Doing more with HTML though has required expanding its capabilities, and this has taken the form of calling external scripts. Taken to an extreme this ends up being a case of an HTML file doing nothing but calling an external script that then works with additional features of a web browser. Those additional browser features are where much of the modern internet and really a large part of all of modern computing take place. But it is not real. What is that programming language called? It is JavaScript, which despite the name is not Java; Java really was a plug-in, an add-on in addition to HTML, web browsers and server software. All web browsers now support additional levels of sophistication beyond what HTML can do, but again it just is not real. Where is the standardized programming language? Where are even the lines of code? The syntax for operating those additional levels of browser features does not make much sense. Essentially it is not a programming language for use by humans, it is a programming language that interfaces with compiler programs. It is not even so much a programming language as a short hand means of storing huge numbers of lines of code. Most of the text has been removed from the lines of code so that much more can be done with the same amount of disk storage. This is what the compiler programs require: efficient storage of large numbers of lines of code, because each time a change is made with a compiler program more lines of code tend to be added.

Ultimately it is one little problem with HTML that has continued to drive the computer industry towards higher levels of sophistication in browser features, and that is the handling of embedded symbols. It is actually a shortcoming of the standard ASCII character set that is at the root of the problem. The lack of subscript numbers and a few other common symbols such as the degree symbol means that what can be written directly in ASCII text falls short for many types of technical writing. Writers always choose a software package that allows them to just insert the required special symbols and see them displayed as the text is being written and edited. This tends to distance writers from any sort of use of a standardized programming language, and it is this divide that has rocked the computer industry.

People who need the extra symbols for clear representation of technical details are forced to use some other software package, but those people are then cut off from anything real or substantive in the computer industry. They don't stop using electronic communication, they just don't have any direct control over their use of computers and the internet. Other people who stubbornly hold onto standard programming languages end up either focusing on areas that don't involve subscript numbers and special symbols or they spend considerable effort digging through confusing looking text full of codes for Unicode characters.

In a way emojis are a solution to this conundrum. The 1297 emojis are quite useful, but they obviously don't fit in any standard eight bit character set. It is this distinction between large numbers of useful additional symbols and the few very common symbols that really need to be included in a character set that is important. The emojis are fun to work with because they bring an additional level of creativity to writing. Even more importantly though, the emojis make it clear that an extremely wide range of symbols does not fit in a standard character set. Eight bit storage has lots of room for common symbols, but it does not have room to waste on whimsical and useless symbols. There is no place for smiley faces in the eight bit character set. Those 256 characters are required for symbols that are actually used in writing. Oh, OK, you can have one smiley face; it was after all the first printable glyph in the old IBM character set.☺︎







©2016 Michael Traum