Flexible Machine Translation and Information Architectu

Agents in Information Architecture

for Intelligent Distributed Multilingual Document Retrieval Service

Key-Sun CHOI

Korea Advanced Institute of Science and Technology

Center for Artificial Intelligence Research

Department of Computer Science

373-1 Kusong-dong Yusong-ku Taejon 305-701 Korea

kschoi@cs.kaist.ac.kr

Abstract

Intelligent document retrieval assumes full understanding of documents. Here "intelligence" means user-friendliness together with an understanding of documents as complete as the author's own. This understanding is embodied in a "complete document", the fully recovered representation of the author's understanding. If the reader (or user) holds the same contents as the complete document, the right information can be gathered according to the reader's requests. The levels of knowledge to be recovered together constitute the so-called "document architecture". The recovery process is explained under the paradigm of "information architecture". The participating processes and entities are autonomous nodes in an information architecture network; they are "distributed", competing and compensating objects. "Multi-lingual" service is realized by the paradigm of "flexible machine translation" under the concept of information architecture. An intrinsic property of information architecture is that it evolves autonomously and organizes itself. The flexibility comes from the fact that every level of document architecture can invoke the transfer from the source sentence (or author's document) to the target language (or user's requests). Furthermore, because the participating nodes are redundant, they are "flexible" in the sense of competing with and compensating for one another. Information retrieval and machine translation are thus explained and implemented under the single paradigm of "information architecture" and its standardization.

1. Preface

A document retrieval service assumes that the right information is passed to the right location via the right path. The right information for a given location is gathered from a collection of documents by methods such as search, extraction, filtering and browsing. Here, a "location" means a user, an agent or a computer that has requested information through a query statement. A unit of "information" comes from "documents": a document is the surface of the information that it represents. A unit of information is gathered from a set of documents under a certain constraint of view. The "view" of a location (or author) changes how information is expressed. Because every author has a different view, each document, as the written result of its author, is embodied in a different way even when the information content is the same. That is why the same information can be extracted from a collection of documents, each of which has a different surface expression.

The right path of information is the shortest and most efficient way from the location to the information (or document), or vice versa. From the standpoint of "information", information travels from one document to another and finally arrives at the right location. We call this "information extraction (or retrieval)" from the viewpoint of information and "document retrieval" from the viewpoint of documents. From the standpoint of a "location", on the other hand, we navigate from one document to another by browsing until the right information is found in a document. That is called "information searching" or "navigation" of the information space (or document space).

"Information flow" is a term that covers both information extraction and information navigation. "Intelligent" document retrieval assumes an efficient information flow based on some "understanding" of documents; an intelligent way of retrieving documents cannot arise without understanding them. If documents are understandable to a location - whether the location is a person, an agent or a computer - we can find the right path to the right documents for the given query of that location.

"Understanding" resides both in documents and in locations. A document is said to be understandable if it is written in well-formed sentences and well-presented styles, with well-expressed information content, a well-structured scenario of explanation, knowledge well known to the assumed readers, a good fit to the assumed readers' viewpoints, a good anchoring in the assumed readers' situation, and so on. If the document is not enough for a location to understand, that location can ask the author's location for more information (or explanation). On the other hand, if some readers understand a document but others cannot, those locations are at different knowledge levels; in other words, there is a knowledge gap between them. The first problem in discussing efficient document retrieval is therefore what must reside in a document for it to be understood. In this paper, the concept of "document architecture" is presented as an explanatory model of "understanding" documents.

Document space is spread over "distributed" network nodes. Because documents are only the surface of information, we can say that information is scattered in a distributed information space that is embodied physically in distributed network nodes but linked logically. To understand a document is to process (or transform) it from its surface into a deeper, understandable form. The knowledge for such processing is itself distributed in a network. The model of "information architecture" is presented to explain the processing of documents. When the entities of an information architecture are distributed, we call it an "information architecture network".

"Multi-lingual" service is also claimed to be supported by the concept of "understanding", which is explained through the concept of "document architecture". Furthermore, the processing (or translation) also takes place within the information architecture.

Finally, the implementation of document architecture assumes a standardized specification of each of its layers. Here, the "document interchange format" refers to the specification of each layer of the document architecture.

2. Document Architecture for Intelligent Document Retrieval

2.1 Document Architecture and Understanding of Document

"Document architecture" describes the semantics of a document; it contains everything needed for understanding the document. It consists of five layers inside the document: character, layout, structure of data, structure of information, and knowledge. The first two layers (character and layout) are categorized as the "surface" of the document. The structure-of-data layer can be seen as the syntactic structure, and the remaining layers as the semantics of the document.

A document is written in characters and in a layout structure (for example, paragraphs, titles, etc.). When the author writes a document, he knows the written characters and their layout structure. If he types on a computer, his writing is stored in a standardized character code. These are the surface of documents, for every document is written in characters and in a layout. The reader is assumed to know them. If not, another process called "multi-lingual" processing is involved, which will be explained later.

However, the more understanding is involved, the more structure of data must be extracted. For example, the link between a figure and its related text units is such a syntactic structure between data in the document. Whenever the reader wants to understand, the first job is to find such syntactic linkage inside the document: for example, links among text units, footnotes, references, figures, and so on. Those structures are physically linked in the so-called hypertext.

Moreover, when a location wants to read the text units, it must know the document's linguistic structure: its morphological, syntactic and semantic structure. Without such a representation of the processed document, no location (computer or person) can understand the sentences in the document. Finally, the full document requires full knowledge of the terminology and then of the domain on which the document is written.
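
As an illustration only, the five layers and their grouping into surface, syntactic structure and semantics can be written down as a small data structure. The names Layer and category are assumptions made for this sketch; they are not part of any specification given in this paper.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The five layers of document architecture, ordered from surface to depth."""
    CHARACTER = 1
    LAYOUT = 2
    STRUCTURE_OF_DATA = 3
    STRUCTURE_OF_INFORMATION = 4
    KNOWLEDGE = 5


def category(layer: Layer) -> str:
    """Group a layer as the text does: surface, syntactic structure, or semantics."""
    if layer <= Layer.LAYOUT:
        return "surface"
    if layer == Layer.STRUCTURE_OF_DATA:
        return "syntactic"
    return "semantic"
```

Ordering the layers numerically is a design choice of the sketch: it lets "deeper" be expressed as a simple comparison.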

To understand a document means that the location has all the knowledge in the document architecture of that document. This knowledge is assumed to cover the character code, document layout, structure of data, structure of information (linguistic information), terminology and domain. Document retrieval is a kind of "communication" between the information producer (writer) and consumer (reader), and the action of retrieval involves the process of communication. Communication assumes the same level of knowledge on both sides, for communication is invoked after understanding. Communication with false understanding follows from different knowledge levels. We call this the "communication bottleneck", which entails a knowledge gap. The problem is how to overcome the knowledge gap in order to reduce the communication bottleneck. Consider again the knowledge states of the author and reader locations of a document.

The author location of a document fully understands the document it writes. It knows every knowledge level of the document architecture inside that document. The author also assumes a certain knowledge level of the reader when writing the document. Whenever that expectation is not fulfilled, the communication bottleneck arises.

Next, let us take the reader's side. For a given document, a reader first tries to understand with the reader's own local knowledge. If the reader is a human, the human's cognitive memory is used; if the reader is a computer, it processes the document based on its local database. If the reader does not yet fully understand, it uses global, public knowledge: for example, a dictionary or an encyclopedia. If the reader is a computer, such global knowledge is written in a machine-understandable form instead of a human-readable form on paper. If the document is still not understandable, the next step is that the reader asks questions to the author location, and the author responds with an appropriate answer. Such an answer may fill the gap in the document architecture on the reader's side. Such feedback information helps the reader to understand the author's intention.
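
The reader's escalation described above (local knowledge first, then public knowledge, then a question to the author) can be sketched as follows. The function name resolve, the use of dictionaries as knowledge stores and the ask_author callback are all assumptions of this sketch, not a described implementation.

```python
from typing import Callable, Dict, Optional


def resolve(term: str,
            local: Dict[str, str],
            public: Dict[str, str],
            ask_author: Callable[[str], Optional[str]]) -> Optional[str]:
    """Look a term up locally, then in public knowledge, then ask the author."""
    if term in local:
        return local[term]        # the reader's own memory or local database
    if term in public:
        return public[term]       # global knowledge, e.g. a dictionary
    answer = ask_author(term)     # last resort: a question to the author location
    if answer is not None:
        local[term] = answer      # the feedback fills the gap on the reader's side
    return answer
```

Storing the author's answer in the local store is one simple way to model how feedback reduces the knowledge gap for later requests.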

2.2 Complete Document and Document Interchange Format Standards

The "complete document" is defined as the complete description of the knowledge levels inside a document under the concept of document architecture. Whenever communication is invoked, both sides of the document assume that they can recover the complete document from the surface structure of the document, that is, the author's writing.

The complete document is the completed description of each layer specification of the document architecture. Such a specification is the precondition of correct communication. We call it the "document interchange format". The complete document is assumed to be recovered from the knowledge of both sides. Because both sides cooperate for the reader location's understanding, they compensate for each other. Furthermore, they also compete for completeness. The next section presents the concept of "information architecture" to clarify the entities and processes in both sides' knowledge for communication.

2.3 Information Architecture as a Paradigm of Information Processing

2.3.1 Definition of Information Architecture

The concept of "information architecture" is introduced to refine the processes under the document architecture. The configuration of information architecture (figure 1) consists of entities, transformations and constraints on the transformations. The entities are data, information, and knowledge. The transformations between entities are four processes: information extraction, knowledge extraction, idea generation and data presentation. The constraints are of two kinds: view and situation.

----------------------------------------------------------------------------------------------------------------------------------

Figure 1. Configuration of Information Architecture

-----------------------------------------------------------------------------------------------------------------------------------

The first entity of information architecture is "data". Data is just the surface of the document architecture. Examples are text, word-processing output or multimedia documents. In the form of a document, the author writes data, not information. The second entity is "information", which is the result of structuring data. When a document is converted to its state of information, surface units in the document have cross-referential links between them, and linguistic units are annotated with linguistic tags for further processing. Examples of information are hypertext, in the sense of structured data, and morphologically or syntactically tagged sequences of sentence units, as processed data that is nearer to understanding. The last entity is "knowledge". It is the deepest entity of human cognition, and it is held in a somewhat normalized form in the location (whether computer or human). Such knowledge contains terminological knowledge, domain knowledge and so on.

Information is extracted from data by stripping off the view of the information. "Information extraction" is a transformation from data to information. When the contents of many documents are summarized into one piece of information, that summarization is also a process of information extraction. A "view" is analogous to the "clothes" in which information is displayed as data to the reader of a document. The views of authors cause the same information to be written in different styles of documents. When the readers are a class of children, the document is written in a very easy style and contains many figures. A document for experts contains many formal formulas to convey very technical facts. These differences depend on the "view" of the "location". The process of "data generation" (data presentation) from information to data produces documents that are well presented for a given level of readers.

Knowledge is formalized from information after normalization by the "situation" factor. That process is called "knowledge extraction". It used to be called "learning" or "knowledge acquisition"; however, those terms were used for the direct transformation from data to knowledge. The reverse process, "idea generation", creates information appropriate to a given situation. The situation anchors the parameters in a unit of knowledge and generates instances of knowledge as information. A unit of information can thus be generated from a given situation by assembling a collection of knowledge. Such a process helps users to form their own ideas. However, because such an idea is the final form of data, its representation may not be understandable to others from the standpoint of "data".
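
To summarize this subsection, the three entities, four transformations and two constraints can be laid out as a small table. This is only a sketch of the configuration as described in the text; the names Transformation, TRANSFORMATIONS and find are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transformation:
    name: str
    source: str      # entity the process starts from
    target: str      # entity the process produces
    constraint: str  # "view" or "situation"


TRANSFORMATIONS = (
    Transformation("information extraction", "data", "information", "view"),
    Transformation("data presentation", "information", "data", "view"),
    Transformation("knowledge extraction", "information", "knowledge", "situation"),
    Transformation("idea generation", "knowledge", "information", "situation"),
)


def find(source: str, target: str) -> Optional[Transformation]:
    """Return the transformation between two entities, or None if there is none."""
    for t in TRANSFORMATIONS:
        if (t.source, t.target) == (source, target):
            return t
    return None
```

In this layout there is no direct step between data and knowledge: every path passes through information, which matches the distinction the text draws from the older notion of "learning".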

2.3.2 Normalized Entity of Information Architecture and Standardization

Each entity of information architecture can be claimed to have a best specification. Each entity is embodied in many applications. One application of data is the "document". The best specification of a document is the best style sheet and the best writing, based on the best presentation method for the given view of the data. Let us define "well-presented data" as the best form of data for the user of that data. If we define such well-presented data for the most persuasive documents, we have a goal to pursue in the study of data and information under the paradigm of information architecture.

The normalized form of information is called "well-structured information". For example, a well-linked hypertext without waste is in the normalized state: every information unit can be found in an optimal way. The entity "knowledge" has its normalized form as a "well-formed representation". A normalized logical form is one example.

The standardization of the document interchange format can be defined under the concept of the normalized entity. Every layer of the document architecture is projected onto the information architecture. The next section returns to the discussion of document retrieval in a distributed environment.

3. Information Architecture Network for Intelligent Distributed Document Retrieval

As seen in the previous section, the author has the complete knowledge needed to understand a given document but does not provide the complete document. The complete document is a prerequisite of an intelligent document retrieval service. The ultimate purpose of document retrieval is to achieve efficient communication between writers and readers. The problem is where the complete document can be recovered. It is claimed that the information architecture is one solution.

3.1 Information Architecture Network to Recover the Document Architecture

Authors have the complete knowledge needed to understand their writings, but they do not provide the complete documents. The source for filling in the blanks of the complete document is not the location of the author but the analyzing side of the information architecture. Such a knowledge location is different from the author's location. Each piece of knowledge resides in the information architecture network, whose physical form varies widely. If the location is a computer, the knowledge is in machine-readable form and is located either in a network or in the computer's own storage. The information architecture on the author's side constructs and recovers the complete document architecture, as on the left side of figure 2. The reader's understanding, on the other hand, is based on the complete document on the reader's side, as on the right side of figure 2.

--------------------------------------------------------------------------------------------------------------------------------------

Figure 2. Distributed Information Architecture

--------------------------------------------------------------------------------------------------------------------------------------

The dynamics of the information architecture network differ from the static view in figure 2. The document architectures of reader and writer are compensating and competing objects. As shown in figure 3, if the reader has complete knowledge to understand the writer's document, the writer has only to provide the surface form of the document. The reader's side of the document architecture can recover every layer of the document architecture by using its own knowledge. However, if the reader has only partial knowledge, the author's side should compensate for the reader's knowledge.

--------------------------------------------------------------------------------------------------------------------------------------

Figure 3. Cases of Information Architecture Network

-------------------------------------------------------------------------------------------------------------------------------------

3.2 Configuration of Information Architecture Network and Normalization

The knowledge for recovering the complete document is spread over the nodes of the information architecture network, as shown in figure 4. When the knowledge is insufficient to recover the complete document, the reader (or user) puts questions to the author's location and the author gives a solution. Such feedback processes and contents are logged in an interim node, which may be implemented physically as a logging server in the network. The information architecture network evolves through its self-organizing mechanism of learning and restructuring, and the logging server is the source of this automatic evolution. As claimed above, the information architecture has processes between entities for information extraction and knowledge extraction. Because each process and entity is autonomous, the nodes participating in the information architecture evolve autonomously. The processing is not centralized but distributed. There may be not one participating node but several redundant (not duplicated) nodes for the same function. They therefore compensate for and compete with each other.

-----------------------------------------------------------------------------------------------------------------------------------

Figure 4. Information Architecture Network: Configuration

---------------------------------------------------------------------------------------------------------------------------------------

The information flow is incremental. At first, the surface form of the document is passed to the reader. When the reader cannot understand it, it is rejected and returned to the author's location. For this purpose, the reader's side has an evaluator that measures whether the given form of the document is enough for the reader to understand.
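
A minimal sketch of this incremental flow is given below, under strong simplifying assumptions: each layer is either known to the reader or not, the evaluator simply reports the first layer that is neither received nor recoverable, and the log is a plain list standing in for the interim logging node. The names missing_layer and deliver are hypothetical.

```python
LAYERS = ("character", "layout", "structure of data",
          "structure of information", "knowledge")


def missing_layer(received, reader_knows):
    """Reader-side evaluator: first layer neither received nor recoverable, else None."""
    for layer in LAYERS:
        if layer not in received and layer not in reader_knows:
            return layer
    return None


def deliver(reader_knows, log):
    """Pass the surface form first, then compensate layer by layer on each rejection."""
    received = {"character", "layout"}       # surface form of the document
    gap = missing_layer(received, reader_knows)
    while gap is not None:
        log.append(gap)                      # rejection logged on the interim node
        received.add(gap)                    # the author's side compensates for the gap
        gap = missing_layer(received, reader_knows)
    return received
```

For a reader who already knows every layer, deliver returns only the surface form and logs nothing, which corresponds to the case in which the writer has only to provide the surface form of the document.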

In figure 4, the document architectures of the two sides are different and separate from each other. However, if there is a standardized form and process of document interchange format, they can be combined into one, as in figure 5. The physical construction cost will be reduced and the operation will be more efficient.

--------------------------------------------------------------------------------------------------------------------------------------

Figure 5. Distributed Information Architecture Network Based on Document Interchange Format Standards

--------------------------------------------------------------------------------------------------------------------------------------

4. Flexible Machine Translation for Multi-lingual Document Retrieval

4.1 Configuration of Flexible Machine Translation and Information Architecture

In fact, the process of translation does not always assume understanding. On some occasions, a set of translation templates is enough to translate. For example, articles in stock news use only special usages and domain-specific expressions; translating them does not invoke the process of recovering the complete document. If such a process fails, the next process at a deeper level of the document architecture starts to analyze and translate. That process fills in the next level of the document architecture, and its result is transferred to the reader's language side by using the appropriate transfer knowledge. The sequence of processes that recover the complete document is awakened incrementally, on demand. The approach is flexible in the sense that the invocation of processes is flexible. We call this "Flexible Machine Translation" (FMT). As shown in figure 6, after a failure of the first process, the "morphology analyzer", that process either suggests an alternative solution or passes control to the next process, the "syntactic analyzer". The result of the syntactic analyzer is transferred on the basis of syntactic pattern transfer knowledge. The evaluator of the corresponding node, the syntactic generator on the reader's side, measures whether the output of the syntactic transfer can be generated up to the final surface form. These processes continue whenever the reader's side sends a rejection signal. The feedback information is also logged, just as in the information architecture network (figure 4). The process progresses until the interlingua is reached. However, if the reader's side accepts the transferred document, the translation ends without proceeding toward the interlingua.
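
The demand-based control flow of FMT can be sketched as a loop over stages ordered from shallow to deep. The sketch below is not the system described in this paper: the stage functions, the toy template table, the toy lexicon and the "_t" suffix marking pseudo-target words are all hypothetical stand-ins, and the accept callback plays the role of the reader-side evaluator.

```python
from typing import Callable, List, Optional, Tuple

# A stage maps a source sentence to a candidate translation, or None on failure.
Stage = Tuple[str, Callable[[str], Optional[str]]]

TEMPLATES = {"stocks rose": "<stocks rose>_t"}        # toy template table
LEXICON = {"prices": "prices_t", "fell": "fell_t"}    # toy transfer lexicon


def by_template(sentence: str) -> Optional[str]:
    """Shallowest stage: whole-sentence template lookup."""
    return TEMPLATES.get(sentence)


def by_words(sentence: str) -> Optional[str]:
    """Deeper stand-in stage: word-by-word transfer, failing on unknown words."""
    words = sentence.split()
    if all(w in LEXICON for w in words):
        return " ".join(LEXICON[w] for w in words)
    return None


def flexible_translate(sentence: str, stages: List[Stage],
                       accept: Callable[[str], bool], log: list):
    """Try stages from shallow to deep; stop at the first accepted output."""
    for name, translate in stages:
        candidate = translate(sentence)
        if candidate is not None and accept(candidate):
            return name, candidate
        log.append((name, sentence))   # failure or rejection is logged as feedback
    return None                        # every stage failed or was rejected
```

With the stages [("template", by_template), ("word transfer", by_words)] and an evaluator that accepts everything, "stocks rose" is handled by the template stage alone, while "prices fell" falls through to the deeper stage and leaves one entry in the log.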

---------------------------------------------------------------------------------------------------------------------------------------

Figure 6. Flexible Machine Translation: Configuration

-------------------------------------------------------------------------------------------------------------------------------------

4.2 Distributed Flexible Machine Translation

As shown in the previous section, every module in a flexible machine translation system can be linked to the (reader's) target language generation every time that module produces a result. Because every piece of feedback is logged and stored on the interim node, the system grows. Flexible machine translation systems are evolutionary and self-organizing networks.

Flexible machine translation is embodied in a "distributed" way. Every module can be a node in the network. The system is fault-tolerant because each level consists of competing nodes that claim the same function. Every node is also a competing and compensating process or entity, as shown in figure 7.
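
One way to picture competing and compensating nodes at a single level is sketched below. It assumes, purely for illustration, that each node returns a (score, output) pair and signals failure by raising an exception; the paper does not specify how nodes are compared, so the scoring is an assumption, as is the name run_level.

```python
def run_level(nodes, item):
    """Ask every redundant node; keep the best-scoring answer among those that respond."""
    best = None
    for node in nodes:
        try:
            score, output = node(item)
        except Exception:
            continue                      # a failed node is compensated for by the others
        if best is None or score > best[0]:
            best = (score, output)        # nodes compete on their score
    return None if best is None else best[1]
```

The level as a whole fails only when every node claiming the function fails, which is the sense of fault tolerance used here.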

-------------------------------------------------------------------------------------------------------------------------------------

Figure 7. DFMT: Distributed Flexible Machine Translation Operation

-------------------------------------------------------------------------------------------------------------------------------------

When a system is developed on the flexible machine translation paradigm, the full system can operate from the beginning stage of development. The transfer knowledge is a kind of document. The (author's) source sentence is a query that seeks its component-wise patterns, which are linked (or transferred) to the (reader's) target sentence. Such processes are the same as those of document retrieval.
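
The view of translation as retrieval can be illustrated with a simple lookup in which the source sentence acts as the query and a store of transfer patterns acts as the document collection. Greedy longest-match lookup is a technique chosen for this sketch, not one stated in the paper, and the pattern store with its placeholder values "P1" and "P2" is hypothetical.

```python
def retrieve_patterns(sentence, patterns):
    """Greedy longest-match retrieval of component patterns for a source sentence."""
    words = sentence.split()
    found, i = [], 0
    while i < len(words):
        # Try the longest remaining chunk first, then shorter ones.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in patterns:
                found.append(patterns[chunk])
                i = j
                break
        else:
            found.append(words[i])   # no pattern retrieved: keep the source word
            i += 1
    return found
```

For example, with the store {"stock price": "P1", "rose": "P2"}, the query "stock price rose sharply" retrieves P1 and P2 and leaves "sharply" untransferred, which marks where a deeper level of processing would be needed.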

5. Conclusion: Toward the Service of Intelligent Distributed Multi-lingual Document Retrieval

It is claimed that a balanced information flow can be embodied by the document architecture and the standardization of its document interchange format. The overall picture is drawn under the paradigm of information architecture. Mono-lingual document retrieval and multi-lingual translation service fall under the one concept of information architecture. The "intelligence" of document retrieval assumes full understanding, which is shown to be embodied in a complete document.

In operating this paradigm, standardization is one of the practical requirements for successful communication. Assuming such standards, the knowledge for recovering the complete document can be located in a "distributed" network, regardless of whether it is embodied virtually or physically. Practical cooperation in a distributed environment is possible under this paradigm, and operation can start from the beginning stage of development. These practical points support this paradigm, "document architecture under information architecture network", and its application, "intelligent distributed multi-lingual document retrieval".