UIMA Tutorial and Users Guide - Apache UIMA - The Apache PDF

164 Pages·2009·2.67 MB·English

UIMA Tutorial and Developers' Guides

Written and maintained by the Apache UIMA Development Community
Version 2.3.0-incubating

Copyright © 2004, 2006 International Business Machines Corporation
Copyright © 2006, 2009 The Apache Software Foundation

Incubation Notice and Disclaimer. Apache UIMA is an effort undergoing incubation at the Apache Software Foundation (ASF). Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

License and Disclaimer. The ASF licenses this documentation to you under the Apache License, Version 2.0 (the "License"); you may not use this documentation except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, this documentation and its contents are distributed under the License on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Trademarks. All terms mentioned in the text that are known to be trademarks or service marks have been appropriately capitalized. Use of such terms in this book should not be regarded as affecting the validity of the trademark or service mark.

Published December, 2009

Table of Contents

1. Annotator & AE Developer's Guide
    1.1. Getting Started
        1.1.1. Defining Types
        1.1.2. Generating Java Source Files for CAS Types
        1.1.3. Developing Your Annotator Code
        1.1.4. Creating the XML Descriptor
        1.1.5. Testing Your Annotator
    1.2. Configuration and Logging
        1.2.1. Configuration Parameters
        1.2.2. Logging
    1.3. Building Aggregate Analysis Engines
        1.3.1. Combining Annotators
        1.3.2. AAEs can also contain CAS Consumers
        1.3.3. Reading the Results of Previous Annotators
    1.4. Other examples
    1.5. Additional Topics
        1.5.1. Annotator Methods
        1.5.2. Reporting errors from Annotators
        1.5.3. Throwing Exceptions from Annotators
        1.5.4. Accessing External Resource Files
        1.5.5. Result Specifications
        1.5.6. Class path setup when using JCas
        1.5.7. Using the Shell Scripts
    1.6. Common Pitfalls
    1.7. UIMA Objects in Eclipse Debugger
    1.8. Analysis Engine XML Descriptor
        1.8.1. Header and Annotator Class Identification
        1.8.2. Simple Metadata Attributes
        1.8.3. Type System Definition
        1.8.4. Capabilities
        1.8.5. Configuration Parameters (Optional)
2. CPE Developer's Guide
    2.1. CPE Concepts
    2.2. CPE Configurator and CAS viewer
        2.2.1. Using the CPE Configurator
        2.2.2. Running the CPE Configurator from Eclipse
    2.3. Running a CPE from Your Own Java Application
        2.3.1. Using Listeners
    2.4. Developing Collection Processing Components
        2.4.1. Developing Collection Readers
        2.4.2. Developing CAS Initializers
        2.4.3. Developing CAS Consumers
    2.5. Deploying a CPE
        2.5.1. Deploying Managed CAS Processors
        2.5.2. Deploying Non-managed CAS Processors
        2.5.3. Deploying Integrated CAS Processors
    2.6. Collection Processing Examples
3. Application Developer's Guide
    3.1. The UIMAFramework Class
    3.2. Using Analysis Engines
        3.2.1. Instantiating an Analysis Engine
        3.2.2. Analyzing Text Documents
        3.2.3. Analyzing Non-Text Artifacts
        3.2.4. Accessing Analysis Results
        3.2.5. Multi-threaded Applications
        3.2.6. Multiple AEs & Creating Shared CASes
        3.2.7. Saving CASes to file systems
    3.3. Using Collection Processing Engines
        3.3.1. Running a CPE from a Descriptor
        3.3.2. Configuring a CPE Descriptor Programmatically
    3.4. Setting Configuration Parameters
    3.5. Integrating Text Analysis and Search
        3.5.1. Building an Index
        3.5.2. Semantic Search Query Tool
    3.6. Working with Remote Services
        3.6.1. Deploying as SOAP Service
        3.6.2. Deploying as a Vinci Service
        3.6.3. Calling a UIMA Service
        3.6.4. Restrictions on remotely deployed services
        3.6.5. The Vinci Naming Services (VNS)
        3.6.6. Configuring Timeout Settings
    3.7. Increasing performance using parallelism
    3.8. Monitoring AE Performance using JMX
    3.9. Performance Tuning Options
4. Flow Controller Developer's Guide
    4.1. Developing the Flow Controller Code
        4.1.1. Flow Controller Interface Overview
        4.1.2. Example Code
    4.2. Creating the Flow Controller Descriptor
    4.3. Adding Flow Controller to an Aggregate
    4.4. Adding Flow Controller to CPE
    4.5. Using Flow Controllers with CAS Multipliers
    4.6. Continuing the Flow When Exceptions Occur
5. Annotations, Artifacts & Sofas
    5.1. Terminology
        5.1.1. Artifact
        5.1.2. Subject of Analysis – Sofa
    5.2. Formats of Sofa Data
    5.3. Setting and Accessing Sofa Data
        5.3.1. Setting Sofa Data
        5.3.2. Accessing Sofa Data
        5.3.3. Accessing Sofa Data using a Java Stream
    5.4. The Sofa Feature Structure
    5.5. Annotations
        5.5.1. Built-in Annotation types
        5.5.2. Annotations have an associated Sofa
    5.6. AnnotationBase
6. Multiple CAS Views
    6.1. CAS Views and Sofas
        6.1.1. Naming CAS Views and Sofas
        6.1.2. Multi/Single View parts in Applications
    6.2. Multi-View Components
        6.2.1. Deciding: Multi-View
        6.2.2. Multi-View: additional capabilities
        6.2.3. Component XML metadata
    6.3. Sofa Capabilities & APIs for Apps
    6.4. Sofa Name Mapping
        6.4.1. Name Mapping in an Aggregate Descriptor
        6.4.2. Name Mapping in a CPE Descriptor
        6.4.3. CAS View for Single-View Parts
        6.4.4. Name Mapping in a UIMA Application
        6.4.5. Name Mapping for Remote Services
    6.5. JCas extensions for Multiple Views
    6.6. Sample Multi-View Application
        6.6.1. Annotator Descriptor
        6.6.2. Application Setup
        6.6.3. Annotator Processing
        6.6.4. Accessing the results of analysis
    6.7. Views API Summary
    6.8. Sofa Incompatibilities: V1 and V2
7. CAS Multiplier
    7.1. Developing the CAS Multiplier Code
        7.1.1. CAS Multiplier Interface Overview
        7.1.2. Getting an empty CAS Instance
        7.1.3. Example Code
    7.2. CAS Multiplier Descriptor
    7.3. Using CAS Multipliers in Aggregates
        7.3.1. Aggregate: Adding the CAS Multiplier
        7.3.2. CAS Multipliers and Flow Control
        7.3.3. Aggregate CAS Multipliers
    7.4. CAS Multipliers in CPE's
    7.5. Applications: Calling CAS Multipliers
        7.5.1. Output CASes
        7.5.2. CAS Multipliers with other AEs
    7.6. Merging with CAS Multipliers
        7.6.1. CAS Merging Overview
        7.6.2. Example CAS Merger
        7.6.3. SimpleTextMerger in an Aggregate
8. XMI & EMF
    8.1. Overview
    8.2. Converting an Ecore Model to or from a UIMA Type System
    8.3. Using XMI CAS Serialization
        8.3.1. Character Encoding Issues with XML Serialization

Chapter 1. Annotator and Analysis Engine Developer's Guide

This chapter describes how to develop UIMA type systems, Annotators and Analysis Engines using the UIMA SDK. It is helpful to read the UIMA Conceptual Overview chapter for a review on these concepts.

An Analysis Engine (AE) is a program that analyzes artifacts (e.g. documents) and infers information from them. Analysis Engines are constructed from building blocks called Annotators. An annotator is a component that contains analysis logic.
Annotators analyze an artifact (for example, a text document) and create additional data (metadata) about that artifact. It is a goal of UIMA that annotators need not be concerned with anything other than their analysis logic – for example the details of their deployment or their interaction with other annotators.

An Analysis Engine (AE) may contain a single annotator (this is referred to as a Primitive AE), or it may be a composition of others and therefore contain multiple annotators (this is referred to as an Aggregate AE). Primitive and aggregate AEs implement the same interface and can be used interchangeably by applications.

Annotators produce their analysis results in the form of typed Feature Structures, which are simply data structures that have a type and a set of (attribute, value) pairs. An annotation is a particular type of Feature Structure that is attached to a region of the artifact being analyzed (a span of text in a document, for example). For example, an annotator may produce an Annotation over the span of text President Bush, where the type of the Annotation is Person and the attribute fullName has the value George W. Bush, and its position in the artifact is character position 12 through character position 26. It is also possible for annotators to record information associated with the entire document rather than a particular span (these are considered Feature Structures but not Annotations).

All feature structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS). The CAS is the central data structure through which all UIMA components communicate. Included with the UIMA SDK is an easy-to-use, native Java interface to the CAS called the JCas. The JCas represents each feature structure as a Java object; the example feature structure from the previous paragraph would be an instance of a Java class Person with getFullName() and setFullName() methods.
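To make the Person example concrete, here is a minimal sketch in plain Java. It does not use the UIMA CAS or JCas API: the classes FeatureStructureSketch, AnnotationSketch and FeatureStructureDemo are hypothetical stand-ins, and the sample sentence is invented so that President Bush falls at offsets 12 through 26. The sketch also assumes the end offset is exclusive, as in Java's substring.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Demo entry point; all names in this file are illustrative, not UIMA classes. */
class FeatureStructureDemo {
    // Invented sample text: "President Bush" starts at offset 12 and ends at 26.
    static final String DOCUMENT = "In a speech President Bush thanked the volunteers.";

    /** Builds the Person annotation described in the text above. */
    static AnnotationSketch personAnnotation() {
        AnnotationSketch person = new AnnotationSketch("Person", 12, 26);
        person.features.put("fullName", "George W. Bush");
        return person;
    }

    public static void main(String[] args) {
        AnnotationSketch person = personAnnotation();
        System.out.println(person.type + " covers \"" + person.coveredText(DOCUMENT)
                + "\", fullName=" + person.features.get("fullName"));
    }
}

/** A feature structure: a type name plus (attribute, value) pairs. */
class FeatureStructureSketch {
    final String type;
    final Map<String, Object> features = new LinkedHashMap<>();

    FeatureStructureSketch(String type) {
        this.type = type;
    }
}

/** An annotation: a feature structure attached to a span of the artifact. */
class AnnotationSketch extends FeatureStructureSketch {
    final int begin; // offset of the first character of the span
    final int end;   // offset just past the last character (assumed exclusive)

    AnnotationSketch(String type, int begin, int end) {
        super(type);
        this.begin = begin;
        this.end = end;
    }

    /** Returns the text of the document that this annotation spans. */
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}
```

A generic attribute map is used here only to mirror the "(attribute, value) pairs" wording; the JCas instead exposes each attribute through typed getters and setters such as getFullName().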
Though the examples in this guide all use the JCas, it is also possible to directly access the underlying CAS system; for more information see Chapter 4, CAS Reference in UIMA References.

The remainder of this chapter will refer to the analysis of text documents and the creation of annotations that are attached to spans of text in those documents. Keep in mind that the CAS can represent arbitrary types of feature structures, and feature structures can refer to other feature structures. For example, you can use the CAS to represent a parse tree for a document. Also, the artifact that you are analyzing need not be a text document.

This guide is organized as follows:

• Section 1.1, “Getting Started” is a tutorial with step-by-step instructions for how to develop and test a simple UIMA annotator.
• Section 1.2, “Configuration and Logging” discusses how to make your UIMA annotator configurable, and how it can write messages to the UIMA log file.
• Section 1.3, “Building Aggregate Analysis Engines” describes how annotators can be combined into aggregate analysis engines. It also describes how one annotator can make use of the analysis results produced by an annotator that has run previously.
• Section 1.4, “Other examples” describes several other examples you may find interesting, including
    • SimpleTokenAndSentenceAnnotator – a simple tokenizer and sentence annotator.
    • PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational database with some annotations. It uses JDBC and, in this example, hooks up with the Open Source Apache Derby database.
• Section 1.5, “Additional Topics” describes additional features of the UIMA SDK that may help you in building your own annotators and analysis engines.
• Section 1.6, “Common Pitfalls” contains some useful guidelines to help you ensure that your annotators will work correctly in any UIMA application.
This guide does not discuss how to build UIMA Applications, which are programs that use Analysis Engines, along with other components, e.g. a search engine, document store, and user interface, to deliver a complete package of functionality to an end-user. For information on application development, see Chapter 3, “Application Developer's Guide”.

1.1. Getting Started

This section is a step-by-step tutorial that will get you started developing UIMA annotators. All of the files referred to by the examples in this chapter are in the examples directory of the UIMA SDK. This directory is designed to be imported into your Eclipse workspace; see Section 3.2, “Setting up Eclipse to view Example Code” in UIMA Overview & SDK Setup for instructions on how to do this. See Section 3.4, “Attaching UIMA Javadocs” in UIMA Overview & SDK Setup for how to attach the UIMA Javadocs to the jar files.

Also you may wish to refer to the UIMA SDK Javadocs located in the docs/api directory.

Note: In Eclipse 3.1, if you highlight a UIMA class or method defined in the UIMA SDK Javadocs, you can conveniently have Eclipse open the corresponding Javadoc for that class or method in a browser, by pressing Shift + F2.

Note: If you downloaded the source distribution for UIMA, you can attach that as well to the library Jar files; for information on how to do this, see Chapter 1, Javadocs in UIMA References.

The example annotator that we are going to walk through will detect room numbers for rooms where the room numbering scheme follows some simple conventions.
In our example, there are two kinds of patterns we want to find; here are some examples, together with their corresponding regular expression patterns:

Yorktown patterns: 20-001, 31-206, 04-123 (Regular Expression Pattern: ##-[0-2]##)
Hawthorne patterns: GN-K35, 1S-L07, 4N-B21 (Regular Expression Pattern: [G1-4][NS]-[A-Z]##)

There are several steps to develop and test a simple UIMA annotator.

1. Define the CAS types that the annotator will use.
2. Generate the Java classes for these types.
3. Write the actual annotator Java code.
4. Create the Analysis Engine descriptor.
5. Test the annotator.

These steps are discussed in the next sections.

1.1.1. Defining Types

The first step in developing an annotator is to define the CAS Feature Structure types that it creates. This is done in an XML file called a Type System Descriptor. UIMA defines basic primitive types such as Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive types. UIMA also defines the built-in types TOP, which is the root of the type system, analogous to Object in Java; FSArray, which is an array of Feature Structures (i.e. an array of instances of TOP); and Annotation, which we will discuss in more detail in this section.

UIMA includes an Eclipse plug-in that will help you edit Type System Descriptors, so if you are using Eclipse you will not need to worry about the details of the XML syntax. See Chapter 3, Setting up the Eclipse IDE to work with UIMA in UIMA Overview & SDK Setup for instructions on setting up Eclipse and installing the plugin.

The Type System Descriptor for our annotator is located in the file descriptors/tutorial/ex1/TutorialTypeSystem.xml.
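As a small aside before continuing with the type system, the two room-number conventions listed in Section 1.1 can be tried out in plain Java. This is only a sketch using java.util.regex, not the tutorial's annotator code: it assumes each # in the patterns stands for a single digit, adds \b word boundaries so that matches do not start or end inside longer tokens, and uses a hypothetical class name and an invented sample sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; the class and method names are not part of the UIMA SDK. */
class RoomNumberPatternSketch {
    // "##-[0-2]##", reading each '#' as one digit (assumption); \b is an added safeguard.
    static final Pattern YORKTOWN = Pattern.compile("\\b\\d\\d-[0-2]\\d\\d\\b");
    // "[G1-4][NS]-[A-Z]##", with the same reading of '#'.
    static final Pattern HAWTHORNE = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

    /** Returns one "label begin end text" entry for every match of the pattern. */
    static List<String> find(String label, Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            // start() and end() are character offsets; end() is exclusive.
            hits.add(label + " " + m.start() + " " + m.end() + " " + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample sentence containing one room number of each kind.
        String text = "Meet in 20-001, then move to GN-K35 at noon.";
        for (String hit : find("Yorktown", YORKTOWN, text)) {
            System.out.println(hit);
        }
        for (String hit : find("Hawthorne", HAWTHORNE, text)) {
            System.out.println(hit);
        }
    }
}
```

The begin and end offsets reported for each match are the kind of span information an annotation records, which is why the type defined in this section is built on the Annotation type.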
TutorialTypeSystem.xml, like all other examples, is located in the examples directory of the installation of the UIMA SDK, which can be imported into an Eclipse project for your convenience, as described in Section 3.2, “Setting up Eclipse to view Example Code” in UIMA Overview & SDK Setup.

In Eclipse, expand the uimaj-examples project in the Package Explorer view, and browse to the file descriptors/tutorial/ex1/TutorialTypeSystem.xml. Right-click on the file in the navigator and select Open With → Component Descriptor Editor. Once the editor opens, click on the “Type System” tab at the bottom of the editor window. You should see a view such as the following:

Our annotator will need only one type – org.apache.uima.tutorial.RoomNumber. (We use the same namespace conventions as are used for Java classes.) Just as in Java, types have supertypes. The supertype is listed in the second column of the left table. In this case our RoomNumber annotation extends from the built-in type uima.tcas.Annotation.

Descriptions can be included with types and features. In this example, there is a description associated with the building feature. To see it, hover the mouse over the feature.

The bottom tab labeled “Source” will show you the XML source file associated with this descriptor.

The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into. The sofa feature can be ignored for now since we assume in this tutorial that the CAS contains only one subject of analysis (document).

Our RoomNumber type will inherit these three features from uima.tcas.Annotation, its supertype; they are not visible in this view because inherited features are not shown. One additional feature, building, is declared.
It takes a String as its value. Instead of String,
