Meta-Knowledge Annotation of Bio-Events
Annotation Guidelines

Paul Thompson, Raheel Nawaz, John McNaught and Sophia Ananiadou
School of Computer Science, University of Manchester, UK
{paul.thompson, john.mcnaught, sophia.ananiadou}@manchester.ac.uk
[email protected]

Contents

1 Introduction and Background
  1.1 Background to the Task – Searching for Relevant Information
    1.1.1 Keyword Searching and its Problems
    1.1.2 Events and Event-Based Searching
  1.2 Need for Meta-Knowledge Annotation
    1.2.1 Meta-Knowledge Examples
2 The Annotation Scheme
  2.1 Knowledge Type
    2.1.1 Investigation
    2.1.2 Analysis
    2.1.3 Observation
    2.1.4 Method
    2.1.5 Fact
    2.1.6 Other
  2.2 Certainty Level
    2.2.1 L3
    2.2.2 L2
    2.2.3 L1
  2.3 Polarity
    2.3.1 Positive
    2.3.2 Negative
  2.4 Manner
    2.4.1 High
    2.4.2 Low
    2.4.3 Neutral
  2.5 Source
    2.5.1 Current
    2.5.2 Other
3 Hypothetical Examples
4 Annotation Task
  4.1 What Annotation is Already There?
    4.1.1 Named Entity Annotations
    4.1.2 Event Annotations
  4.2 What to Annotate
    4.2.1 Sequence of annotation
    4.2.2 Annotating Clue Phrases
5 Annotation Environment
  5.1 Introduction to X-Conc
    5.1.1 Getting Started
    5.1.2 Importing Annotation Projects
    5.1.3 Getting Ready to Annotate
  5.2 How to annotate an Event with X-Conc: A stepwise (Illustrated) Guide
    5.2.1 Existing information about events
    5.2.2 Annotating Meta-Knowledge Dimension Values
    5.2.3 Editing Meta-Knowledge Dimension Values
    5.2.4 Annotating Clue Words/Phrases
  5.3 X-Conc Tips, Pitfalls and Common Sources of Error
    5.3.1 Ensuring that the correct annotation is selected
    5.3.2 Deleting/changing text span annotations
    5.3.3 Words and Phrases that are Clues for Multiple Meta-Knowledge Annotations
6 Annotation Reference 1: Sequence, Clues and Implications
7 Annotation Reference 2 – List of Typical Clues

1 Introduction and Background

If a user wishes to search for relevant information located within biomedical documents, the usual method is to enter keywords into a search engine. However, such searches normally return a large number of documents, many of which are likely to be irrelevant. Assume that the user wishes to find instances of positive regulations involving the protein narL gene product. They may enter the search terms “narL gene product” and activate, since instances of positive regulations are often described using the verb activate. Although their goal is to find documents where these search terms are related to each other in a specific way, normal search engines do not take account of relationships between search terms, and may even return documents where the two search terms are each located in a separate sentence.

Text mining systems help to cut down on the amount of time that users have to spend sifting through irrelevant documents. They do this by providing the user with the means to formulate more structured queries, which ensure that only those documents containing the required type of knowledge are returned by the search. Using a text mining system, the user can specify that they wish to find all instances of positive regulations where the narL gene product is the instigator of the regulation. It is not necessary to worry about exactly how the regulation is expressed in the text, e.g., which verb is used.
Although text mining systems providing functionality such as the above have already been developed, what they often lack is a means to distinguish between definite facts and other types of interpretations. For example, a text mining system may retrieve the following fact in response to the query above:

(S1) The narL gene product activates the nitrate reductase operon

Sentence (S1) can fairly certainly be interpreted as describing a definite fact. However, compare this to sentence (S2):

(S2) Our results suggest that the narL gene product activates the nitrate reductase operon

In (S2), the first part of the sentence projects a rather different interpretation onto the information described by the verb activates, i.e., it is a somewhat tentative interpretation/analysis of results, which should certainly not be interpreted as a definite fact.

The ability to distinguish between different interpretations of information can be important, e.g., a biologist may want to search a collection of documents to isolate descriptions of new knowledge (e.g., experimental observations and confident analyses of results) from other types of knowledge (e.g., descriptions of well-established knowledge, hypotheses, etc.). This could be useful, for example, in maintaining an up-to-date database of biological interactions. If the isolation of new knowledge from other types of knowledge can be carried out automatically, this can potentially save the user a large amount of time.

In order to produce systems that can distinguish different interpretations of information, we need to undertake a task called annotation. This involves reading texts and identifying and marking (annotating) the different ways in which information relating to the interpretation of knowledge (which we term meta-knowledge) can be expressed in texts. The text mining system can then learn to generalize from the annotated examples (using a computer algorithm), in order to be able to assign interpretation information to previously unseen examples. This annotation process is the subject of this document.

1.1 Background to the Task – Searching for Relevant Information

Complex, structured queries such as those introduced above must be matched against structured representations of the biological knowledge that occurs in documents. Text mining systems need to be able to analyse texts in order to locate this biological knowledge and produce structured representations from the unstructured text. These structured representations of knowledge are called events.

A number of collections of documents (called corpora) contain event annotations. These have been produced by domain experts, in order to allow text mining systems to learn how to recognise relevant events within texts. The meta-knowledge annotation introduced above will be carried out for individual events within these event-annotated corpora. This will provide the necessary information to train systems which not only recognise events, but can also determine automatically how those events should be interpreted.

In this section, we first look more closely at why events and event-based searching are needed, by examining the more usual keyword searches and highlighting their pitfalls. We then move on to look at an example of an event, and at how searching using events can be more powerful and can retrieve more focussed results than are possible using keyword searches.

1.1.1 Keyword Searching and its Problems

It is often necessary for biologists to search the literature for relevant information. For example, a particular user may be interested in discovering the types of things that are positively regulated by a particular protein, e.g. the narL gene product.
A sentence such as (S1) would provide the type of information that is sought:

(S1) The narL gene product activates the nitrate reductase operon

In other words, one type of sentence that would help the user to locate the information they require would be one in which The narL gene product is the grammatical subject of a verb which describes a positive regulation (such as activate). In such a sentence, the grammatical object of the verb (i.e., the nitrate reductase operon in the above example) will provide the information that is sought.

As mentioned above, using a search engine such as Google or PubMed would involve entering keywords and phrases such as “narL gene product” and “activate”. Although a search carried out using these terms is highly likely to retrieve relevant documents, it is just as likely to retrieve a large number of documents that are not relevant.

Keyword searches such as the above can be problematic for a number of reasons, and can retrieve many irrelevant documents as well as relevant ones. For example:

• Searching for The narL gene product and activate as separate search terms does not guarantee that they will be grammatically related to each other in the text in the way specified above. The search terms may not even occur within the same sentence.
• Searching using a single quoted search term, e.g., “The narL gene product activates”, to ensure that the verb occurs next to the protein in the text, is also not sufficient. The set of documents returned by such a query is likely to be smaller and more relevant than if using separate search terms. However, many relevant documents could also be missed, due to the large number of potential variations in the way that the positive regulation can be expressed in text. Some similar phrasings of sentence (S1) would include “The narL gene product is known to activate the nitrate reductase operon.”, “The narL gene product rapidly activates the nitrate reductase operon” and “The nitrate reductase operon is activated by the narL gene product”.
• Positive regulation events may be described by a number of different verbs and nouns other than activate, e.g. increase, affect, effect.

In short, retrieving all relevant documents using simple keyword searches can be rather time consuming. It will often require a number of separate searches to be carried out, and much sifting of the documents returned in order to identify those that are relevant to the query.

1.1.2 Events and Event-Based Searching

Text mining technology can help greatly in searching for information, both by giving extra power to the searching mechanism, thus reducing the number of separate searches that have to be carried out, and by increasing the relevance of the results that are returned by the search. Unlike traditional search engines, text mining systems do not simply view documents as sequences of words, but rather they try to structure this information automatically, and try to find relationships between words and phrases within sentences. These structures are called events and the automatic process is called event extraction. A possible structured representation of the event described in sentence (S1) would be the following:

    EVENT_TYPE: Positive_Regulation
    EVENT_TRIGGER: activates
    CAUSE: The narL gene product (PROTEIN)
    THEME: the nitrate reductase operon (OPERON)

The main features of this representation are as follows:

• EVENT_TRIGGER – a word or phrase around which the event is “organized” in the text. This is often a verb (in this case activates) or nominalized verb (a noun with a verb-like meaning, such as transcription or activation).
• EVENT_TYPE – the event is assigned a type from a fixed set of possible values that characterise different types of events in biomedical texts. The event type abstracts away from the actual verb used to describe the event in the text.
• Event participants – each event has one or more participants. These are generally entities (e.g. genes, proteins, organisms, etc.) that play a part in the description of the event. Each participant is separately identified and assigned the following information:
  - Semantic role – a label that characterizes the contribution of the participant towards the description of the event. The labels used are rather general, as they are intended to be applicable to all events in biomedical texts. The following roles are used in the description above.
      CAUSE – participant responsible for the event occurring
      THEME – participant affected by or during the event
  - Named Entity (NE) type – a label that characterizes the type of biological entity that the event participant represents (e.g. PROTEIN). Again, these types are chosen from a fixed set of values.

The automatic extraction of such events from texts allows searches to be carried out on these structures themselves, rather than using keyword searches on the unstructured text. The event structure abstracts from the exact wording in the text, meaning that searches over events can specify the following:

• Event types (e.g. Negative_regulation, Binding) instead of precise verbs or nominalised verbs used to describe the event
• Restrictions on the event participants in terms of:
  - Semantic roles specified by the event (e.g., CAUSE, THEME)
  - Values of particular roles, which could be specified as either:
      Keywords when searching for specific values (e.g., narL gene product)
      NE types for a more general search (e.g. events where the CAUSE is any entity of type PROTEIN)

Thus, the user has a choice about how general or specific to make their query. NE and event types are often arranged into a hierarchy, giving the user even more control over how general or specific their search will be.

As event-based searching allows users to be more precise about the type of information they are looking for, the set of results is better aligned with the user's requirements, i.e., the results are more focussed, and contain fewer irrelevant documents than simple keyword searches. The results are also more concise than those returned by a traditional search engine, showing only the relevant events, or the sentences from the documents in which the relevant events are contained, rather than complete documents.

In more complex sentences, it is possible for multiple events to be present, and it is also possible for the participant of a particular event to be another event. Consider example (S3).

(S3) We found that Y activates the expression of X

Here, the “main” event in the sentence, i.e., the one which is triggered by the verb activates, has a similar structure to the event in sentence (S1), except that the THEME of the event (i.e. the expression of X) is not a simple entity, so how do we deal with it?

    EVENT_TYPE: Positive_Regulation
    EVENT_TRIGGER: activates
    CAUSE: Y
    THEME: ?

We actually treat this THEME as being a separate event, as it can be seen as having its own structure, with the type GENE_EXPRESSION and the THEME of X. Note that it is not necessary for both CAUSE and THEME to be specified for all events. To deal with the fact that this second event is a participant of the first, we assign the unique identifiers E1 and E2 to the events. Figure 1 shows the full representation of these two events.
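The nested representation of (S3), and the kind of event-based query described in this section, can be made concrete with a small data-structure sketch. The Python below is illustrative only: the class names Entity and Event, the helper find_events and the identifier E_S1 are hypothetical and do not belong to any annotation tool or corpus format. It assumes that E1 is the expression event and E2 is the positive regulation event that takes E1 as its THEME.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Entity:
    """A participant that is a plain entity (hypothetical class name)."""
    text: str
    ne_type: Optional[str] = None  # e.g. "PROTEIN"; None if no NE type is given


@dataclass(frozen=True)
class Event:
    """An event whose participants may be entities or other events."""
    event_id: str
    event_type: str
    trigger: str
    cause: Optional[Union[Entity, "Event"]] = None
    theme: Optional[Union[Entity, "Event"]] = None


def find_events(
    events: List[Event],
    event_type: Optional[str] = None,
    cause_text: Optional[str] = None,
    cause_ne_type: Optional[str] = None,
) -> List[Event]:
    """Return the events that satisfy every restriction that was supplied."""
    matches = []
    for ev in events:
        if event_type is not None and ev.event_type != event_type:
            continue
        cause = ev.cause
        if cause_text is not None:
            # Keyword restriction on the CAUSE role (case-insensitive substring).
            if not isinstance(cause, Entity) or cause_text.lower() not in cause.text.lower():
                continue
        if cause_ne_type is not None:
            # NE-type restriction on the CAUSE role.
            if not isinstance(cause, Entity) or cause.ne_type != cause_ne_type:
                continue
        matches.append(ev)
    return matches


# Event from sentence (S1); the identifier "E_S1" is made up for this sketch.
s1 = Event(
    "E_S1", "Positive_Regulation", "activates",
    cause=Entity("The narL gene product", "PROTEIN"),
    theme=Entity("the nitrate reductase operon", "OPERON"),
)

# Events from sentence (S3): E1 is nested inside E2 as its THEME (assumed numbering).
e1 = Event("E1", "GENE_EXPRESSION", "expression", theme=Entity("X"))
e2 = Event("E2", "Positive_Regulation", "activates", cause=Entity("Y"), theme=e1)

events = [s1, e1, e2]

# Search by event type only: the verb used in the text does not matter.
print([ev.event_id for ev in find_events(events, event_type="Positive_Regulation")])
# prints ['E_S1', 'E2']

# Narrow the search with a keyword for the CAUSE role.
print([ev.event_id for ev in find_events(
    events, event_type="Positive_Regulation", cause_text="narL gene product")])
# prints ['E_S1']
```

Searching by event type alone returns both regulation events regardless of wording, while adding a CAUSE keyword or an NE type narrows the result. This mirrors the choice between general and specific queries described above.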
Using this notation, the biological knowledge contained in a document can be represented as a set of events, some of which will be “nested” within each other. We refer to E2 as a primary event, and E1 as a secondary event. E2 conveys the main information, whilst E1 can be seen as providing supporting information – it is not a complete or “interesting” piece of knowledge in itself. It is often (but not exclusively) the case that primary events have event triggers that are verbs, whilst secondary events have triggers that are a special type of noun with a verb-like meaning called nominalised verbs. The noun expression is an example of one of these, with a meaning similar to the verb express. Other examples would include transcription (from the verb transcribe) and regulation (from the verb regulate).

Figure 1 – Event Representation Example

1.2 Need for Meta-Knowledge Annotation

Text mining systems are normally trained to recognise events by learning from annotated examples. That is to say, a collection of documents (called a corpus, plural corpora) is annotated with events by human domain experts. The event annotation process often involves:

• Locating the event trigger
• Assigning a type to the event
• Identifying the participants of the event
• Assigning roles and NE types to these participants

In the biomedical field, a number of such annotated corpora already exist, making it possible to train systems to recognize events and their participants. However, information about the interpretation of the events (i.e., meta-knowledge) is often missing from the annotation, or it is not dealt with in a satisfactory way.

Some examples of meta-knowledge that we consider to be important include the following:

• Is the event negated?
• Is the event stated with complete certainty, or is there some degree of uncertainty conveyed?
• Does the event describe well-established knowledge or new knowledge? New knowledge may correspond to direct observations, or to analyses made by the authors based on experimental results.
• What is the intensity of the event? (e.g. strong or rapid vs. weak or slow)

A text mining system that can distinguish between these different types of interpretations can clearly be useful to users. For example, positive and negative events have completely different interpretations. Likewise, it would be useful to present to the user some indication of the reliability of the event, e.g. events explicitly marked as possibly true need to be distinguished from those events which are known to be definite. In a similar way, analyses based on results are less reliable than direct observations. The ability to distinguish between new and well-established knowledge may be useful in applications such as curating a database of known protein interactions.

In order to allow precise meta-knowledge to be recognized at the level of events, the annotation task described in this document will identify and assign different types of meta-knowledge to each individual event in a document.

1.2.1 Meta-Knowledge Examples

To make the ideas of meta-knowledge introduced above more concrete, let us consider eight sample sentences, the majority of which contain two basic events:

1) A positive regulation event where Y is the CAUSE, and the expression event described in 2) is the THEME
2) An event describing a gene expression, where X is the THEME

Note that, in most cases, 1) is the primary event in the sentence, whilst 2) is the secondary event. It is normally the case that most meta-knowledge information expressed in the sentence will apply to the primary event. Often there is no information that allows a specific interpretation to be applied to a secondary event. This is not exclusively the case, although here we concentrate mainly on the interpretations of the primary events in the sentences.
The sample sentences are as follows:

(S3) We found that Y activates the expression of X
(S4) We examined the effect of Y on expression of X
(S5) These results suggest that Y has no effect on expression of X
(S6) Y is known to increase expression of X
(S7) Addition of Y slightly increased the expression of X
(S8) These results suggest that Y might affect the expression of X
(S9) Significant expression of X was observed
(S10) Previous studies have shown that Y activates the expression of X

The trigger words for the events are underlined in each of the examples. The expression event, which occurs in all sentences, is always indicated by the nominalised verb expression. However, the positive regulation event is expressed in a number of different ways, namely using the verbs activate, increase and affect, or the nominalised verb effect. The positive regulation event occurs in all sentences, with the exception of (S9).

The emboldened words and phrases in the examples below help to show that the way in which the events should be interpreted can vary considerably. However, current text mining systems will normally treat the events extracted from all the above sentences in an identical way, thus missing important or even vital details about the event. Most of the emboldened words affect the interpretation of the positive regulation event, which is the main event in the sentence. However, in (S9) the interpretation of the expression event is altered.

In sentence (S3) above, the presence of the word found shows explicitly that the positive regulation event is backed by evidence, i.e. it is an experimental observation. The word we shows that it is very likely that the event was observed by the authors of the paper as part of the study being described, which would mean that it could be considered as “new” knowledge. No explicit information is specified for the secondary expression event, although we also consider this to be an observation.
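The role of clue words such as found and we can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the two clue sets contain just the words discussed for (S3) and the word shown from (S10), the function name interpret and the dictionary keys are hypothetical, and simple word matching of this kind is not the annotation procedure itself, which is carried out manually.

```python
# Clue words taken from the discussion of (S3) and (S10); deliberately tiny.
OBSERVATION_CLUES = {"found", "shown"}   # event is backed by experimental evidence
CURRENT_STUDY_CLUES = {"we"}             # event is attributed to the paper's authors


def interpret(sentence: str) -> dict:
    """List which clue words from each (hypothetical) set occur in the sentence."""
    words = {w.strip(".,;:()").lower() for w in sentence.split()}
    return {
        "observation_clues": sorted(words & OBSERVATION_CLUES),
        "current_study_clues": sorted(words & CURRENT_STUDY_CLUES),
    }


s3 = "We found that Y activates the expression of X"
s6 = "Y is known to increase expression of X"

print(interpret(s3))
# prints {'observation_clues': ['found'], 'current_study_clues': ['we']}
print(interpret(s6))
# prints {'observation_clues': [], 'current_study_clues': []}
```

An empty list only means that none of the listed clue words was present; it says nothing about how the event should actually be interpreted.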
The interpretation of the positive regulation event in (S10) is very similar to (S3). The presence of the word shown is again an explicit indication that the positive regulation event is an experimental outcome. However, the use of Previous studies at the start of