Translation Equivalence and Synonymy: Preserving the Synsets in Cross-lingual Wordnets
Olivia O.Y. Kwong
The Chinese University of Hong Kong
oykwong@arts.cuhk.edu.hk
Infrastructure of Princeton WordNet
- Synsets as building blocks
  - Unordered sets of words that “denote the same concept and are interchangeable in many contexts”
  - Synonymy / mutual substitutability
- Nouns, verbs, adjectives, adverbs
  - Adjectives not hierarchically ordered; considered polysemous but of limited use in conveying information
GWC 2018, NTU, Singapore 10 Jan 2018
Wordnets in other languages
- Merge Model
  - Select vocabulary and develop synsets separately and locally
  - Generate equivalence relations to Princeton WordNet (PWN)
- Expand Model
  - Start with PWN vocabulary and synsets
  - Translate synsets into target language using bilingual dictionaries
[Diagram labels: Princeton WordNet; Wordnets in other languages]
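The Expand Model step of translating a synset through a bilingual dictionary can be sketched roughly as below. This is only an illustration: `BILINGUAL_DICT` and `expand_synset` are hypothetical names, and the toy entries are taken from the "nice" and "kind" examples in these slides, not from a real dictionary.

```python
# Toy bilingual dictionary (hypothetical; entries borrowed from the slides' examples).
BILINGUAL_DICT = {
    "kind": ["善良", "仁慈", "和蔼"],
    "nice": ["美好", "和蔼"],
}


def expand_synset(english_members, bilingual_dict):
    """Collect candidate translations for every English member of a synset,
    keeping first-seen order and dropping duplicates."""
    candidates = []
    for word in english_members:
        for translation in bilingual_dict.get(word, []):
            if translation not in candidates:
                candidates.append(translation)
    return candidates


# Candidate target-language synset for a one-member English synset.
target_synset = expand_synset(["kind"], BILINGUAL_DICT)
```

Note that nothing in this procedure checks whether the collected candidates are synonymous with one another, which is the weakness the following slides examine.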
Chinese Wordnets
- Various attempts (Huang et al., 2004; Xu et al., 2008; Huang et al., 2010; Wang and Bond, 2013)
  - (Semi-)automatic identification of translation equivalents with human verification
  - Some limited the number of translation equivalents for a synset, while others intentionally added more entries
- Chinese Open Wordnet (Wang and Bond, 2013)
  - Follows the Expand Model, with detailed guidelines for checking
  - Chinese translations obtained by merging existing data, checked manually, adding new translations from authoritative bilingual dictionaries
  - High coverage but possibly lower accuracy
  - Adjectives: 13.8% of 4,960 core synsets
Potential Blind Spots
nice (pleasant or pleasing or agreeable in nature or appearance)
体贴(的),合意(的),美好(的),和蔼(的),友好(的),令人愉快(的),令人快乐(的),讨人喜欢(的)
- 好: generalness of the concept
  - pleasant / pleasing / agreeable, in nature / appearance ==> ANYTHING!
- 和蔼 --> person
- 美好 --> inanimate object
Potential Blind Spots
kind (having or showing a tender and considerate and helpful nature; used especially of persons and their behavior)
体谅(的),体贴(的),善良(的),仁慈(的),和善(的),宽厚(的),友善(的),好心(的),好心肠(的),亲切(的),温和(的),和蔼(的),宽宏大量(的),友好(的),乐于助人(的)
[Slide annotations: considerate; friendly; helpful]
和蔼 exists in both synsets
--> “nice” and “kind” synonymous?
--> Multiple senses of 和蔼 in most dictionaries?
--> Legitimate to treat it as a translation equivalent for both synsets?
--> 和蔼 and 体贴 synonymous?
--> Still qualify as a synset?
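The overlap between the two synsets can be checked mechanically. The sketch below uses the member lists shown on the "nice" and "kind" slides, with the optional (的) suffix dropped; `shared_members` is a hypothetical helper name.

```python
# Members of the "nice" and "kind" synsets as listed on the slides,
# written without the optional (的) suffix.
nice = ["体贴", "合意", "美好", "和蔼", "友好", "令人愉快", "令人快乐", "讨人喜欢"]
kind = ["体谅", "体贴", "善良", "仁慈", "和善", "宽厚", "友善", "好心", "好心肠",
        "亲切", "温和", "和蔼", "宽宏大量", "友好", "乐于助人"]


def shared_members(first, second):
    """Return the words of `first` that also occur in `second`, in `first`'s order."""
    second_set = set(second)
    return [word for word in first if word in second_set]


shared = shared_members(nice, kind)
```

On these lists 和蔼 is not the only shared member: 体贴 and 友好 also appear in both synsets.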
Two Issues
1. Seriousness of the problem across different parts of speech
   - Nouns and verbs may have more distinct references
   - Fuzziness and subjectivity involved in adjectives
   - Problem expected to be more pronounced among adjectives
2. When covering more meanings with translation equivalents comes at the expense of violating the requirements for synsets, are there better ways to handle such cases?
Nouns < Adjs < Verbs
- Synset sizes: Nouns (1-39 items), Adjs (1-15 items), Verbs (1-13 items)
- Overall tendency: Nouns < Adjs < Verbs
Examples (Nouns)
12896307-n black nightshade, common nightshade, poison-berry, poisonberry, Solanum nigrum (Eurasian herb naturalized in America having white flowers and poisonous hairy foliage and bearing black berries that are sometimes poisonous but sometimes edible)
  老鸦酸浆草, 乌归菜, 野葡萄, 酸浆草, 救儿草, 黑姑娘, 天泡果, 地戎草, 七粒扣, 山海椒, 黑茄, 野茄子, 天泡草, 地泡子, 天天茄, 天茄子, 野辣角, 野海椒, 后红子, 天茄苗儿, 老鸦眼睛草, 水茄, 水苦菜, 野伞子, 天茄菜, 山辣椒, 狗钮子, 苦葵, 苦菜, 野茄菜, 飞天龙, 龙葵, 耳坠菜, 乌疔草, 野辣椒
09823502-n aunt, auntie, aunty (the sister of your father or mother; the wife of your uncle)
  妗, 姑母, 伯母, 姑姑, 老大妈, 阿姨, 妗母, 叔母, 姑妈, 舅母, 姑, 姨妈, 姨, 舅妈, 婶子, 婶婶, 姨母, 婶母
Examples (Adjectives)
hot (extended meanings; especially of psychological heat; marked by intensity or vehemence especially of passion or enthusiasm)
  流行(的), 热切(的), 激烈(的), 热门(的), 才发行(的), 急躁(的), 销路好(的), 刚出版(的), 轰动一时(的), 最新(的), 紧缺(的), 激动(的), 狂热(的), 热烈(的), 时新(的)
[Slide annotations: popular; impatient; hot topic; temper; new book; love affair; argument; …]
Examples (Verbs)
01215137-v arrest, pick up, nail, apprehend, nab, collar, cop (take into custody)
  捕捉, 捉到, 捕获, 逮捕, 拘留, 拘押, 拘捕, 抓住, 抓获, 当场逮捕, 擒获, 逮住
[Slide annotations: Too general; Over-specific]
Adjectives and Non-synsets
- Examined the 200 largest adjective synsets from the Chinese Open Wordnet (COW)
- At most 27 of the 200 contain no phrasal members
  - This shows that bilingual dictionaries tend to provide translated definitions or paraphrases instead of, or in addition to, translation equivalents
  - Compatibility with WordNet structure is questionable
- Possible causes of the non-synsets?
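A count of this kind could be reproduced along the following lines, assuming each member has already been hand-labelled as phrasal or lexicalised. The slides do not say how the labelling was done, so the `PHRASAL` set, the two toy synsets and the function name are all hypothetical; the phrasal labels for 易管教 and 易驾驭 follow the "docile" example in this deck.

```python
from typing import Dict, List, Set

# Hand-labelled phrasal expressions (assumed to come from manual annotation).
PHRASAL: Set[str] = {"易管教", "易驾驭"}

# Two toy synsets; a real run would load the 200 largest adjective synsets.
SYNSETS: Dict[str, List[str]] = {
    "docile": ["驯服", "温顺", "听话", "易管教", "易驾驭"],
    "sharp": ["精明", "机敏"],
}


def count_without_phrasal(synsets: Dict[str, List[str]], phrasal: Set[str]) -> int:
    """Count synsets in which no member is labelled as a phrasal expression."""
    return sum(
        1
        for members in synsets.values()
        if not any(member in phrasal for member in members)
    )


free_of_phrasal = count_without_phrasal(SYNSETS, PHRASAL)
```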
Different Sense Distinctions
00411886-a civilized, civilised (having a high state of culture and development both social and technological)
  文明化(的), 有礼貌(的), 有教养(的), 开化(的), 文明(的), 文雅(的)
01947741-a cultured, polite, civilized, civilised, cultivated, genteel (marked by refinement in taste and manners)
  文雅(的), 有礼貌(的), 优雅(的), 有教养(的), 有礼(的), 文明(的), 有先进文化(的), 有修养(的)
[Slide annotations: More collective sense; ? ? ?; elegant; polite; cultivated; More personal and individual behaviour]
Over-interpretation of Concepts
docile (willing to be taught or led or supervised or directed)
  易管教(的), 驯服(的), 易教育(的), 易驾驭(的), 可教导(的), 容易教(的), 听话(的), 驯良(的), 愿学习(的), 易训练(的), 温顺(的), 顺从(的), 易控制(的)
- Lexicalised: 驯服, 温顺, 听话
- Phrasal: 易管教 (easy to teach), 易驾驭 (easy to control)
- But 愿学习 (willing to learn) == willing to be taught / easy to control ??
Multiple Facets of Concepts
Chinese (of or pertaining to China or its peoples or cultures)
  中国文化(的), 汉, 华, 中文(的), 中国人(的), 汉语(的), 中国话(的), 中国(的), 中
- Pertains to various aspects relating to China, but 中国人 == 中国话 ??
Related but Subtly Different Words
brown, brownish, dark-brown, chocolate-brown (of a color similar to that of wood or earth)
  咖啡色(的), 呈褐色(的), 黑褐色(的), 茶褐色(的), 棕色(的), 褐色(的)
- Different hues and intensities of “brownness”
Contradictory Connotation
sharp, shrewd, astute (marked by practical hardheaded intelligence)
  狡黠(的), 锐利(的), 精明(的), 狡猾(的), 机敏(的), 诡计多端(的), 锋利(的)
[Connotation marks shown on the slide: - + - + -]
Handling Extra-synset Information
- Conceptual and lexical gaps across languages
- Useful information for language learning and translation by humans and machines alike
- Importance and potential use of multiple forms and renditions in a target language
- Value-adding to accommodate them in wordnets in some way
- Basic synset structure should be maintained
1. Lexicalised Items Only
- Unless no lexicalised translation equivalent is available in target language
- Avoid over-interpretation
01251128-a cold (having a low or inadequate temperature or feeling a sensation of coldness or having been made cold by e.g. ice or refrigeration)
  冰,冻,冷,寒,冰冻,冰冷,寒冷,气温低,温度不足,温度没有达到要求
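One way to apply this rule, including its "unless none is available" exception, is sketched below. The slide does not mark which members of the "cold" list are phrasal; treating 气温低, 温度不足 and 温度没有达到要求 as phrasal is an assumption made here for illustration, and `lexicalised_only` is a hypothetical helper.

```python
# Candidate equivalents for 01251128-a "cold", as listed on the slide.
COLD_CANDIDATES = ["冰", "冻", "冷", "寒", "冰冻", "冰冷", "寒冷",
                   "气温低", "温度不足", "温度没有达到要求"]

# Assumed manual labels: these three are treated as phrasal paraphrases.
PHRASAL = {"气温低", "温度不足", "温度没有达到要求"}


def lexicalised_only(candidates, phrasal):
    """Keep lexicalised items; fall back to all candidates only when
    no lexicalised equivalent exists at all."""
    lexical = [c for c in candidates if c not in phrasal]
    return lexical if lexical else list(candidates)


cold_synset = lexicalised_only(COLD_CANDIDATES, PHRASAL)
```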
2. Language-specific Extensions
- Separate layer or class to store non-lexicalised expressions conveying meaning close enough to the original synset
- Should be a language-specific structure, not the core wordnet structure or the Inter-Lingual-Index
- Linked to base concepts
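A minimal data-structure sketch of such a separate layer follows. The class names, field names and the "zh" language tag are hypothetical and do not correspond to any existing wordnet schema; which "cold" expressions count as non-lexicalised is likewise assumed for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoreSynset:
    """Core wordnet entry: lexicalised members only."""
    synset_id: str
    lemmas: List[str]


@dataclass
class LanguageExtension:
    """Language-specific layer holding non-lexicalised expressions,
    linked to the core synset by its identifier."""
    synset_id: str
    language: str
    expressions: List[str] = field(default_factory=list)


cold = CoreSynset("01251128-a", ["冷", "寒冷", "冰冷"])
cold_ext = LanguageExtension(cold.synset_id, "zh", ["气温低", "温度不足"])
```

Keeping the paraphrases in `LanguageExtension` leaves the core synset untouched while still making them retrievable through the shared identifier.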
3. Comparable Specificity
- For very general or highly polysemous adjectives, similarly general equivalents should be included in the corresponding synset
- Collocation-specific equivalents indicating different facets or senses should be captured at a subsuming level
  - If there is no corresponding synset for the specific meaning in PWN, add an extra synset in the target language wordnet linked to the general synset
  - Link specific meanings with corresponding synsets in PWN with similar-to
[Diagram: General: Wise (聪明,聪颖); Smart (聪明,聪颖). Two similar_to links. Specific: sagacious, perspicacious, sapient (睿智); sharp, shrewd, astute (精明,机敏)]
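The general/specific arrangement can be sketched as a small graph. The slide does not make clear which general synset each similar_to link starts from, so the pairing below (wise with sagacious, smart with sharp) is an assumption, as are the keys and function names; links are treated as symmetric in this sketch.

```python
# Synsets keyed by a representative English member (hypothetical keys).
SYNSETS = {
    "wise": ["聪明", "聪颖"],      # general
    "smart": ["聪明", "聪颖"],     # general
    "sagacious": ["睿智"],         # specific
    "sharp": ["精明", "机敏"],     # specific
}

# Assumed pairing of the two similar_to links shown on the slide.
SIMILAR_TO = {("wise", "sagacious"), ("smart", "sharp")}


def neighbours(key):
    """Synsets linked to `key` by similar_to, in either direction."""
    linked = []
    for left, right in sorted(SIMILAR_TO):
        if left == key:
            linked.append(right)
        elif right == key:
            linked.append(left)
    return linked


def equivalents(key, include_similar=False):
    """General equivalents by default; optionally add the specific ones."""
    words = list(SYNSETS[key])
    if include_similar:
        for other in neighbours(key):
            words.extend(w for w in SYNSETS[other] if w not in words)
    return words
```

With this layout the general synset stays small, and the collocation-specific equivalents are reached only when a caller explicitly follows the link.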
4. Utilisation of Pertainym Relation
clever, wise, smart, intelligent, sharp, sagacious, canny …
聪明,聪颖,聪敏,机智,睿智,英明,精明 …
- General / Mentally quick / Able to make wise decisions
- Not equally synonymous
- Same word in too many synsets
- Distorted picture of polysemy
- Pertain to: Human; Decision
5. Ensure logical validity
- Avoid words with contradictory connotation in a synset
- Prudently handle phrasal expressions
  - 喝醉 (drink + drunk) vs 烂醉 (very + drunk)
  - 贫困 (impoverished) vs 极度贫困 (extremely + impoverished)
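A connotation check of this kind could look roughly like the sketch below. The polarity labels are assumed for illustration (the slides do not give a word-by-word labelling), and `POLARITY`, `connotations` and `is_logically_valid` are hypothetical names.

```python
# Assumed connotation labels: "+" for positive, "-" for negative.
POLARITY = {
    "精明": "+",
    "机敏": "+",
    "狡猾": "-",
    "诡计多端": "-",
    "狡黠": "-",
}


def connotations(members, polarity):
    """Set of connotation labels found among the labelled members."""
    return {polarity[m] for m in members if m in polarity}


def is_logically_valid(members, polarity):
    """A synset passes when its labelled members share a single connotation."""
    return len(connotations(members, polarity)) <= 1


mixed = ["精明", "狡猾", "机敏"]
consistent = ["精明", "机敏"]
```

Unlabelled members are ignored rather than rejected, so the check only flags synsets where a contradiction is actually documented.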
Conclusion
- Translation equivalents not necessarily synonymous
  - Could be a problem for building cross-lingual wordnets
- Vulnerability of adjectives, especially the general ones
- Context-dependent equivalents separately linked
- Importance of keeping the theoretical foundation intact