Jian SU (苏俭)

Unit Head, Natural Language Processing (NLP)
Co-Director, Baidu I2R Research Centre (BIRC)
A*STAR Institute for Infocomm Research (I2R)

Selected Collaborations / Deployments

  1. Co-Director, Baidu I2R Research Centre (BIRC), 2012 - 2023
    BIRC, a collaboration between Baidu and A*STAR I2R, is Baidu’s first overseas joint laboratory. Since the research agreement was signed in March 2012, the laboratory has made significant progress in cutting-edge research on natural language processing, speech processing and robotics.

    • The world's first voice biometrics smartphone, the Lenovo A586, was launched in 2012 using I2R Voice Print Technology;

    • Music Search technology, “听歌识曲” (Ting Ge Shi Qu, which literally translates to “hear a tune and know the song”), was released through the Baidu Music App in 2013 with a higher success rate than other systems on the market;

    • Baidu's Online Machine Translation released Thai-English translation services in 2013 using BIRC’s language resources, outperforming the comparison systems by a large margin;

    • Thai Word Segmentation was integrated in Hao 123 Thai Browse in 2014;

    • Entity Linking technology, which automatically links related entities such as companies and people in web pages to Knowledge Bases, has been used for web search (processing a huge number of pages across the whole web) and various search applications since 2014;

    • Thai & Vietnamese Word Segmentation, Part of Speech Tagging and Named Entity Recognition (NER) were incorporated in Baidu Translation in 2015.

    • Far-field Speech Recognition technology was embedded in DuRobot in 2016;

    • Sentiment analysis technology was deployed in 2016 for a batch of topics across multiple Baidu platforms, including Duer (AI-powered personal assistant), mobile search (with millions of pageviews per day soon after the release) and Nuomi (Online2Offline app).

    • Entity relation extraction technology was deployed in Duer in 2016.

    • Thai and Vietnamese word segmentation, Part of speech tagging and named entity recognition were released through Baidu MT in 2015.

    • Sentiment Analysis technology has been used to report online public opinion on public figures (with millions of pageviews per day soon after the release), as well as in 2 operators, Sentiment Classification and Comment Opinion Extraction (with millions of API calls per day soon after the release), which were published on the Baidu AI Open Platform in 2017.

    • Benefactive / malefactive IT high-tech news event detection was released through the Baidu AI Open Platform in 2018.

    • Baidu's Music conversation NER, leveraging a semi-supervised learning framework from BIRC, was released through Baidu's Unit Platform in 2018.

  2. Principal Investigator, Semantic and Sentiment Analysis of User Generated Text, 2016-2019
    This project will provide a platform for semantic and sentiment analysis of noisy Singlish user-generated content (UGC) from various online sources beyond social media, covering Singlish phenomena in addition to the typical UGC disorder.

  3. Principal Investigator, Text and Speech Information Extraction, 2014-2017
    A 3-year project funded by industry.

  4. Principal Investigator, Entity Linking for Health Text Information Aggregation, 2011-2012
    A project funded by the MSRA eHealth Theme Program.
    Project Members: Wei Zhang, Yanchuan Sim, Bin Chen, WenTing Wang, Zhiqiang Toh
    Collaboration partners from MSRA: Chin-Yew Lin, Yunbao Chao

  5. Principal Investigator, High Performance Entity Tracking System, 2007-2009
    To develop a high-performance Named Entity Recognition and Co-reference Resolution engine for text mining systems used in intelligence gathering.
    Project Members: Xiaofeng Yang, Upali Sathyajith Kohomban, Lang Jun

  6. Principal Investigator (I2R, Singapore), Supervisory Committee Member, Project Management Committee Member, Work Package Leader, EU strategic research project: BootStrep (Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project), 2006 – 2009
    This project is funded under the EC's 6th Framework Programme. The project consortium consists of 6 EU parties and the I2R team. My team provides co-reference resolution and other NLP tools as part of the human language technology infrastructure for bio-knowledge extraction and bio-ontology acquisition. I further lead the final work package, which validates the bio-lexicon and bio-ontology generated in the first two years through information extraction, information retrieval and multilingual access tasks. This work package involves 5 teams including the UK National Centre for Text Mining, the European Bioinformatics Institute, Jena Univ. (Germany), Freiburg Univ. (Germany), Université de Rennes (France) and I2R.
    Project members (I2R, Singapore): Xiao Mang Shou, Yang Xiao Feng, Long Qiu, Upali Sathyajith Kohomban, Dilip Kumar Limbu, Fonlin Lai, Stanley Wai Keong Yong, Jie Zhang

  7. Senior Member, Exploiting Lexical & Encyclopedic Resources for Entity Disambiguation, JHU Summer Workshop, 2007
    This is a collaborative project sponsored by NSF (US) and participating organizations, in which lead researchers from Uni. of Essex, Uni. of Trento, EML Research, MITRE, UMass, DoD, JHU Center of Excellence, FBI-IRST and I2R worked together with selected students from Columbia Uni., UCLA, UMass and JHU. The project aims to advance the technology for both intra-document coreference resolution and cross-document entity disambiguation. From my team, Xiao Feng Yang participated in the intra-document coreference resolution studies, contributing to the DART coreference resolution framework, especially the tree kernel resolution, while Stanley Wai Keong Yong and I worked on relation extraction for cross-document entity disambiguation.

  8. Principal Investigator, I2R-SOC NUS joint project "Discourse, Entailment and Classification", 2007 – 2009
    Principal Investigator (I2R): Jian Su, Principal Investigator (SOC NUS): Chew Lim Tan
    Project members: Man Lan, Xiaofeng Yang, Wenting Wang, Bin Chen, Wei Zhang, Lang Jun

  9. Co-Principal Investigator, "A Personalized and Adaptive Literature Curation System for the Biomedical Science", 2005 – 2008
    This project, funded by the National Grid Office, is to build a grid-based literature curation system together with the National Cancer Centre, the Genome Institute of Singapore, the National University of Singapore and the Bioinformatics Institute.
    Principal Investigator: Patrick Tan, Project Co-PIs: Lim Soon Wang (SOC NUS), See Kiong Ng (I2R), Jian Su (I2R), Tin Wee Tan (BioChemistry Dept, NUS), Yun Ping Lim (BII)

  10. Principal Investigator, I2R-Tokyo University joint project "MedCo Corpus Annotation", 2003 – 2007
    This project is to annotate co-reference information in MedLine abstracts (GENIA Collection) and full biology papers in the same domain. It is a joint project of the Institute for Infocomm Research (I2R) team, Singapore and Tsujii Laboratory, Tokyo University. Tsujii Lab provides the funding support and the biology validation of the linguistic annotation done by the I2R team. Dr. Tateisi Yuka from Tsujii's Lab coordinated the biology validation of the abstract annotation with 5 biology Master and PhD students from Tokyo U. Dr. Jin-Dong Kim from Tsujii’s Lab coordinated the full paper annotation. With additional efforts after the project, the annotated corpus totals 1999 MedLine abstracts and 43 full papers. As the largest co-reference annotation corpus of biology literature, it is an important resource for information extraction and other text-mining applications. The coreference links of genes or proteins from the abstract portion were further polished by the BioNLP Shared Task 2011 organizers for the supporting task: Protein/Gene Coreference Task.
    Annotation Scheme Designer: Hong Hua Qing (with inputs from Jian Su, Xiao Feng Yang and Guo Dong Zhou)
    Annotators: Lai Khar Chong, Zhen Zhen Fan, Poh Khim Yeo, Hua Qing Hong, Peishan Ong Jasmine, Wei Chu Heng
    Programming Support: Jie Zhang, Bin Chen, Xiao Feng Yang

  11. Principal Investigator, I2R-SOC NUS joint project "Information Extraction on Biology Literature", July 2003 - June 2007
    This project is to develop information extraction technologies and to build information management applications for biology literature. 9 postgraduate / PhD students and 2 research fellows were trained in this project, alongside other achievements including a number of publications and benchmark competitions.
    Principal Investigator (I2R): Jian Su, Principal Investigator (SOC NUS): Chew Lim Tan
    Project members: Guodong Zhou, Min Zhang, Xiaofeng Yang, Upali Kohomban, Lan Man, Huaqing Hong, Stanley Yong, Jie Zhang, Dan Sheng, Xiao Juan, Chen Bin, Wenting Wang, Wen Gang Ji, Dan Mei Wang

  12. Technical Advisor, Material Safety Document Sheet Knowledge Workbench project, 2003 – 2004
    The project is to build a system that extracts material safety information from document sheets and checks its validity against international standards. The system was delivered to a government organization. Instead of only being able to randomly check 5% of a large volume of data sheets, government officers could check 100% of the data sheets with the help of the system. It won The Enterprise Challenge Award 2003.
    Project Manager: Hwee Leng Ong, Project Coordinator: Ai Ti Aw, Project Members: Fon Lin Lai, Jamie Shua Ling Ng, Guo Dong Zhou, Teng Chuan Chua

  13. Project Manager, TextMining CoT funding project, 2004
    The project is to build SDK tools from the engines built in house to make them ready for commercialization. The tools cover information extraction, text clustering, text classification, text retrieval, summarization and term extraction technologies. The project led to a number of licenses to SMEs and 5 Polytechnics, and to good publicity for enabling a company to get up and running and go to market quickly with a new product.
    Project Manager: Jian Su, Project members: Guo Dong Zhou, Jie Zhang, Donghong Ji, Lingpeng Yang, Yu Nie, Fon Lin Lai, Kanagasabai Rajaraman, Jamie Shua Ling Ng, Hwee Leng Ong