LNCS 11140

Věra Kůrková · Yannis Manolopoulos · Barbara Hammer · Lazaros Iliadis · Ilias Maglogiannis (Eds.)

Artificial Neural Networks and Machine Learning – ICANN 2018 27th International Conference on Artificial Neural Networks Rhodes, Greece, October 4–7, 2018 Proceedings, Part II


Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany


More information about this series at http://www.springer.com/series/7407


Editors
Věra Kůrková, Czech Academy of Sciences, Prague 8, Czech Republic
Yannis Manolopoulos, Open University of Cyprus, Latsia, Cyprus
Barbara Hammer, CITEC, Bielefeld University, Bielefeld, Germany
Lazaros Iliadis, Democritus University of Thrace, Xanthi, Greece
Ilias Maglogiannis, University of Piraeus, Piraeus, Greece

ISSN 0302-9743
ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-01420-9
ISBN 978-3-030-01421-6 (eBook)
https://doi.org/10.1007/978-3-030-01421-6
Library of Congress Control Number: 2018955577
LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Technological advances in artificial intelligence (AI) are leading the rapidly changing world of the twenty-first century. We have already passed from machine learning to deep learning with numerous applications. The contribution of AI so far to the improvement of our quality of life is profound. Major challenges but also risks and threats are here. Brain-inspired computing explores, simulates, and imitates the structure and the function of the human brain, achieving high-performance modeling plus visualization capabilities.

The International Conference on Artificial Neural Networks (ICANN) is the annual flagship conference of the European Neural Network Society (ENNS). It features the main tracks “Brain-Inspired Computing” and “Machine Learning Research,” with strong cross-disciplinary interactions and applications. All research fields dealing with neural networks are present.

The 27th ICANN was held on October 4–7, 2018, at the Aldemar Amilia Mare five-star resort and conference center in Rhodes, Greece. The previous ICANN events were held in Helsinki, Finland (1991), Brighton, UK (1992), Amsterdam, The Netherlands (1993), Sorrento, Italy (1994), Paris, France (1995), Bochum, Germany (1996), Lausanne, Switzerland (1997), Skövde, Sweden (1998), Edinburgh, UK (1999), Como, Italy (2000), Vienna, Austria (2001), Madrid, Spain (2002), Istanbul, Turkey (2003), Budapest, Hungary (2004), Warsaw, Poland (2005), Athens, Greece (2006), Porto, Portugal (2007), Prague, Czech Republic (2008), Limassol, Cyprus (2009), Thessaloniki, Greece (2010), Espoo-Helsinki, Finland (2011), Lausanne, Switzerland (2012), Sofia, Bulgaria (2013), Hamburg, Germany (2014), Barcelona, Spain (2016), and Alghero, Italy (2017).

Following a long-standing tradition, these volumes belong to Springer's Lecture Notes in Computer Science series. They contain the papers that were accepted to be presented orally or as posters during the 27th ICANN conference.
The 27th ICANN Program Committee was delighted by the overwhelming response to the call for papers. All papers were peer-reviewed by at least two, and often three or four, independent academic referees to resolve any conflicts. In total, 360 papers were submitted to the 27th ICANN. Of these, 139 (38.3%) were accepted as full papers for oral presentation of 20 minutes with a maximum length of 10 pages, whereas 28 were accepted as short contributions to be presented orally in 15 minutes and included in the proceedings with 8 pages. Also, 41 papers (11.4%) were accepted as full papers for poster presentation (up to 10 pages long), whereas 11 were accepted as short papers for poster presentation (maximum length of 8 pages).

The accepted papers of the 27th ICANN conference are related to the following thematic topics:

AI and Bioinformatics
Bayesian and Echo State Networks
Brain-Inspired Computing
Chaotic Complex Models
Clustering, Mining, Exploratory Analysis
Coding Architectures
Complex Firing Patterns
Convolutional Neural Networks
Deep Learning (DL)
– DL in Real Time Systems
– DL and Big Data Analytics
– DL and Big Data
– DL and Forensics
– DL and Cybersecurity
– DL and Social Networks
Evolving Systems – Optimization
Extreme Learning Machines
From Neurons to Neuromorphism
From Sensation to Perception
From Single Neurons to Networks
Fuzzy Modeling
Hierarchical ANN
Inference and Recognition
Information and Optimization
Interacting with the Brain
Machine Learning (ML)
– ML for Bio-Medical Systems
– ML and Video-Image Processing
– ML and Forensics
– ML and Cybersecurity
– ML and Social Media
– ML in Engineering
Movement and Motion Detection
Multilayer Perceptrons and Kernel Networks
Natural Language
Object and Face Recognition
Recurrent Neural Networks and Reservoir Computing
Reinforcement Learning
Reservoir Computing
Self-Organizing Maps
Spiking Dynamics/Spiking ANN
Support Vector Machines
Swarm Intelligence and Decision-Making
Text Mining
Theoretical Neural Computation
Time Series and Forecasting
Training and Learning

The authors of submitted papers came from 34 different countries from all over the globe, namely: Belgium, Brazil, Bulgaria, Canada, China, Czech Republic, Cyprus, Egypt, Finland, France, Germany, Greece, India, Iran, Ireland, Israel, Italy, Japan, Luxembourg, The Netherlands, Norway, Oman, Pakistan, Poland, Portugal, Romania, Russia, Slovakia, Spain, Switzerland, Tunisia, Turkey, UK, USA.

Four keynote speakers were invited, and they gave lectures on timely aspects of AI.

We hope that these proceedings will help researchers worldwide to understand and to be aware of timely evolutions in AI and more specifically in artificial neural networks. We believe that they will be of major interest for scientists over the globe and that they will stimulate further research.

October 2018

Věra Kůrková
Yannis Manolopoulos
Barbara Hammer
Lazaros Iliadis
Ilias Maglogiannis

Organization

General Chairs
Věra Kůrková, Czech Academy of Sciences, Czech Republic
Yannis Manolopoulos, Open University of Cyprus, Cyprus

Program Co-chairs
Barbara Hammer, Bielefeld University, Germany
Lazaros Iliadis, Democritus University of Thrace, Greece
Ilias Maglogiannis, University of Piraeus, Greece

Steering Committee
Věra Kůrková (President of ENNS), Czech Academy of Sciences, Czech Republic
Cesare Alippi, Università della Svizzera Italiana, Switzerland
Guillem Antó i Coma, Pompeu Fabra University, Barcelona, Spain
Jeremie Cabessa, Université Paris 2 Panthéon-Assas, France
Wlodzislaw Duch, Nicolaus Copernicus University, Poland
Petia Koprinkova-Hristova, Bulgarian Academy of Sciences, Bulgaria
Jaakko Peltonen, University of Tampere, Finland
Yifat Prut, The Hebrew University, Israel
Bernardete Ribeiro, University of Coimbra, Portugal
Stefano Rovetta, University of Genoa, Italy
Igor Tetko, German Research Center for Environmental Health, Munich, Germany
Alessandro Villa, University of Lausanne, Switzerland
Paco Zamora-Martínez, das-Nano, Spain

Publication Chair
Antonis Papaleonidas, Democritus University of Thrace, Greece

Communication Chair
Paolo Masulli, Technical University of Denmark, Denmark

Program Committee
Najem Abdennour, Higher Institute of Computer Science and Multimedia (ISIMG), Gabes, Tunisia
Tetiana Aksenova, Atomic Energy Commission (CEA), Grenoble, France
Zakhriya Alhassan, Durham University, UK
Tayfun Alpay, University of Hamburg, Germany
Ioannis Anagnostopoulos, University of Thessaly, Greece
Cesar Analide, University of Minho, Portugal
Annushree Bablani, National Institute of Technology Goa, India
Costin Badica, University of Craiova, Romania
Pablo Barros, University of Hamburg, Germany
Adam Barton, University of Ostrava, Czech Republic
Lluís Belanche, Polytechnic University of Catalonia, Spain
Bartlomiej Beliczynski, Warsaw University of Technology, Poland
Kostas Berberidis, University of Patras, Greece
Ege Beyazit, University of Louisiana at Lafayette, USA
Francisco Elanio Bezerra, University Ninth of July, Sao Paolo, Brazil
Varun Bhatt, Indian Institute of Technology, Bombay, India
Marcin Blachnik, Silesian University of Technology, Poland
Sander Bohte, National Research Institute for Mathematics and Computer Science (CWI), The Netherlands
Simone Bonechi, University of Siena, Italy
Farah Bouakrif, University of Jijel, Algeria
Meftah Boudjelal, Mascara University, Algeria
Andreas Bougiouklis, National Technical University of Athens, Greece
Martin Butz, University of Tübingen, Germany
Jeremie Cabessa, Université Paris 2, France
Paulo Vitor Campos Souza, Federal Center for Technological Education of Minas Gerais, Brazil
Angelo Cangelosi, Plymouth University, UK
Yanan Cao, Chinese Academy of Sciences, China
Francisco Carvalho, Federal University of Pernambuco, Brazil
Giovanna Castellano, University of Bari, Italy
Jheymesson Cavalcanti, University of Pernambuco, Brazil
Amit Chaulwar, Technical University Ingolstadt, Germany
Sylvain Chevallier, University of Versailles St. Quentin, France
Stephane Cholet, University of Antilles, Guadeloupe
Mark Collier, Trinity College, Ireland
Jorg Conradt, Technical University of Munich, Germany
Adriana Mihaela Coroiu, Babes-Bolyai University, Romania
Paulo Cortez, University of Minho, Portugal
David Coufal, Czech Academy of Sciences, Czech Republic
Juarez Da Silva, University of Vale do Rio dos Sinos, Brazil
Vilson Luiz Dalle Mole, Federal University of Technology Parana, Brazil
Debasmit Das, Purdue University, USA
Bodhisattva Dash, International Institute of Information Technology, Bhubaneswar, India
Eli David, Bar-Ilan University, Israel
Konstantinos Demertzis, Democritus University of Thrace, Greece

Antreas Dionysiou, University of Cyprus, Cyprus
Sergey Dolenko, Lomonosov Moscow State University, Russia
Xiao Dong, Chinese Academy of Sciences, China
Shirin Dora, University of Amsterdam, The Netherlands
Jose Dorronsoro, Autonomous University of Madrid, Spain
Ziad Doughan, Beirut Arab University, Lebanon
Wlodzislaw Duch, Nicolaus Copernicus University, Poland
Gerrit Ecke, University of Tübingen, Germany
Alexander Efitorov, Lomonosov Moscow State University, Russia
Manfred Eppe, University of Hamburg, Germany
Deniz Erdogmus, Northeastern University, USA
Rodrigo Exterkoetter, LTrace Geophysical Solutions, Florianopolis, Brazil
Yingruo Fan, The University of Hong Kong, SAR China
Maurizio Fiasché, Polytechnic University of Milan, Italy
Lydia Fischer, Honda Research Institute Europe, Germany
Andreas Fischer, University of Fribourg, Germany
Qinbing Fu, University of Lincoln, UK
Ninnart Fuengfusin, Kyushu Institute of Technology, Japan
Madhukar Rao G., Indian Institute of Technology, Dhanbad, India
Mauro Gaggero, National Research Council, Genoa, Italy
Claudio Gallicchio, University of Pisa, Italy
Shuai Gao, University of Science and Technology of China, China
Artur Garcez, City University of London, UK
Michael Garcia Ortiz, Aldebaran Robotics, France
Angelo Genovese, University of Milan, Italy
Christos Georgiadis, University of Macedonia, Thessaloniki, Greece
Alexander Gepperth, HAW Fulda, Germany
Peter Gergeľ, Comenius University in Bratislava, Slovakia
Daniel Gibert, University of Lleida, Spain
Eleonora Giunchiglia, University of Genoa, Italy
Jan Philip Goepfert, Bielefeld University, Germany
George Gravanis, Democritus University of Thrace, Greece
Ingrid Grenet, University of Côte d’Azur, France
Jiri Grim, Czech Academy of Sciences, Czech Republic
Xiaodong Gu, Fudan University, China
Alberto Guillén, University of Granada, Spain
Tatiana Valentine Guy, Czech Academy of Sciences, Czech Republic
Myrianthi Hadjicharalambous, KIOS Research and Innovation Centre of Excellence, Cyprus
Petr Hajek, University of Pardubice, Czech Republic
Xue Han, China University of Geosciences, China
Liping Han, Nanjing University of Information Science and Technology, China
Wang Haotian, National University of Defense Technology, China
Kazuyuki Hara, Nihon University, Japan
Ioannis Hatzilygeroudis, University of Patras, Greece

Stefan Heinrich, University of Hamburg, Germany
Tim Heinz, University of Siegen, Germany
Catalina Hernandez, District University of Bogota, Colombia
Alex Hernández García, University of Osnabrück, Germany
Adrian Horzyk, AGH University of Science and Technology in Krakow, Poland
Wenjun Hou, China Agricultural University, China
Jian Hou, Bohai University, China
Haigen Hu, Zhejiang University of Technology, China
Amir Hussain, University of Stirling, UK
Nantia Iakovidou, King’s College London, UK
Yahaya Isah Shehu, Coventry University, UK
Sylvain Jaume, Saint Peter’s University, Jersey City, USA
Noman Javed, Namal College Mianwali, Pakistan
Maciej Jedynak, University of Grenoble Alpes, France
Qinglin Jia, Peking University, China
Na Jiang, Beihang University, China
Wenbin Jiang, Huazhong University of Science and Technology, China
Zongze Jin, Chinese Academy of Sciences, China
Jacek Kabziński, Lodz University of Technology, Poland
Antonios Kalampakas, American University of the Middle East, Kuwait
Jan Kalina, Czech Academy of Sciences, Czech Republic
Ryotaro Kamimura, Tokai University, Japan
Andreas Kanavos, University of Patras, Greece
Savvas Karatsiolis, University of Cyprus, Cyprus
Kostas Karatzas, Aristotle University of Thessaloniki, Greece
Ioannis Karydis, Ionian University, Greece
Petros Kefalas, University of Sheffield, International Faculty City College, Thessaloniki, Greece
Nadia Masood Khan, University of Engineering and Technology, Peshawar, Pakistan
Gul Muhammad Khan, University of Engineering and Technology, Peshawar, Pakistan
Sophie Klecker, University of Luxembourg, Luxembourg
Taisuke Kobayashi, Nara Institute of Science and Technology, Japan
Mario Koeppen, Kyushu Institute of Technology, Japan
Mikko Kolehmainen, University of Eastern Finland, Finland
Stefanos Kollias, University of Lincoln, UK
Ekaterina Komendantskaya, Heriot-Watt University, UK
Petia Koprinkova-Hristova, Bulgarian Academy of Sciences, Bulgaria
Irena Koprinska, University of Sydney, Australia
Dimitrios Kosmopoulos, University of Patras, Greece
Costas Kotropoulos, Aristotle University of Thessaloniki, Greece
Athanasios Koutras, TEI of Western Greece, Greece
Konstantinos Koutroumbas, National Observatory of Athens, Greece

Giancarlo La Camera, Stony Brook University, USA
Jarkko Lagus, University of Helsinki, Finland
Luis Lamb, Federal University of Rio Grande, Brazil
Ángel Lareo, Autonomous University of Madrid, Spain
René Larisch, Chemnitz University of Technology, Germany
Nikos Laskaris, Aristotle University of Thessaloniki, Greece
Ivano Lauriola, University of Padua, Italy
David Lenz, Justus Liebig University, Giessen, Germany
Florin Leon, Technical University of Iasi, Romania
Guangli Li, Chinese Academy of Sciences, China
Yang Li, Peking University, China
Hongyu Li, Zhongan Technology, Shanghai, China
Diego Ettore Liberati, National Research Council, Rome, Italy
Aristidis Likas, University of Ioannina, Greece
Annika Lindh, Dublin Institute of Technology, Ireland
Junyu Liu, Huiying Medical Technology, China
Ji Liu, Beihang University, China
Doina Logofatu, Frankfurt University of Applied Sciences, Germany
Vilson Luiz Dalle Mole, Federal University of Technology – Paraná (UTFPR), Campus Toledo, Spain
Sven Magg, University of Hamburg, Germany
Ilias Maglogiannis, University of Piraeus, Greece
George Magoulas, Birkbeck College, London, UK
Christos Makris, University of Patras, Greece
Kleanthis Malialis, University of Cyprus, Cyprus
Kristína Malinovská, Comenius University in Bratislava, Slovakia
Konstantinos Margaritis, University of Macedonia, Thessaloniki, Greece
Thomas Martinetz, University of Lübeck, Germany
Gonzalo Martínez-Muñoz, Autonomous University of Madrid, Spain
Boudjelal Meftah, University Mustapha Stambouli, Mascara, Algeria
Stefano Melacci, University of Siena, Italy
Nikolaos Mitianoudis, Democritus University of Thrace, Greece
Hebatallah Mohamed, Roma Tre University, Italy
Francesco Carlo Morabito, Mediterranean University of Reggio Calabria, Italy
Giorgio Morales, National Telecommunications Research and Training Institute (INICTEL), Peru
Antonio Moran, University of Leon, Spain
Dimitrios Moschou, Aristotle University of Thessaloniki, Greece
Cristhian Motoche, National Polytechnic School, Ecuador
Phivos Mylonas, Ionian University, Greece
Anton Nemchenko, UCLA, USA
Roman Neruda, Czech Academy of Sciences, Czech Republic
Amy Nesky, University of Michigan, USA
Hoang Minh Nguyen, Korea Advanced Institute of Science and Technology, South Korea
Giannis Nikolentzos, Ecole Polytechnique, Palaiseau, France

Dimitri Nowicki, National Academy of Sciences, Ukraine
Stavros Ntalampiras, University of Milan, Italy
Luca Oneto, University of Genoa, Italy
Mihaela Oprea, University Petroleum-Gas of Ploiesti, Romania
Sebastian Otte, University of Tübingen, Germany
Jun Ou, Beijing University of Technology, China
Basil Papadopoulos, Democritus University of Thrace, Greece
Harris Papadopoulos, Frederick University, Cyprus
Antonios Papaleonidas, Democritus University of Thrace, Greece
Krzysztof Patan, University of Zielona Góra, Poland
Jaakko Peltonen, University of Tampere, Finland
Isidoros Perikos, University of Patras, Greece
Alfredo Petrosino, University of Naples Parthenope, Italy
Duc-Hong Pham, Vietnam National University, Vietnam
Elias Pimenidis, University of the West of England, UK
Vincenzo Piuri, University of Milan, Italy
Mirko Polato, University of Padua, Italy
Yifat Prut, The Hebrew University, Israel
Jielin Qiu, Shanghai Jiao Tong University, China
Chhavi Rana, Maharshi Dayanand University, India
Marina Resta, University of Genoa, Italy
Bernardete Ribeiro, University of Coimbra, Portugal
Riccardo Rizzo, National Research Council, Rome, Italy
Manuel Roveri, Polytechnic University of Milan, Italy
Stefano Rovetta, University of Genoa, Italy
Araceli Sanchis de Miguel, Charles III University of Madrid, Spain
Marcello Sanguineti, University of Genoa, Italy
Kyrill Schmid, University of Munich, Germany
Thomas Schmid, University of Leipzig, Germany
Friedhelm Schwenker, Ulm University, Germany
Neslihan Serap Sengor, Istanbul Technical University, Turkey
Will Serrano, Imperial College London, UK
Jivitesh Sharma, University of Agder, Norway
Rafet Sifa, Fraunhofer IAIS, Germany
Sotir Sotirov, University Prof. Dr. Asen Zlatarov, Burgas, Bulgaria
Andreas Stafylopatis, National Technical University of Athens, Greece
Antonino Staiano, University of Naples Parthenope, Italy
Ioannis Stephanakis, Hellenic Telecommunications Organisation, Greece
Michael Stiber, University of Washington Bothell, USA
Catalin Stoean, University of Craiova, Romania
Rudolf Szadkowski, Czech Technical University, Czech Republic
Mandar Tabib, SINTEF, Norway
Kazuhiko Takahashi, Doshisha University, Japan
Igor Tetko, Helmholtz Center Munich, Germany
Yancho Todorov, Aalto University, Espoo, Finland

César Torres-Huitzil, National Polytechnic Institute, Victoria, Tamaulipas, Mexico
Athanasios Tsadiras, Aristotle University of Thessaloniki, Greece
Nicolas Tsapatsoulis, Cyprus University of Technology, Cyprus
George Tsekouras, University of the Aegean, Greece
Matus Tuna, Comenius University in Bratislava, Slovakia
Theodoros Tzouramanis, University of the Aegean, Greece
Juan Camilo Vasquez Tieck, FZI, Karlsruhe, Germany
Nikolaos Vassilas, ATEI of Athens, Greece
Petra Vidnerová, Czech Academy of Sciences, Czech Republic
Alessandro Villa, University of Lausanne, Switzerland
Panagiotis Vlamos, Ionian University, Greece
Thanos Voulodimos, National Technical University of Athens, Greece
Roseli Wedemann, Rio de Janeiro State University, Brazil
Stefan Wermter, University of Hamburg, Germany
Zhihao Ye, Guangdong University of Technology, China
Hujun Yin, University of Manchester, UK
Francisco Zamora-Martinez, Veridas Digital Authentication Solutions, Spain
Yongxiang Zhang, Sun Yat-Sen University, China
Liu Zhongji, Chinese Academy of Sciences, China
Rabiaa Zitouni, Tunis El Manar University, Tunisia
Sarah Zouinina, Université Paris 13, France

Keynote Talks

Cognitive Phase Transitions in the Cerebral Cortex – John Taylor Memorial Lecture

Robert Kozma
University of Massachusetts Amherst

Abstract. Everyday subjective experience of the stream of consciousness suggests continuous cognitive processing in time and smooth underlying brain dynamics. Brain monitoring techniques with markedly improved spatiotemporal resolution, however, show that relatively smooth periods in brain dynamics are frequently interrupted by sudden changes and intermittent discontinuities, evidencing singularities. There are frequent transitions between periods of large-scale synchronization and intermittent desynchronization at alpha-theta rates. These observations support the hypothesis about the cinematic model of cognitive processing, according to which higher cognition can be viewed as multiple movies superimposed in time and space. The metastable spatial patterns of field potentials manifest the frames, and the rapid transitions provide the shutter from each pattern to the next. Recent experimental evidence indicates that the observed discontinuities are not merely important aspects of cognition; they are key attributes of intelligent behavior representing the cognitive “Aha” moment of sudden insight and deep understanding in humans and animals. The discontinuities can be characterized as phase transitions in graphs and networks. We introduce computational models to implement these insights in a new generation of devices with robust artificial intelligence, including oscillatory neuromorphic memories, and self-developing autonomous robots.

On the Deep Learning Revolution in Computer Vision

Nathan Netanyahu
Bar-Ilan University, Israel

Abstract. Computer Vision (CV) is an interdisciplinary field of Artificial Intelligence (AI), which is concerned with the embedding of human visual capabilities in a computerized system. The main thrust of CV, essentially, is to generate an “intelligent” high-level description of the world for a given scene, such that, when interfaced with other thought processes, it can ultimately elicit appropriate action. In this talk we will review several central CV tasks and the traditional approaches taken for handling these tasks for over 50 years. Noting the limited performance of the standard methods applied, we briefly survey the evolution of artificial neural networks (ANN) during this extended period, and focus, specifically, on the ongoing revolutionary performance of deep learning (DL) techniques for the above CV tasks during the past few years. In particular, we also provide an overview of our DL activities, in the context of CV, at Bar-Ilan University. Finally, we discuss future research and development challenges in CV in light of further employment of prospective DL innovations.

From Machine Learning to Machine Diagnostics

Marios Polycarpou
University of Cyprus

Abstract. During the last few years, there has been remarkable progress in utilizing machine learning methods in several applications that benefit from deriving useful patterns among large volumes of data. These advances have attracted significant attention from industry due to the prospect of reducing the cost of predicting future events and making intelligent decisions based on data from past experiences. In this context, a key area that can benefit greatly from the use of machine learning is the task of detecting and diagnosing abnormal behaviour in dynamical systems, especially in safety-critical, large-scale applications. The goal of this presentation is to provide insight into the problem of detecting, isolating and self-correcting abnormal or faulty behaviour in large-scale dynamical systems, to present some design methodologies based on machine learning and to show some illustrative examples. The ultimate goal is to develop the foundation of the concept of machine diagnostics, which would empower smart software algorithms to continuously monitor the health of dynamical systems during the lifetime of their operation.

Multimodal Deep Learning in Biomedical Image Analysis

Sotirios Tsaftaris
University of Edinburgh, UK

Abstract. Nowadays images are typically accompanied by additional information. At the same time, for example, magnetic resonance imaging exams typically contain more than one image modality: they show the same anatomy under different acquisition strategies revealing various pathophysiological information. The detection of disease, segmentation of anatomy and other classical analysis tasks can benefit from a multimodal view to analysis that leverages shared information across the sources yet preserves unique information. It is no surprise that radiologists analyze data in this fashion, reviewing the exam as a whole. Yet, when aiming to automate analysis tasks, we still treat different image modalities in isolation and tend to ignore additional information. In this talk, I will present recent work in learning, with deep neural networks, latent embeddings suitable for multimodal processing, and highlight opportunities and challenges in this area.

Contents – Part II

ELM/Echo State ANN

Rank-Revealing Orthogonal Decomposition in Extreme Learning Machine Design
  Jacek Kabziński

An Improved CAD Framework for Digital Mammogram Classification Using Compound Local Binary Pattern and Chaotic Whale Optimization-Based Kernel Extreme Learning Machine
  Figlu Mohanty, Suvendu Rup, and Bodhisattva Dash

A Novel Echo State Network Model Using Bayesian Ridge Regression and Independent Component Analysis
  Hoang Minh Nguyen, Gaurav Kalra, Tae Joon Jun, and Daeyoung Kim

Image Processing

A Model for Detection of Angular Velocity of Image Motion Based on the Temporal Tuning of the Drosophila
  Huatian Wang, Jigen Peng, Paul Baxter, Chun Zhang, Zhihua Wang, and Shigang Yue

Local Decimal Pattern for Pollen Image Recognition
  Liping Han and Yonghua Xie

New Architecture of Correlated Weights Neural Network for Global Image Transformations
  Sławomir Golak, Anna Jama, Marcin Blachnik, and Tadeusz Wieczorek

Compression-Based Clustering of Video Human Activity Using an ASCII Encoding
  Guillermo Sarasa, Aaron Montero, Ana Granados, and Francisco B. Rodriguez

Medical/Bioinformatics

Deep Autoencoders for Additional Insight into Protein Dynamics
  Mihai Teletin, Gabriela Czibula, Maria-Iuliana Bocicor, Silvana Albert, and Alessandro Pandini

Pilot Design of a Rule-Based System and an Artificial Neural Network to Risk Evaluation of Atherosclerotic Plaques in Long-Range Clinical Research
  Jiri Blahuta, Tomas Soukup, and Jakub Skacel

A Multi-channel Multi-classifier Method for Classifying Pancreatic Cystic Neoplasms Based on ResNet
  Haigen Hu, Kangjie Li, Qiu Guan, Feng Chen, Shengyong Chen, and Yicheng Ni

Breast Cancer Histopathological Image Classification via Deep Active Learning and Confidence Boosting
  Baolin Du, Qi Qi, Han Zheng, Yue Huang, and Xinghao Ding

Epileptic Seizure Prediction from EEG Signals Using Unsupervised Learning and a Polling-Based Decision Process
  Lucas Aparecido Silva Kitano, Miguel Angelo Abreu Sousa, Sara Dereste Santos, Ricardo Pires, Sigride Thome-Souza, and Alexandre Brincalepe Campo

Classification of Bone Tumor on CT Images Using Deep Convolutional Neural Network
  Yang Li, Wenyu Zhou, Guiwen Lv, Guibo Luo, Yuesheng Zhu, and Ji Liu

DSL: Automatic Liver Segmentation with Faster R-CNN and DeepLab
  Wei Tang, Dongsheng Zou, Su Yang, and Jing Shi

Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound
  Nicoló Savioli, Silvia Visentin, Erich Cosmi, Enrico Grisan, Pablo Lamata, and Giovanni Montana

An Original Neural Network for Pulmonary Tuberculosis Diagnosis in Radiographs
  Junyu Liu, Yang Liu, Cheng Wang, Anwei Li, Bowen Meng, Xiangfei Chai, and Panli Zuo

Computerized Counting-Based System for Acute Lymphoblastic Leukemia Detection in Microscopic Blood Images
  Karima Ben-Suliman and Adam Krzyżak

Right Ventricle Segmentation in Cardiac MR Images Using U-Net with Partly Dilated Convolution
  Gregory Borodin and Olga Senyukova

Model Based on Support Vector Machine for the Estimation of the Heart Rate Variability
  Catalina Maria Hernández-Ruiz, Sergio Andrés Villagrán Martínez, Johan Enrique Ortiz Guzmán, and Paulo Alonso Gaona Garcia

High-Resolution Generative Adversarial Neural Networks Applied to Histological Images Generation
  Antoni Mauricio, Jorge López, Roger Huauya, and Jose Diaz

Kernel Tensor Learning in Multi-view Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . Lynn Houthuys and Johan A. K. Suykens

205

Reinforcement ACM: Learning Dynamic Multi-agent Cooperation via Attentional Communication Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xue Han, Hongping Yan, Junge Zhang, and Lingfeng Wang

219

Improving Fuel Economy with LSTM Networks and Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Bougiouklis, Antonis Korkofigkas, and Giorgos Stamou

230

Action Markets in Deep Multi-Agent Reinforcement Learning . . . . . . . . . . . Kyrill Schmid, Lenz Belzner, Thomas Gabor, and Thomy Phan Continuous-Time Spike-Based Reinforcement Learning for Working Memory Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marios Karamanis, Davide Zambrano, and Sander Bohté Reinforcement Learning for Joint Extraction of Entities and Relations . . . . . . Wenpeng Liu, Yanan Cao, Yanbing Liu, Yue Hu, and Jianlong Tan

240

250 263

Pattern Recognition/Text Mining/Clustering

TextNet for Text-Related Image Quality Assessment . . . 275
    Hongyu Li, Junhua Qiu, and Fan Zhu

A Target Dominant Sets Clustering Algorithm . . . 286
    Jian Hou, Chengcong Lv, Aihua Zhang, and Xu E.

Input Pattern Complexity Determines Specialist and Generalist Populations in Drosophila Neural Network . . . 296
    Aaron Montero, Jessica Lopez-Hazas, and Francisco B. Rodriguez

A Hybrid Planning Strategy Through Learning from Vision for Target-Directed Navigation . . . 304
    Xiaomao Zhou, Cornelius Weber, Chandrakant Bothe, and Stefan Wermter

Optimization/Recommendation

Check Regularization: Combining Modularity and Elasticity for Memory Consolidation . . . 315
    Taisuke Kobayashi

Con-CNAME: A Contextual Multi-armed Bandit Algorithm for Personalized Recommendations . . . 326
    Xiaofang Zhang, Qian Zhou, Tieke He, and Bin Liang

Real-Time Session-Based Recommendations Using LSTM with Neural Embeddings . . . 337
    David Lenz, Christian Schulze, and Michael Guckert

Imbalanced Data Classification Based on MBCDK-means Undersampling and GA-ANN . . . 349
    Anping Song and Quanhua Xu

Evolutionary Tuning of a Pulse Mormyrid Electromotor Model to Generate Stereotyped Sequences of Electrical Pulse Intervals . . . 359
    Angel Lareo, Pablo Varona, and F. B. Rodriguez

An Overview of Frank-Wolfe Optimization for Stochasticity Constrained Interpretable Matrix and Tensor Factorization . . . 369
    Rafet Sifa

Computational Neuroscience

A Bio-Feasible Computational Circuit for Neural Activities Persisting and Decaying . . . 383
    Dai Dawei, Weihui, and Su Zihao

Granger Causality to Reveal Functional Connectivity in the Mouse Basal Ganglia-Thalamocortical Circuit . . . 393
    Alessandra Lintas, Takeshi Abe, Alessandro E. P. Villa, and Yoshiyuki Asai

A Temporal Estimate of Integrated Information for Intracranial Functional Connectivity . . . 403
    Xerxes D. Arsiwalla, Daniel Pacheco, Alessandro Principe, Rodrigo Rocamora, and Paul Verschure

SOM/SVM

Randomization vs Optimization in SVM Ensembles . . . 415
    Maryam Sabzevari, Gonzalo Martínez-Muñoz, and Alberto Suárez

An Energy-Based Convolutional SOM Model with Self-adaptation Capabilities . . . 422
    Alexander Gepperth, Ayanava Sarkar, and Thomas Kopinski

A Hierarchy Based Influence Maximization Algorithm in Social Networks . . . 434
    Lingling Li, Kan Li, and Chao Xiang

Convolutional Neural Networks in Combination with Support Vector Machines for Complex Sequential Data Classification . . . 444
    Antreas Dionysiou, Michalis Agathocleous, Chris Christodoulou, and Vasilis Promponas

Classification of SIP Attack Variants with a Hybrid Self-enforcing Network . . . 456
    Waldemar Hartwig, Christina Klüver, Adnan Aziz, and Dirk Hoffstadt

Anomaly Detection/Feature Selection/Autonomous Learning

Generalized Multi-view Unsupervised Feature Selection . . . 469
    Yue Liu, Changqing Zhang, Pengfei Zhu, and Qinghua Hu

Performance Anomaly Detection Models of Virtual Machines for Network Function Virtualization Infrastructure with Machine Learning . . . 479
    Juan Qiu, Qingfeng Du, Yu He, YiQun Lin, Jiaye Zhu, and Kanglin Yin

Emergence of Sensory Representations Using Prediction in Partially Observable Environments . . . 489
    Thibaut Kulak and Michael Garcia Ortiz

Signal Detection

Change Detection in Individual Users' Behavior . . . 501
    Parisa Rastin, Guénaël Cabanes, Basarab Matei, and Jean-Marc Marty

Extraction and Localization of Non-contaminated Alpha and Gamma Oscillations from EEG Signal Using Finite Impulse Response, Stationary Wavelet Transform, and Custom FIR . . . 511
    Najmeddine Abdennour, Abir Hadriche, Tarek Frikha, and Nawel Jmail

Long-Short Term Memory/Chaotic Complex Models

Chaotic Complex-Valued Associative Memory with Adaptive Scaling Factor . . . 523
    Daisuke Karakama, Norihito Katamura, Chigusa Nakano, and Yuko Osana

Computation of Air Traffic Flow Management Performance with Long Short-Term Memories Considering Weather Impact . . . 532
    Stefan Reitmann and Michael Schultz

Wavelet/Reservoir Computing

A Study on the Influence of Wavelet Number Change in the Wavelet Neural Network Architecture for 3D Mesh Deformation Using Trust Region Spherical Parameterization . . . 545
    Naziha Dhibi, Akram Elkefai, and Chokri Ben Amar

Combining Memory and Non-linearity in Echo State Networks . . . 556
    Eleonora Di Gregorio, Claudio Gallicchio, and Alessio Micheli

A Neural Network of Multiresolution Wavelet Analysis . . . 567
    Alexander Efitorov, Vladimir Shiroky, and Sergey Dolenko

Similarity Measures/PSO - RBF

Fast Supervised Selection of Prototypes for Metric-Based Learning . . . 577
    Lluís A. Belanche

Modeling Data Center Temperature Profile in Terms of a First Order Polynomial RBF Network Trained by Particle Swarm Optimization . . . 587
    Ioannis A. Troumbis, George E. Tsekouras, Christos Kalloniatis, Panagiotis Papachiou, and Dias Haralambopoulos

Incorporating Worker Similarity for Label Aggregation in Crowdsourcing . . . 596
    Jiyi Li, Yukino Baba, and Hisashi Kashima

NoSync: Particle Swarm Inspired Distributed DNN Training . . . 607
    Mihailo Isakov and Michel A. Kinsy

Superkernels for RBF Networks Initialization (Short Paper) . . . 621
    David Coufal

Author Index . . . 625

ELM/Echo State ANN

Rank-Revealing Orthogonal Decomposition in Extreme Learning Machine Design

Jacek Kabziński
Lodz University of Technology, Stefanowskiego 18/22, Lodz, Poland

Abstract. Extreme Learning Machine (ELM), a neural network technique used for regression problems, may be considered as a nonlinear transformation (from the training input domain into the output space of hidden neurons) which provides the basis for a linear mean square (LMS) regression problem. The conditioning of this problem is an important factor influencing ELM implementation and accuracy. It is demonstrated that rank-revealing orthogonal decomposition techniques can be used to identify neurons causing collinearity among the LMS regression basis. Such neurons may be eliminated or modified to increase the numerical rank of the matrix which is pseudo-inverted while solving the LMS regression.

Keywords: Neural networks modelling · Extreme learning machine · Nonlinear systems

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 3–13, 2018. https://doi.org/10.1007/978-3-030-01421-6_1

1 Introduction

An Extreme Learning Machine (ELM) [1, 2], a neural network with one fixed hidden layer and adjustable output weights, is able to solve complicated regression or classification problems. In this paper, the application of ELMs to modeling multivariable, nonlinear functions with batch data processing is considered. The main ideas behind the standard ELM approach are that the weights and biases of the hidden nodes are generated at random, without 'seeing the data', and are not adjusted, so 'training' means that the output weights are determined analytically by solving a linear mean square (LMS) problem. Therefore, the training is reduced to one step and the training time is very short compared with iterative training.

The numerical round-off errors of linear mean square regression are the main reasons for ELMs' modeling errors and are strictly connected with the number of neurons in the hidden layer. When the number of hidden layer nodes is small, the ELM may not be able to transform the input into the feature space effectively and the approximation error may be unacceptably large. When the number of hidden layer nodes is large, it increases the computational complexity, may lead to an ill-conditioned LMS regression problem and may even result in overfitting of the ELM.

The necessity of improving the numerical properties of ELM was noticed in several recent publications [3–6]. Neuron pruning techniques were proposed in [7, 8] and incremental learning was used in [9, 10]. Both methods try to find the optimal number of hidden layer nodes. However, with every change of the hidden layer nodes the output weights need to be recalculated, so these techniques considerably increase the computational complexity of ELM. In [11], a method called orthogonal projections to latent structures, which is a combination of orthogonal signal correction and partial least squares, is proposed, but it still leads to a tedious iterative procedure. Complicated methods of probability distribution optimization are proposed in [12].

The main contribution of this paper is to show that rank-revealing transformations, known since the previous century, are effective tools to indicate neurons responsible for numerical collinearity among the LMS regression basis. Such "non-contributing" neurons may be eliminated or modified so that they are useful in the approximation. In any case, the final basis for LMS regression is orthogonal and the output weights may be obtained by solving a well-conditioned LMS problem.

The standard ELM is described in Sect. 2. Instead of the most popular random generation of weights, the application of low discrepancy sequences (LDS) [13, 14] is considered. Rank-Revealing Orthogonal Decomposition is introduced and applied in Sect. 3, while the proposed neuron modification procedure is presented in Sect. 4. The paper ends with numerical experiments and conclusions.

2 Basic Extreme Learning Machine

The standard Extreme Learning Machine applied to modelling (regression) problems may be considered as a combination of a nonlinear mapping from the input space into the feature space and a linear least-squares regression. The training data for an n-input ELM form a batch of N samples:

$$\{(x_i, t_i),\; x_i \in \mathbb{R}^n,\; t_i \in \mathbb{R},\; i = 1, \ldots, N\}, \tag{1}$$

where $x_i$ denote the inputs and $t_i$ denote the desired outputs, which form the target (column) vector $T = [t_1 \ldots t_N]^T$. It is assumed that each input is normalized to the interval [0,1].

The nonlinear mapping is performed by a single layer of hidden neurons with infinitely differentiable activation functions. The "projection-based neurons" are used most commonly. Each n-dimensional input is projected by the input layer weights $w_k^T = [w_{k,1} \ldots w_{k,n}]$, $k = 1, \ldots, M$, and the bias $b_k$ into the k-th hidden neuron input. Next, a nonlinear transformation $h_k$, called the activation function (AF) of the neuron, is applied to obtain the neuron output. The transformation of a batch of N samples by the hidden layer is represented as an $N \times M$ matrix:

$$H = H_{N \times M} = \left[ h_i\left(w_i^T x_j + b_i\right) \right]_{\substack{j = 1, \ldots, N \\ i = 1, \ldots, M}}. \tag{2}$$

It is assumed that the number of samples is greater than the number of neurons: $N > M$. The impact of the selected type of AFs on the network performance is limited, and therefore sigmoid AFs remain among the most widely used.
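As an illustration of the matrix in (2), the following minimal sketch builds H for sigmoid neurons with NumPy. The function name `hidden_matrix`, the sizes and the seed are assumptions made only for this example; they do not come from the paper.

```python
import numpy as np

def hidden_matrix(X, W, b):
    """Return the N x M matrix H with H[j, i] = sigmoid(w_i^T x_j + b_i).

    X: (N, n) inputs scaled to [0, 1]; W: (M, n) input weights; b: (M,) biases.
    """
    return 1.0 / (1.0 + np.exp(-(X @ W.T + b)))

rng = np.random.default_rng(0)            # seed chosen only for reproducibility
N, n, M = 200, 2, 20                      # illustrative sizes with N > M
X = rng.uniform(0.0, 1.0, size=(N, n))    # inputs normalised to [0, 1]
W = rng.uniform(-1.0, 1.0, size=(M, n))   # hypothetical weights in [-1, 1]
b = rng.uniform(-1.0, 1.0, size=M)        # hypothetical biases in [-1, 1]
H = hidden_matrix(X, W, b)                # rows: samples, columns: neurons
```

Each column of H is the response of one hidden neuron over the whole batch, which is why collinear columns correspond to redundant neurons.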


According to the standard approach, the weights and biases are generated at random, using any continuous probability distribution [2]. Using the uniform distribution in $[-1, 1]$ to generate the weights and the biases is the standard procedure. Recently, an application of Low-Discrepancy Sequences (LDS) [13, 14] was proposed to replace the random generation of neurons' parameters. The discrepancy measures the uniformity of a sequence $X$ of $N$ points in the hypercube $P = [0, 1]^n$, and is defined as

$$D_N(X) = \sup_B \left| \frac{No(X, B)}{N} - L(B) \right|, \tag{3}$$

where $B$ is any hypercube $[a_1, b_1] \times \ldots \times [a_n, b_n] \subset P$, $No(X, B)$ denotes the number of points from $X$ belonging to $B$ and $L(B)$ is the Lebesgue measure (volume) of $B$ [13]. So, low discrepancy means that the number of points in a subset is as proportional as possible to the volume. Numerical procedures to generate various LDSs are offered by popular software packages. For example, easy generation of Halton and Sobol sequences [13, 14] is possible in Matlab. The distance between any LDS and a random set tends to zero as the number of points increases (Fig. 1), so the universal approximation property of a standard (randomly-generated) ELM [15] is generalized to the deterministic case with weights and biases taken from an LDS [16] (Fig. 2).

Fig. 1. Maximal distance (mean in 20 experiments) to the nearest neighbour: * inside the random set, o inside the Halton set, + from the Halton set to the random set. Points are generated in the 3-dimensional cube. [Plot: grid distance versus number of points; legend: Rand, Halton, Rand/Halton.]

Fig. 2. 80 points generated from the Halton sequence and from the uniform distribution in 2 dimensions. [Panels: Halton set, Random set.]
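To make the comparison in Figs. 1 and 2 concrete, here is a small sketch that generates Halton points from the radical-inverse (van der Corput) construction and computes the maximal nearest-neighbour distance inside a point set. It is a plain NumPy illustration, not the Matlab routine mentioned in the text, and all function names are assumptions of this example.

```python
import numpy as np

def van_der_corput(count, base):
    """First `count` terms of the base-`base` radical-inverse sequence."""
    seq = np.empty(count)
    for idx in range(count):
        k, f, value = idx + 1, 1.0, 0.0
        while k > 0:
            f /= base                 # next digit weight: base^-1, base^-2, ...
            value += f * (k % base)   # mirror the digits around the radix point
            k //= base
        seq[idx] = value
    return seq

def halton(count, dim, primes=(2, 3, 5, 7, 11, 13)):
    """`count` Halton points in [0, 1)^dim, one prime base per coordinate."""
    return np.column_stack([van_der_corput(count, primes[d]) for d in range(dim)])

def max_nearest_neighbour_distance(P):
    """Largest distance from a point of P to its nearest neighbour in P."""
    diff = P[:, None, :] - P[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)    # ignore the zero distance to itself
    return dist.min(axis=1).max()

points = halton(80, 2)                # 80 points in 2 dimensions, as in Fig. 2
```

Evaluating `max_nearest_neighbour_distance` on `points` and on a uniformly random set of the same size gives the kind of quantity plotted in Fig. 1 for a single experiment.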

Hence, deterministic creation of neurons' weights and biases from an LDS is an interesting alternative and makes it possible to describe features of an ELM without repeating numerous experiments.

The output weights $\beta$ are found by minimizing the approximation error

$$E_C = \|\beta\|^2 + C\|H\beta - T\|^2, \tag{4}$$

where $C > 0$ is a design parameter added to improve the conditioning of the problem. This approach is called 'Tikhonov regularization' [17]. For $C \to \infty$ the problem becomes equivalent to the minimization of

$$E_\infty = \|H\beta - T\|^2. \tag{5}$$

The output weights which minimize the regularized criterion (4) are

$$\beta^C_{opt} = \left(\frac{1}{C} I + H^T H\right)^{-1} H^T T, \tag{6}$$

while (5) is minimized by

$$\beta_{opt} = H^+ T, \tag{7}$$

where $H^+$ is the Moore–Penrose generalized inverse of matrix $H$:

$$H^+ = \left(H^T H\right)^{-1} H^T. \tag{8}$$
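The two solutions (6) and (7) can be sketched numerically as follows. The matrix and targets below are random stand-ins, and the function names are assumptions of this example; the sketch solves the linear systems directly instead of forming explicit inverses, which is a common numerical practice rather than something prescribed by the paper.

```python
import numpy as np

def output_weights_regularized(H, T, C):
    """Eq. (6): solve (I/C + H^T H) beta = H^T T without forming an inverse."""
    M = H.shape[1]
    return np.linalg.solve(np.eye(M) / C + H.T @ H, H.T @ T)

def output_weights_pinv(H, T):
    """Eq. (7): beta = H^+ T, computed with an SVD-based least-squares solver."""
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)
    return beta

rng = np.random.default_rng(1)
H = rng.uniform(size=(50, 5))    # stand-in for a hidden-layer output matrix
T = rng.uniform(size=50)         # stand-in targets
beta_c = output_weights_regularized(H, T, C=1e8)
beta = output_weights_pinv(H, T)
```

For a well-conditioned H and a large C the two weight vectors nearly coincide, in line with the remark that (4) approaches (5) as C grows.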

3 ELM with Rank-Revealing Orthogonal Decomposition

When a large number of hidden layer neurons is selected, high correlations and multicollinearity always exist among the columns of the hidden layer output matrix $H$. This may lead to ill-conditioning of the Moore–Penrose calculation or cause overfitting of the final model. The high condition number of $H^T H$ is the main reason for numerical difficulties in ELM implementation. The Tikhonov regularization is supposed to improve this situation: the coefficient $C$ is selected to decrease the condition number of $\frac{1}{C} I + H^T H$, but this unavoidably degrades the approximation accuracy.

The "numerical rank" of a matrix is defined as the number of singular values larger than a certain threshold $r$. Rank-Revealing Orthogonal Decomposition, introduced in [18, 19], makes it possible to eliminate multicollinearity among the columns of $H$, because it provides information about the numerical rank of the matrix. For any numerical rank threshold $r$, the algorithm called RRQR (Rank-Revealing QR factorization) represents the column-permuted matrix $H$ as

$$H \begin{bmatrix} P_1 & P_2 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 & Q_3 \end{bmatrix} \begin{bmatrix} R_1 & R_2 \\ 0 & R_3 \\ 0 & 0 \end{bmatrix}, \tag{9}$$

where $P_1, P_2$ are permutation matrices, $Q = [Q_1\ Q_2\ Q_3]$ is an orthogonal matrix, $R_1, R_3$ are upper-triangular matrices and $R_1$ is a full-numerical-rank matrix with respect to the threshold $r$, while the maximal singular value of $R_3$ is not bigger than $r$. Therefore, after calculation of the rank-revealing QR factorization, the orthogonal matrix

$$Q_1 = H P_1 R_1^{-1} \tag{10}$$

may be used to replace $H$. The multiplication by the permutation matrix $P_1$ represents selection of the neurons that contribute to the numerical rank. Then, the multiplication by $R_1^{-1}$ provides normalization such that $\mathrm{cond}\left(Q_1^T Q_1\right) = 1$. Finally, the optimal output weights are obtained from

$$\beta_{opt} = \left(Q_1^T Q_1\right)^{-1} Q_1^T T. \tag{11}$$

Some neurons are eliminated permanently from the initial set of neurons; hence, the final number of neurons may be smaller than initially planned. The effort spent on selecting the parameters of the excluded neurons is therefore wasted, but all remaining neurons contribute to the effective approximation. The final form of the network is presented in Fig. 3.

Fig. 3. The modified ELM: (a) input, (b) weights and biases, (c) hidden neurons, (d) elimination of neurons, (e) normalization, (f) output weights, (g) output.
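The selection and normalization steps in (9) to (11) can be approximated with QR factorization with column pivoting, which is a widely used practical stand-in for a rank-revealing QR but does not carry the guarantees of the dedicated RRQR algorithms of [18, 19]. The sketch below assumes SciPy is available; the function names and the use of the diagonal of R as a proxy for the numerical-rank test are assumptions of this example.

```python
import numpy as np
from scipy.linalg import qr

def orthogonalize_hidden_layer(H, r):
    """Split neurons into contributing / non-contributing via pivoted QR.

    Returns Q1 (orthonormal columns), R1, and the index arrays of kept and
    rejected columns of H. Using |R[k, k]| > r as the numerical-rank test is
    an assumption of this sketch, not the exact RRQR criterion.
    """
    Q, R, perm = qr(H, mode="economic", pivoting=True)   # H[:, perm] = Q @ R
    k = int(np.sum(np.abs(np.diag(R)) > r))
    return Q[:, :k], R[:k, :k], perm[:k], perm[k:]

def oelm_output_weights(Q1, T):
    """Eq. (11): with Q1^T Q1 = I it reduces to beta = Q1^T T."""
    return Q1.T @ T

rng = np.random.default_rng(2)
base = rng.uniform(size=(100, 4))
H = np.column_stack([base, base[:, 0] + base[:, 1]])     # 5th column is collinear
Q1, R1, kept, rejected = orthogonalize_hidden_layer(H, r=1e-9)
beta = oelm_output_weights(Q1, rng.uniform(size=100))
```

Because the weights refer to the orthogonal basis, predictions for new inputs would use the kept columns of the new hidden-layer matrix multiplied by the inverse of `R1`, mirroring (10).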

The applied algorithms (RRQR and triangular matrix inversion) are available in various software packages. The computational complexity of the proposed modification is $O(M^3)$. All operations are done "without seeing the data" and the final training of the network (calculation of $\beta_{opt}$) is done in one step. The only additional parameter is the numerical rank threshold $r$.

The proposed procedure may lead to stagnation of the number of neurons, limited by the numerical rank condition, in spite of the user's plan to use more neurons. Therefore, RRQR factorization may be used to recognize "non-contributing" neurons. Such neurons are indicated by the permutation matrix $P_2$ and their parameters must be modified. The modification is discussed in the next section, and the pseudo-code for the complete design procedure is presented in Fig. 4.

1. Select weights and biases for hidden neurons.
2. Select the numerical rank threshold r.
3. Calculate the matrix H using the neurons' parameters and input samples.
4. While LoopCounter < Max do
   4.1 Perform the RRQR factorization of the matrix H with the numerical rank threshold r.
   4.2 Use the permutation matrix P2 to recognize the non-contributing neurons.
   4.3 Modify the weights and biases of the non-contributing neurons.
   4.4 Calculate the new matrix H using the new neurons' parameters and input samples.
   4.5 Increment the LoopCounter and go to 4.1.
5. Perform the RRQR factorization of the matrix H with the numerical rank threshold r.
6. Eliminate the non-contributing neurons and calculate the optimal output weights from (11).

Fig. 4. The pseudo-code for the modified ELM

4 Modification of Non-contributing Neurons

According to step 4.3 of the design procedure presented in Fig. 4, the weights and biases of the selected neurons need to be modified. The aim of this modification is to change the columns of the matrix $H$ which do not contribute to the numerical rank for the given threshold, i.e. the columns in $HP_2$. The modification has to preserve the nature of the weights selection: at random, using a continuous probability distribution, or from an LDS, in a compact hypercube. Several approaches are possible, but it is well known that multicollinearity of columns in the matrix $H$ may be caused by an insufficient variance of the AFs. An easy way to enhance the variance of sigmoid AFs was proposed in [4–6] and may be applied to modify the weights and biases of the non-contributing neurons.

The first step to enlarge the variation of sigmoid activation functions is to increase the range of the weights. The weights must be large enough to expose the nonlinearity of the sigmoid AF, and small enough to prevent saturation. Therefore, the already selected weights of non-contributing neurons are multiplied by a random factor taken from the interval $[q, p]$. The values $[q, p] = [3, 10]$ are suitable. Next, the biases are selected to guarantee that the range of the sigmoid function is sufficiently large. The minimal value of the sigmoid function

$$h_k(x) = \frac{1}{1 + \exp\left(-\left(w_k^T x + b_k\right)\right)}, \quad x \in [0, 1]^n, \tag{12}$$

is achieved at the vertex selected according to the following rules:

$$w_{k,i} > 0 \Rightarrow x_i = 0, \quad w_{k,i} < 0 \Rightarrow x_i = 1, \quad i = 1, \ldots, n, \tag{13}$$

and equals

$$h_{k,\min} = \frac{1}{1 + \exp\left(-\left(\sum_{i:\, w_{k,i} < 0} w_{k,i} + b_k\right)\right)}. \tag{14}$$

The maximal value is attained at the vertex defined by

$$w_{k,i} > 0 \Rightarrow x_i = 1, \quad w_{k,i} < 0 \Rightarrow x_i = 0, \quad i = 1, \ldots, n, \tag{15}$$

and is

$$h_{k,\max} = \frac{1}{1 + \exp\left(-\left(\sum_{i:\, w_{k,i} > 0} w_{k,i} + b_k\right)\right)}. \tag{16}$$

Therefore, obtaining $h_{k,\min} < r_1$ and $h_{k,\max} > r_2$ for given $0 < r_1 < r_2 < 1$ requires

$$\underline{b} := -\sum_{i:\, w_{k,i} > 0} w_{k,i} - \ln\left(\frac{1}{r_2} - 1\right) < b_k < -\sum_{i:\, w_{k,i} < 0} w_{k,i} - \ln\left(\frac{1}{r_1} - 1\right) =: \tilde{b}. \tag{17}$$

As the initial bias $b_{k,old}$ was selected from the interval $[-1, 1]$, it is modified according to the linear transformation

$$b_{k,new} = \frac{1}{2}\left(\tilde{b} - \underline{b}\right) b_{k,old} + \frac{1}{2}\left(\tilde{b} + \underline{b}\right), \tag{18}$$

providing the chance for $\underline{b} < b_{k,new} < \tilde{b}$.
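The weight rescaling and the bias remapping of this section can be sketched as below. The sign convention of the sigmoid is inferred from the vertex rules (13) and (15), and the names `enhance_neuron` and `sigmoid_range` are hypothetical helpers introduced for this example only.

```python
import numpy as np

def enhance_neuron(w, b_old, r1=0.1, r2=0.9, q=3.0, p=10.0, rng=None):
    """Rescale the weights and remap the bias of one sigmoid neuron.

    w: (n,) weights; b_old: bias in [-1, 1]. Returns (w_new, b_new).
    """
    rng = np.random.default_rng() if rng is None else rng
    w_new = w * rng.uniform(q, p)                 # random factor from [q, p]
    s_pos = w_new[w_new > 0].sum()                # sum of positive weights
    s_neg = w_new[w_new < 0].sum()                # sum of negative weights
    b_low = -s_pos - np.log(1.0 / r2 - 1.0)       # lower bound in (17)
    b_high = -s_neg - np.log(1.0 / r1 - 1.0)      # upper bound in (17)
    b_new = 0.5 * (b_high - b_low) * b_old + 0.5 * (b_high + b_low)   # (18)
    return w_new, b_new

def sigmoid_range(w, b):
    """Minimum and maximum of the sigmoid over [0, 1]^n, Eqs. (14) and (16)."""
    h_min = 1.0 / (1.0 + np.exp(-(w[w < 0].sum() + b)))
    h_max = 1.0 / (1.0 + np.exp(-(w[w > 0].sum() + b)))
    return h_min, h_max

w_old = np.array([1.0, -0.8])                     # illustrative weights
w_new, b_new = enhance_neuron(w_old, b_old=0.3, rng=np.random.default_rng(3))
h_min, h_max = sigmoid_range(w_new, b_new)
```

Note that the interval in (17) is non-empty only when the sum of the absolute values of the rescaled weights exceeds ln(1/r1 - 1) - ln(1/r2 - 1), about 4.39 for r1 = 0.1 and r2 = 0.9; the example weights are chosen so that this holds for every factor in [3, 10].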

5 Numerical Examples

The two-dimensional function

$$z = \sin(2\pi(x_1 + x_2)), \quad x_1, x_2 \in [0, 1] \tag{19}$$

is considered. 200 samples selected at random constitute the training set, and 100 samples are used as the test set. The surface (19) is plotted in Fig. 5. In all experiments, the initial values of the hidden layer neurons' weights and biases are selected from the Halton sequence. First, only orthogonalization-based elimination of the non-contributing neurons is applied, and this approach (Orthogonalized Extreme Learning Machine, OELM) is compared with the standard ELM. The numerical rank threshold is $10^{-9}$. As presented in Fig. 6, the number of finally used neurons stabilizes below 50, although up to 100 neurons were initially planned. The achieved modeling accuracy is almost the same as that obtained with the standard approach with 100 neurons (Fig. 7) and the conditioning of the output weights calculation is far better (Fig. 8).

Fig. 5. The surface (19) with the training (circles) and the testing (stars) data. [Axes: x1, x2, z.]

Fig. 6. Reduction of non-contributing neurons. [Axes: No of used neurons versus No of hidden neurons.]

Fig. 7. Training and test errors of the ELM with reduced number of neurons (OELM) and the standard ELM. [Axes: ELM errors versus No of hidden neurons; legend: OELM-train err., OELM-test err., ELM-train err., ELM-test err.]

Fig. 8. The condition coefficient and the norm of output weights in OELM and standard ELM. [Logarithmic scale versus No of hidden neurons; legend: OELM-log(norm(OW)), OELM-log(cond(I/C+H^T*H)), ELM-log(norm(OW)), ELM-log(cond(I/C+H^T*H)).]

Of course, increasing the numerical rank threshold reduces the number of finally used neurons, and the approximation errors increase. If the threshold equals $10^{-3}$, the number of neurons is reduced below 15 and the errors stabilize at 0.5, which is far too large. Applying the procedure enhancing the variation of the AFs ($r_1 = 0.1$, $r_2 = 0.9$), it is possible to increase the number of finally used neurons and to reduce the approximation errors while preserving the numerical rank threshold of $10^{-3}$. Fig. 9 presents the number of finally used neurons after the first, second and third application of the variation-enhancing procedure. In this case, the errors of the modified ELM are smaller than those of the standard ELM (Fig. 10), while it is still guaranteed that the condition number equals 1 and the norm of the output weights is minimized (Figs. 11 and 12).

Fig. 9. The number of finally used neurons after the first, second and third application of the enhancing variation procedure. [Axes: No of used neurons versus No of hidden neurons; legend: Finally used neurons (1), (2), (3), Initial contributing neurons.]

Fig. 10. Training and test errors of the ELM with reduced number of neurons and enhancing variation procedure (OELM) and the standard ELM. [Axes: ELM errors versus No of hidden neurons; legend: OELM-train err., OELM-test err., ELM-train err., ELM-test err.]

Fig. 11. The condition coefficient and the norm of output weights in ELM with neurons reduction and enhancing variation modification (OELM) compared with the standard ELM. [Logarithmic scale versus No of hidden neurons.]

Fig. 12. The surface generated by the trained ELM. [Panel title: With orthogonalization; axes: x, y.]
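A rough end-to-end sketch of the first experiment is given below. It is not a reproduction of the reported results: it uses pseudo-random weights instead of the Halton sequence, column-pivoted QR from SciPy as a stand-in for RRQR, and an arbitrary seed, so the printed numbers will differ from the figures.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def target(X):
    return np.sin(2.0 * np.pi * (X[:, 0] + X[:, 1]))          # Eq. (19)

def hidden(X, W, b):
    return 1.0 / (1.0 + np.exp(-(X @ W.T + b)))               # sigmoid neurons

def rmse(pred, true):
    return float(np.sqrt(np.mean((pred - true) ** 2)))

rng = np.random.default_rng(4)                                # arbitrary seed
X_train, X_test = rng.uniform(size=(200, 2)), rng.uniform(size=(100, 2))
T_train, T_test = target(X_train), target(X_test)

M, r = 100, 1e-9                                              # planned neurons, threshold
W, b = rng.uniform(-1, 1, size=(M, 2)), rng.uniform(-1, 1, size=M)
H_train, H_test = hidden(X_train, W, b), hidden(X_test, W, b)

# Standard ELM: minimum-norm least-squares output weights, Eq. (7).
beta_elm, *_ = np.linalg.lstsq(H_train, T_train, rcond=None)

# OELM: keep columns whose pivoted-QR diagonal exceeds r, then use Q1.
Q, R, perm = qr(H_train, mode="economic", pivoting=True)
k = int(np.sum(np.abs(np.diag(R)) > r))
Q1, R1, kept = Q[:, :k], R[:k, :k], perm[:k]
beta_oelm = Q1.T @ T_train                                    # Eq. (11)

# Test-time basis H_test[:, kept] @ inv(R1), computed by a triangular solve.
Q1_test = solve_triangular(R1, H_test[:, kept].T, trans="T", lower=False).T

print("neurons kept:", k)
print("ELM  test RMSE:", rmse(H_test @ beta_elm, T_test))
print("OELM test RMSE:", rmse(Q1_test @ beta_oelm, T_test))
```

Repeating the run for increasing M and plotting `k` against M would give a curve of the same kind as Fig. 6, under the assumptions stated above.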

6 Conclusions

The rank-revealing QR decomposition is an effective tool to indicate the neurons in an ELM which do not contribute to the effective approximation due to multicollinearity of the columns in matrix $H$. The indicated neurons may be eliminated and the remaining neurons may be linearly transformed to get the orthogonal basis for the final linear mean square problem, which provides the output weights. If this procedure leaves too few neurons to reach the desired approximation accuracy, the indicated non-contributing neurons may be modified to enhance the variation of the AFs. The approach presented in Sect. 4 re-scales previously chosen weights and biases and increases the number of contributing neurons. The numerical rank threshold is the only additional parameter of the ELM design and it makes it possible to control the numerical properties of the network training effectively.

References

1. Huang, G., Huang, G.-B., Song, S., You, K.: Trends in extreme learning machines: a review. Neural Netw. 61(1), 32–48 (2015)
2. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
3. Akusok, A., Bjork, K.M., Miche, Y., Lendasse, A.: High-performance extreme learning machines: a complete toolbox for big data applications. IEEE Access 3, 1011–1025 (2015)
4. Kabziński, J.: Extreme learning machine with enhanced variation of activation functions. In: IJCCI 2016 - Proceedings of the 8th International Joint Conference on Computational Intelligence, vol. 3, pp. 77–82 (2016)
5. Kabziński, J.: Extreme learning machine with diversified neurons. In: CINTI 2016 - 17th IEEE International Symposium on Computational Intelligence and Informatics: Proceedings, pp. 181–186 (2016)
6. Kabziński, J.: Is extreme learning machine effective for multisource friction modeling? In: Chbeir, R., Manolopoulos, Y., Maglogiannis, I., Alhajj, R. (eds.) AIAI 2015. IAICT, vol. 458, pp. 318–333. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23868-5_23
7. Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM: optimally pruned extreme learning machine. IEEE Trans. Neural Netw. 21(1), 158–162 (2010)
8. Rong, H.J., Ong, Y.S., Tan, A.H., Zhu, Z.X.: A fast pruned-extreme learning machine for classification problem. Neurocomputing 72(1–3), 359–366 (2008)
9. Huang, G.-B., Chen, L.: Enhanced random search based incremental extreme learning machine. Neurocomputing 71(16–17), 3460–3468 (2008)
10. Feng, G., Bin Huang, G., Lin, Q., Gay, R.: Error minimized extreme learning machine with growth of hidden nodes and incremental learning. IEEE Trans. Neural Netw. 20(8), 1352–1357 (2009)
11. Zhang, R., Xu, M., Han, M., Li, H.: Multivariate chaotic time series prediction using based on improved Extreme Learning Machine. In: Proceedings of the 36th Chinese Control Conference, 26–28 July 2017, Dalian, China, pp. 4006–4011 (2017)
12. Han, H., Gan, L., He, L.: Improved variations for Extreme Learning Machine: space embedded ELM and optimal distribution ELM. In: 20th International Conference on Information Fusion, Fusion 2017 - Proceedings, no. 2 (2017)
13. Dick, J., Pillichshammer, F.: Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press (2010)
14. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia (1992)
15. Bin Huang, G., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
16. Cervellera, C., Macciò, D.: Low-discrepancy points for deterministic assignment of hidden weights in extreme learning machines. IEEE Trans. Neural Netw. Learn. Syst. 27(4), 891–896 (2016)
17. Tikhonov, A.N., Goncharsky, A., Stepanov, V.V., Yagola, A.G.: Numerical Methods for the Solution of Ill-posed Problems. Kluwer Academic Publishers, Dordrecht (1995)
18. Fierro, R.D., Hansen, P.Ch.: Low-rank revealing UTV decompositions. Numer. Algorithms 15, 37–55 (1997)
19. Chan, T.F.: Rank revealing QR factorizations. Linear Algebra Appl. 88(89), 67–82 (1987)

An Improved CAD Framework for Digital Mammogram Classification Using Compound Local Binary Pattern and Chaotic Whale Optimization-Based Kernel Extreme Learning Machine

Figlu Mohanty, Suvendu Rup, and Bodhisattva Dash
Image and Video Processing Laboratory, IIIT Bhubaneswar, Bhubaneswar, India

Abstract. The morbidity and mortality rates of breast cancer remain high among women across the world. These figures can be reduced if the cancer is identified at an early stage. A computer-aided diagnosis (CAD) system is an efficient computerized tool used to analyze mammograms for finding cancer in the breast and to reach a decision with maximum accuracy. The presented work aims at developing a CAD model which can accurately classify mammograms as normal or abnormal and, further, as benign or malignant. In the present model, CLAHE is used for image pre-processing and compound local binary pattern (CM-LBP) for feature extraction, followed by principal component analysis (PCA) for feature reduction. Then, a chaotic whale optimization-based kernel extreme learning machine (CWO-KELM) is utilized to classify the mammograms as normal/abnormal and benign/malignant. The present model achieves the highest accuracy of 100% and 99.48% for MIAS and DDSM, respectively.

Keywords: Mammograms · Compound local binary pattern · Chaotic map · Whale optimization algorithm · Kernel extreme learning machine

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 14–23, 2018. https://doi.org/10.1007/978-3-030-01421-6_2

1 Introduction

According to cancer statistics, the incidence and mortality of breast cancer are increasing day by day. The World Health Organization [17] estimates 21 million cancer cases by the year 2030, up from 12.7 million in 2008. In India, 0.537 million women and 0.477 million men were diagnosed with breast cancer in the year 2012 [19]. It is therefore essential to design an efficient detection and diagnosis tool in order to reduce the mortality rates among women and men. In this context, mammography is the most effective and reliable tool to detect abnormalities in the breast at the earliest stage. A computer-aided diagnosis (CAD) system combines various principles of image analysis, machine learning, and pattern recognition to examine the crucial information present in the mammograms. CAD is a fast and cost-effective system that assists medical practitioners or radiologists in detecting and diagnosing cancer. Designing a CAD system with high efficiency is an important as well as a challenging task.

Various researchers have developed different CAD models to detect or diagnose abnormalities in mammograms. Yasser et al. [21] proposed a CAD system which extracts features using discrete wavelet transform (DWT), contourlet transform (CT), and local binary pattern (LBP), reporting a best accuracy of 98.63%. Bajaj et al. [4] proposed a novel approach using bi-dimensional empirical mode decomposition (BEMD) and least-square SVM for mammogram classification. Chithra et al. [5] used wavelet entropy and ensemble classifiers using k-nearest neighbor (k-NN) and SVM. Singh et al. [24] proposed a wavelet-based center-symmetric local binary pattern technique to extract features from mammograms, yielding an accuracy of 97.3%. A few more recently proposed CAD systems can be found in [8, 14, 20, 26]. Motivated by the previously developed CAD schemes, the proposed work aims at designing an efficient and robust CAD system for accurate diagnosis of breast cancer.

The structure of the remaining article is as follows: Sect. 2 elaborates the proposed CAD model. Section 3 analyses the results obtained with the proposed model. Lastly, Sect. 4 presents the concluding remarks.

2 Proposed Methodology

In the present work, the desired ROIs are first generated from the original mammograms using a simple cropping approach. For abnormal mammograms, cropping uses the given ground truth information about the position and radius of the abnormal regions. For normal mammograms, the ROI is cropped at an arbitrary location. Once the ROIs are obtained, the CM-LBP technique is applied to extract texture features from them. Thereafter, PCA is applied to reduce the size of the feature vector, followed by CWO-KELM to classify the mammograms as normal or abnormal and, further, as benign or malignant. A detailed diagram of the proposed CAD model is shown in Fig. 1.

Fig. 1. Block diagram of the proposed CAD model

2.1 Pre-processing Using CLAHE

Pre-processing of the ROIs is a vital step before any further module in a CAD system. Some of the images collected from the datasets have low contrast, so their contrast needs to be enhanced. Hence, in the present work, contrast limited adaptive histogram equalization (CLAHE) [18] is used to improve the quality of the low-contrast images.

2.2 Feature Extraction Using Compound Local Binary Pattern (CM-LBP)

The original local binary pattern operator ignores the magnitude of the difference between the center pixel value and its neighboring pixel values, which results in an inconsistent code. CM-LBP was introduced to overcome this issue [1]. In CM-LBP, a 2-bit code encodes the local texture information of an image: the first bit indicates the sign of the difference between the center pixel and a neighboring pixel, and the second bit represents the magnitude of that difference with respect to a threshold value $T_{avg}$. The term $T_{avg}$ is the average magnitude of the difference between the center pixel and its neighbors in the local neighborhood window. If $P_n$ is one of the neighboring pixels and $P_c$ is the center pixel, then the 2-bit code is given by:

$$
C(P_n, P_c) =
\begin{cases}
00 & P_n - P_c < 0 \text{ and } |P_n - P_c| \le T_{avg} \\
01 & P_n - P_c < 0 \text{ and } |P_n - P_c| > T_{avg} \\
10 & P_n - P_c \ge 0 \text{ and } |P_n - P_c| \le T_{avg} \\
11 & \text{otherwise}
\end{cases}
\tag{1}
$$

Applying $C(P_n, P_c)$ to the eight neighbors generates a 16-bit code, which is split into two 8-bit codes, one for the diagonal neighbors and one for the non-diagonal neighbors. A separate histogram is then computed for each of the two groups, and the two histograms are combined to form the CM-LBP descriptor (abbreviated CLBP in [1] and in the remainder of this paper).
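As an illustration of the 2-bit code of Eq. (1), the following minimal sketch encodes one 3 × 3 window and packs the result into the two 8-bit sub-patterns. The function names, the clockwise neighbor ordering starting at the top-left corner, and the bit-packing order are assumptions made for this example; they are not taken from [1].

```python
def cmlbp_codes(window):
    """Return the eight (sign_bit, magnitude_bit) pairs for a 3x3 window."""
    center = window[1][1]
    # Assumed ordering: clockwise, starting at the top-left corner.
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    diffs = [window[r][c] - center for r, c in coords]
    # T_avg: average magnitude of the neighbor-center differences.
    t_avg = sum(abs(d) for d in diffs) / len(diffs)
    codes = []
    for d in diffs:
        sign_bit = 1 if d >= 0 else 0
        magnitude_bit = 1 if abs(d) > t_avg else 0
        codes.append((sign_bit, magnitude_bit))
    return codes


def split_patterns(codes):
    """Pack the codes of the diagonal and non-diagonal neighbors into two 8-bit values."""
    def pack(selected):
        value = 0
        for sign_bit, magnitude_bit in selected:
            value = (value << 2) | (sign_bit << 1) | magnitude_bit
        return value

    diagonal = pack(codes[0::2])      # the four corner neighbors
    non_diagonal = pack(codes[1::2])  # the four edge neighbors
    return diagonal, non_diagonal


window = [[5, 9, 1],
          [4, 6, 7],
          [6, 2, 12]]
codes = cmlbp_codes(window)
diagonal, non_diagonal = split_patterns(codes)
print(codes)
print(diagonal, non_diagonal)
```

Accumulating the two values over all pixels of an ROI into two 256-bin histograms and concatenating them would give 512 features, which is consistent with the feature count reported in Sect. 3.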

2.3 Feature Reduction Using PCA

The feature extraction module yields a large number of features. To avoid the 'curse of dimensionality' and to simplify the classifier's task, the size of the feature vector needs to be reduced. In addition, some of the extracted features are not relevant. PCA is employed to obtain a compact set of relevant features. It transforms the high-dimensional data into a lower-dimensional representation that retains the maximum variance of the original data. This transform produces a set of linearly uncorrelated variables called principal components (PCs). A detailed treatment of PCA can be found in [6,9].
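The variance-retaining reduction described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name `pca_reduce`, the SVD-based formulation, and the synthetic data are not part of the paper, which does not specify its PCA implementation.

```python
import numpy as np


def pca_reduce(features, variance_to_keep=0.99):
    """Project rows onto the fewest principal components whose cumulative
    explained variance reaches `variance_to_keep`."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    ratio = np.cumsum(explained) / explained.sum()
    n_components = int(np.searchsorted(ratio, variance_to_keep)) + 1
    n_components = min(n_components, len(ratio))
    return centered @ vt[:n_components].T, n_components


# Synthetic stand-in for a feature matrix: 10 features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

reduced, k = pca_reduce(data)
print(reduced.shape, k)
```

With the 99% threshold used in this work, the same procedure applied to the 512-dimensional feature matrix would return the reduced feature set passed to the classifier.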

2.4 Classification Using CWO-KELM

Chaotic Whale Optimization: The whale optimization algorithm (WOA) proposed in [16] is an optimization approach based on the hunting behaviour of humpback whales, known as bubble-net hunting. WOA assumes that the current best solution is the prey or is close to the optimal solution. Once the target prey is defined, the remaining whales update their locations to move towards the best search candidate as the iterations proceed. This can be represented mathematically as:

$$\vec{W}(i+1) = \vec{W}^{*}(i) - \vec{A} \cdot \vec{D} \tag{2}$$

where

$$\vec{D} = \left| \vec{C} \cdot \vec{W}^{*}(i) - \vec{W}(i) \right| \tag{3}$$

$\vec{A}$ and $\vec{C}$ are two coefficient vectors, defined as in [16] by Eqs. (4) and (5). $\vec{W}^{*}$ and $\vec{W}$ represent the location of the prey and of the whale, respectively, at the current iteration $i$.

$$\vec{A} = 2a \cdot \vec{r} - a \tag{4}$$

$$\vec{C} = 2 \cdot \vec{r} \tag{5}$$

where $\vec{r}$ is a random vector with components in [0, 1]. The parameter $a$ decreases linearly from 2 to 0 over the iterations, as given by Eq. (6):

$$a = 2 - i \cdot \frac{2}{maxiter} \tag{6}$$

However, the humpback whales move around the target prey along a shrinking circle and along a spiral-shaped path at the same time. Hence, a probability of 0.5 is assumed for selecting either the shrinking circle or the spiral path when updating the location of the whales:

$$
\vec{W}(i+1) =
\begin{cases}
\vec{W}^{*}(i) - \vec{A} \cdot \vec{D} & p < 0.5 \\
\vec{D}' \cdot e^{bt} \cdot \cos(2\pi t) + \vec{W}^{*}(i) & p \ge 0.5
\end{cases}
\tag{7}
$$

where

$$\vec{D}' = \left| \vec{W}^{*}(i) - \vec{W}(i) \right| \tag{8}$$

Here $\vec{D}'$ is the distance between the whale and the prey (distinct from $\vec{D}$ in Eq. (3)), $b$ is a constant that defines the shape of the logarithmic spiral, $t$ is a random number in [−1, 1], and $p$ is drawn randomly from [0, 1].

To search for the target prey (exploration stage), values of $\vec{A}$ with $\vec{A} > 1$ or $\vec{A} < -1$ are used, which force the search candidates to move away from a reference whale. The exploration stage can be formulated as:

$$\vec{D} = \left| \vec{C} \cdot \vec{W}_{rand}(i) - \vec{W}(i) \right| \tag{9}$$

$$\vec{W}(i+1) = \vec{W}_{rand}(i) - \vec{A} \cdot \vec{D} \tag{10}$$

where $\vec{W}_{rand}$ is the position vector of a random whale selected from the current population. For a detailed description of WOA, readers are referred to [16].

Although the convergence rate of WOA is considerably good, it still cannot perform well in searching for the global optimum, which influences the convergence speed of the technique. To overcome this issue, a chaos-based WOA technique is adopted in this article. A chaotic sequence has three important properties, namely ergodicity, quasi-stochasticity, and sensitivity to initial conditions, which help it search for the solution faster than a purely stochastic search [22]. The dynamic properties of chaotic maps make them suitable for use in various optimization algorithms to explore the search space more robustly [27]. Almost every meta-heuristic technique obtains the randomness of its stochastic components from a probability distribution. However, replacing such random values with chaotic sequences can be more promising. In the present work, a logistic chaotic function [2] is incorporated into the WOA algorithm. The logistic map is given by:

$$L_{i+1} = c L_i (1 - L_i), \quad c = 4 \tag{11}$$

Classification Using Proposed CWO-KELM: Extreme learning machine (ELM) is a type of single hidden layer feed-forward network (SLFN) proposed by Huang et al. [13], which has been used in many research domains [3,10,23]. Unlike traditional algorithms such as BPNN [15] and SVM [7], ELM is capable of achieving higher classification accuracy with a faster convergence rate. In this work, a variant of ELM referred to as kernel ELM (KELM) is used, as it is capable of providing better results than the conventional ELM [12]. In KELM, a kernel function replaces the random feature mapping of the traditional ELM and thus yields more stable output weights.
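To make the role of the kernel concrete, the sketch below trains a small KELM classifier. It assumes the closed form commonly given in the KELM literature, $\beta = (I/C + \Omega)^{-1} T$, with a Gaussian (RBF) kernel matrix $\Omega$, a regularization (penalty) constant $C$, a kernel width $\gamma$, and one-hot targets $T$; the text above does not spell out this formula, and the function names and toy data are purely illustrative.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def kelm_fit(x_train, targets, c, gamma):
    """Solve (I / C + Omega) beta = T for the output weights beta."""
    omega = rbf_kernel(x_train, x_train, gamma)
    return np.linalg.solve(np.eye(len(x_train)) / c + omega, targets)


def kelm_predict(x_new, x_train, beta, gamma):
    """Return, for each row of x_new, the class with the largest KELM output."""
    scores = rbf_kernel(x_new, x_train, gamma) @ beta
    return np.argmax(scores, axis=1)


# Toy two-class problem with one-hot targets (illustrative data only).
x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
beta = kelm_fit(x, t, c=2.0 ** 4, gamma=1.0)
predictions = kelm_predict(np.array([[0.05, 0.1], [1.0, 0.9]]), x, beta, gamma=1.0)
print(predictions)
```

Because no random hidden-layer weights are involved, repeated runs with the same $C$ and $\gamma$ give the same output weights, which is the stability property referred to above.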
The kernel ELM uses two important parameters, namely the penalty parameter ($C$) and the kernel parameter ($\gamma$), to obtain the final output weights. However, finding optimal values for these parameters is a challenging task. This motivates the use of an evolutionary algorithm to find optimal values of the two parameters and thereby obtain better convergence. In the present work, a chaotic whale optimization algorithm is combined with KELM to find the optimized values of $C$ and $\gamma$. Prior to the classification module, the whole dataset is divided into training, validation, and testing sets using 10-fold stratified cross-validation (SCV) to prevent over-fitting. The flowchart of the proposed CWO-KELM model is shown in Fig. 2. The steps involved in the proposed CWO-KELM are as follows:

1. Start the process with a random initialization of the candidate solutions in the population, so that each solution has a set of $C$ and $\gamma$ values:

$$K = [C_1, C_2, \ldots, C_j, \gamma_1, \gamma_2, \ldots, \gamma_j] \tag{12}$$

   $C$ and $\gamma$ are initialized from the range $[2^{-8}, 2^{-6}, 2^{-4}, \ldots, 2^{8}]$.
2. Initialize the WOA quantities $\vec{A}$, $p$, $\vec{C}$, $a$, and $t$.
3. For each candidate solution, determine the fitness value (classification accuracy) using KELM. The fitness value is calculated on the validation set in order to prevent over-fitting.
4. Sort the whales in descending order of fitness and select the position of the best whale, i.e., the one with the highest fitness value.
5. Update the values of $\vec{A}$ and $p$ with the chaotic map of Eq. (11).
6. Update the position of the candidate whales based on the values of $\vec{A}$ and $p$, and find the position of the best whale.
   – If $p < 0.5$ and $|A| < 1$, update the position of the current whale using Eq. (2).
   – If $p < 0.5$ and $|A| \ge 1$, select a random whale and update the position of the current whale with respect to it using Eq. (10).
   – If $p \ge 0.5$, update the position of the current whale using Eq. (7).
7. Generate the new best whale as

$$
W(i+1) =
\begin{cases}
W(i+1) & \text{if } f(W(i+1)) > f(W(i)) \\
W(i) & \text{otherwise}
\end{cases}
\tag{13}
$$

   where $f(W(i+1))$ and $f(W(i))$ denote the fitness values of the updated whale and the previous whale, respectively.
8. Find the out-of-bound cases in the new solution and limit them to the range $[2^{-8}, 2^{8}]$:

$$
W(i+1) =
\begin{cases}
2^{-8} & \text{if } W(i+1) < 2^{-8} \\
2^{8} & \text{if } W(i+1) > 2^{8}
\end{cases}
\tag{14}
$$

9. Repeat steps 3–8 until the predefined number of iterations is reached.

Finally, the optimal values of $C$ and $\gamma$ are obtained and validated on the test set to get the overall performance of the proposed CWO-KELM-based model.
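The steps above can be sketched as a short search loop. The sketch is an illustration under several assumptions that are not taken from the paper: `toy_fitness` is a hypothetical stand-in for the KELM validation accuracy, $A$ and $C$ are drawn as scalars per whale rather than as vectors, the spiral constant is set to $b = 1$, the chaotic sequence is started at 0.7, the coefficient $C$ and the spiral variable $t$ use ordinary uniform random numbers, and candidates are clipped (step 8) before their fitness is evaluated so that the fitness is always computed on feasible values.

```python
import math
import random

LOW, HIGH = 2.0 ** -8, 2.0 ** 8  # search bounds from step 8


def logistic_map(value, c=4.0):
    """One iteration of the logistic chaotic map of Eq. (11)."""
    return c * value * (1.0 - value)


def toy_fitness(whale):
    """Hypothetical stand-in for KELM validation accuracy; peaks at C=2^3, gamma=2^-2."""
    log_c, log_gamma = math.log2(whale[0]), math.log2(whale[1])
    return -((log_c - 3.0) ** 2 + (log_gamma + 2.0) ** 2)


def clip(value):
    """Eq. (14): pull an out-of-bound coordinate back to the nearest bound."""
    return min(max(value, LOW), HIGH)


def cwo_search(fitness, n_whales=10, max_iter=50, b=1.0, seed=0):
    rng = random.Random(seed)
    grid = [2.0 ** e for e in range(-8, 9, 2)]           # step 1
    whales = [[rng.choice(grid), rng.choice(grid)] for _ in range(n_whales)]
    scores = [fitness(w) for w in whales]                # step 3
    chaos = 0.7                                          # assumed chaotic seed
    for i in range(max_iter):
        a = 2.0 - i * (2.0 / max_iter)                   # Eq. (6)
        best = list(whales[scores.index(max(scores))])   # step 4
        for k, whale in enumerate(whales):
            chaos = logistic_map(chaos)                  # step 5: chaotic r
            big_a = 2.0 * a * chaos - a                  # Eq. (4)
            chaos = logistic_map(chaos)                  # step 5: chaotic p
            p = chaos
            big_c = 2.0 * rng.random()                   # Eq. (5)
            if p < 0.5:
                # |A| < 1: move towards the best whale (Eq. 2);
                # otherwise move relative to a random whale (Eq. 10).
                ref = best if abs(big_a) < 1.0 else rng.choice(whales)
                cand = [r - big_a * abs(big_c * r - w) for r, w in zip(ref, whale)]
            else:
                t = rng.uniform(-1.0, 1.0)               # spiral branch of Eq. (7)
                cand = [abs(r - w) * math.exp(b * t) * math.cos(2.0 * math.pi * t) + r
                        for r, w in zip(best, whale)]
            cand = [clip(v) for v in cand]               # step 8
            cand_score = fitness(cand)
            if cand_score > scores[k]:                   # step 7, Eq. (13)
                whales[k], scores[k] = cand, cand_score
    top = scores.index(max(scores))
    return whales[top], scores[top]


best_whale, best_score = cwo_search(toy_fitness)
print(best_whale, best_score)
```

In the proposed model, `toy_fitness` would be replaced by a function that trains KELM with the candidate $(C, \gamma)$ on the training folds and returns the accuracy on the validation fold.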

3 Experimental Results and Analysis

The proposed CAD model is evaluated on two standard benchmark datasets, namely MIAS [25] and DDSM [11]. A total of 314 and 1500 images are collected from MIAS and DDSM, respectively. The collected images are first classified as normal or abnormal and then as benign or malignant using the proposed CAD system. The performance of the proposed model is evaluated in terms of accuracy, sensitivity, specificity, area under the curve (AUC), and the receiver operating characteristic (ROC) curve. In addition, the proposed scheme is compared against some recently designed CAD schemes.

Fig. 2. Flowchart of the proposed CWO-KELM

Prior to the feature extraction module, ROIs are separated from the unnecessary background regions by cropping. Using the ground truth information regarding the coordinates of the abnormalities in the images, ROIs of size 256 × 256 are generated. After cropping, the ROIs are pre-processed using CLAHE to enhance the contrast. Then, the CLBP technique is applied to the extracted ROIs to obtain a feature matrix of size $s \times F$, where $s$ and $F$ indicate the number of ROIs and the number of generated features, respectively. In this work, CLBP generates 512 features ($F$), which is quite large. To reduce the size of the feature vector and simplify classification, PCA is used, which reduces the number of features from 512 to 14 while preserving 99% of the variance of the original data. The reduced features are passed to the proposed CWO-KELM classifier, which classifies the mammograms as normal or abnormal and then as benign or malignant.

Table 1 reports the performance results attained with the proposed CAD model. The highest accuracy achieved on the MIAS dataset is 100% for both normal-abnormal and benign-malignant classification. Similarly, for the DDSM dataset, accuracies of 99.48% and 98.61% are achieved for normal-abnormal and benign-malignant classification, respectively. Additionally, the ROC curves generated by the proposed classifier are plotted in Figs. 3 and 4, together with the corresponding AUC values for the MIAS and DDSM datasets, respectively.
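For reference, accuracy, sensitivity, and specificity follow from the confusion-matrix counts in the usual way. The sketch below uses the standard definitions; the function name and the toy labels are illustrative only and do not reproduce the reported results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy labels: 1 = abnormal (positive class), 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```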

Fig. 3. ROC of MIAS (true positive rate vs. false positive rate); MIAS (N-A) AUC = 1, MIAS (B-M) AUC = 1

Fig. 4. ROC of DDSM (true positive rate vs. false positive rate); DDSM (N-A) AUC = 0.9903, DDSM (B-M) AUC = 0.9841

Further, the proposed CAD model is compared with five recently developed CAD schemes. The comparison is made in terms of classification accuracy and is given in Table 2.

Table 1. Performance measures obtained by the proposed CWO-KELM-based model (CLAHE+CLBP+PCA+CWO-KELM); N-Normal, A-Abnormal, B-Benign, M-Malignant

Dataset  Performance measure  N-A     B-M
MIAS     Sensitivity          1       1
         Specificity          1       1
         Accuracy (%)         100     100
DDSM     Sensitivity          0.9945  0.9886
         Specificity          0.9912  0.9869
         Accuracy (%)         99.48   98.61

Table 2. Comparison with some of the other existing CAD schemes

Reference  Scheme                          Classification accuracy (%)
[21]       Statistical Features+LBP+SVM    98.63 (DDSM)
[26]       Firefly+ANN                     95.23 (DDSM)
[14]       Parasitic metric learning       96.7 (MIAS), 97.4 (DDSM)
[4]        BEMD+SVM                        95 (MIAS)
[24]       WCS-LBP+SVM-RFE+Random Forest   97.25 (MIAS)
Proposed   CLAHE+CLBP+PCA+CWO-KELM         100 (MIAS), 98.61 (DDSM)

4 Conclusion

In the present work, an enhanced CAD system has been proposed for breast cancer classification in digital mammograms. First, CLAHE is used to enhance the low-contrast images. Then, CLBP is employed to extract the texture features, followed by feature reduction using PCA. The reduced feature set is then passed to a CWO-KELM-based classifier to classify the mammograms. The proposed model has been evaluated on two benchmark datasets, namely MIAS and DDSM. Furthermore, its performance has been compared with five recent schemes, and the proposed model, with only 14 features, achieves improved results over the competing schemes. The high accuracy of the proposed scheme can help radiologists make accurate diagnostic decisions and reduce unnecessary biopsies.

References

1. Ahmed, F., Hossain, E., Bari, A., Hossen, M.S.: Compound local binary pattern (CLBP) for rotation invariant texture classification. Int. J. Comput. Appl. 33(6), 5–10 (2011)
2. Alatas, B.: Chaotic bee colony algorithms for global numerical optimization. Expert Syst. Appl. 37(8), 5682–5687 (2010)
3. Bai, Z., Huang, G.B., Wang, D., Wang, H., Westover, M.B.: Sparse extreme learning machine for classification. IEEE Trans. Cybern. 44(10), 1858–1870 (2014)
4. Bajaj, V., Pawar, M., Meena, V.K., Kumar, M., Sengur, A., Guo, Y.: Computer-aided diagnosis of breast cancer using bi-dimensional empirical mode decomposition. Neural Comput. Appl. 1–9 (2017)
5. Chithra Devi, M., Audithan, S.: Analysis of different types of entropy measures for breast cancer diagnosis using ensemble classification. Biomed. Res. 28(7), 3182–3186 (2017)
6. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2016)
7. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
8. Dhahbi, S., Barhoumi, W., Kurek, J., Swiderski, B., Kruk, M., Zagrouba, E.: False-positive reduction in computer-aided mass detection using mammographic texture analysis and classification. Comput. Meth. Prog. Biomed. 160, 75–83 (2018)
9. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, New York (2012)
10. Han, M., Liu, B.: Ensemble of extreme learning machine for remote sensing image classification. Neurocomputing 149, 65–70 (2015)
11. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, W.P.: The digital database for screening mammography. In: Proceedings of the 5th International Workshop on Digital Mammography, pp. 212–218. Medical Physics Publishing (2000)
12. Huang, G.B., Wang, D.H., Lan, Y.: Extreme learning machines: a survey. Int. J. Mach. Learn. Cybern. 2(2), 107–122 (2011)
13. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
14. Jiao, Z., Gao, X., Wang, Y., Li, J.: A parasitic metric learning net for breast mass classification based on mammography. Pattern Recogn. 75, 292–301 (2018)
15. Junguo, H., Guomo, Z., Xiaojun, X.: Using an improved back propagation neural network to study spatial distribution of sunshine illumination from sensor network data. Ecol. Model. 266, 86–96 (2013)
16. Mirjalili, S., Lewis, A.: The whale optimization algorithm. Adv. Eng. Softw. 95, 51–67 (2016)
17. World Health Organization: Burden: mortality, morbidity and risk factors. Global Status Report on Noncommunicable Diseases 2011 (2010)
18. Pizer, S.M., Johnston, R.E., Ericksen, J.P., Yankaskas, B.C., Muller, K.E.: Contrast-limited adaptive histogram equalization: speed and effectiveness. In: Proceedings of the First Conference on Visualization in Biomedical Computing, pp. 337–345. IEEE (1990)
19. Raj, P., Muthulekshmi, M.: Review of cancer statistics in India. Int. J. Adv. Signal Image Sci. 1(1), 1–4 (2015)
20. Rampun, A., Scotney, B.W., Morrow, P.J., Wang, H., Winder, J.: Breast density classification using local quinary patterns with various neighbourhood topologies. J. Imaging 4(1), 14 (2018)
21. Reyad, Y.A., Berbar, M.A., Hussain, M.: Comparison of statistical, LBP, and multi-resolution analysis features for breast mass classification. J. Med. Syst. 38(9), 100 (2014)
22. dos Santos Coelho, L., Mariani, V.C.: Use of chaotic sequences in a biologically inspired algorithm for engineering design optimization. Expert Syst. Appl. 34(3), 1905–1913 (2008)
23. Silvestre, L.J., Lemos, A.P., Braga, J.P., Braga, A.P.: Dataset structure as prior information for parameter-free regularization of extreme learning machines. Neurocomputing 169, 288–294 (2015)
24. Singh, V.P., Srivastava, S., Srivastava, R.: Effective mammogram classification based on center symmetric-LBP features in wavelet domain using random forests. Technol. Health Care 25(4), 709–727 (2017)
25. Suckling, J., et al.: The mammographic image analysis society digital mammogram database. In: Excerpta Medica. International Congress Series, vol. 1069, pp. 375–378 (1994)
26. Thawkar, S., Ingolikar, R.: Classification of masses in digital mammograms using firefly based optimization. Int. J. Image Graph. Sig. Process. 10(2), 25 (2018)
27. Yang, D., Li, G., Cheng, G.: On the efficiency of chaos optimization algorithms for global optimization. Chaos Solitons Fractals 34(4), 1366–1375 (2007)

A Novel Echo State Network Model Using Bayesian Ridge Regression and Independent Component Analysis

Hoang Minh Nguyen, Gaurav Kalra, Tae Joon Jun, and Daeyoung Kim(✉)

School of Computing, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
{minhhoang,gvkalra,taejoon89,kimd}@kaist.ac.kr

Abstract. We propose a novel Bayesian Ridge Echo State Network (BRESN) model for nonlinear time series prediction, based on Bayesian Ridge Regression and Independent Component Analysis. BRESN has a regularization effect that avoids over-fitting, and at the same time it is robust to noise owing to its probabilistic strategy. In BRESN we also use Independent Component Analysis (ICA) for dimensionality reduction, and show that ICA improves the model's accuracy more than other reduction techniques. Furthermore, we evaluate the proposed model on both synthetic and real-world datasets to compare its accuracy with twelve combinations of four other regression models and three different choices of dimensionality reduction techniques, and measure its running time. Experimental results show that our model significantly outperforms other state-of-the-art ESN prediction models while maintaining a satisfactory running time.

Keywords: Echo State Network · Bayesian Ridge Regression · Independent Component Analysis · Nonlinear time series prediction

1 Introduction

Analyzing time-dependent data and forecasting future values have been widely studied and applied in a multitude of domains, including economics, engineering, and the natural and social sciences. Practical applications of time series involve both univariate and multivariate analysis, with linear and nonlinear dynamics [7,11]. Consequently, various neural network and support vector machine (SVM) based models have been proposed, such as multilayer perceptrons (MLP) [8], radial basis function (RBF) neural networks [9], extreme learning machines (ELM) [4], and echo state networks (ESN) [6].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 24–34, 2018. https://doi.org/10.1007/978-3-030-01421-6_3

ESN is a type of recurrent neural network (RNN) designed to solve the vanishing and exploding gradient problems. Its core idea is to drive a large, fixed set of randomly generated neurons (called the reservoir) with the input signal to induce a nonlinear response in each neuron [5]. In other words, ESN avoids the gradient problem by training only the output weights; this gives ESN a significantly shorter training time than other multi-layer RNNs (e.g., Long Short-Term Memory (LSTM)), which typically require powerful Graphics Processing Units (GPUs).

In this paper, we propose a novel ESN model, namely the Bayesian Ridge Echo State Network (BRESN), to improve the prediction capability of existing ESN models. Firstly, BRESN uses Bayesian Ridge Regression, which adds a probabilistic perspective and the concept of a 'prior' to the regularization of classical ridge regression, and overall provides higher robustness. Secondly, we introduce Independent Component Analysis for dimensionality reduction in BRESN and show that, combined with Bayesian Ridge Regression, it provides more accurate predictions than previous reduction techniques. Finally, we evaluate our BRESN model on both synthetic and real-world datasets and show that it significantly outperforms other state-of-the-art models in terms of accuracy while maintaining a satisfactory running time.

The rest of the paper is organized as follows. Section 2 reviews related work, and Sect. 3 gives an overview of the ESN architecture. Section 4 explains the components of our BRESN model, while Sect. 5 presents and discusses our experimental setup and results. Finally, concluding remarks are provided in Sect. 6.

2 Related Work

The pseudoinverse is a commonly used method for training the output weights of an ESN. In practical applications, however, the pseudoinverse can easily lead to ill-posed problems and weak generalization capability. To resolve these problems, regularization methods such as Tikhonov regularization can be used. Traditional Tikhonov regularization, or ridge regression, penalizes high coefficient values in order to 'simplify' the trained model as much as possible. This allows the method to avoid over-fitting by minimizing the impact of irrelevant features. Noise injection is an alternative, but it is not as stable as standard Tikhonov regularization [1].

In addition to ridge regression, which is still one of the most widely applied and best-performing ESN output regression techniques, various other ESN variants have been proposed. One is the use of Support Vector Regression (SVR) together with ESN, replacing the 'kernel trick' with a 'reservoir trick' to deal with nonlinearity; in other words, linear SVR is performed in the high-dimensional 'reservoir' state space [12]. Another notable work is [3], where a probabilistic approach, instead of regularization, is applied to train the ESN output weights.

As ESN often deals with high-dimensional data, dimensionality reduction techniques have been proposed and evaluated in [10]. In that work, Principal Component Analysis (PCA) and kernel Principal Component Analysis (kPCA) were shown to consistently improve ESN prediction accuracy. The same work also reported promising results when using ν-SVR, an SVR variant that uses a hyperparameter ν to control the number of support vectors, together with PCA and kPCA.

3 Basics of Echo State Network

The architecture of the Echo State Network (without feedback, for purely input-driven dynamical pattern recognition) is depicted in Fig. 1. The circles depict the input $u$, the reservoir state $x$, and the output $y$. The gray dashed squares depict the input-to-reservoir weight matrix $W_{ir}$ and the reservoir weight matrix $W_{rr}$. These two matrices are randomly initialized with real values sampled from a uniform distribution on the interval [−1, 1]. The solid squares depict the reservoir-to-output weight matrix $W_{ro}$ and the input-to-output weight matrix $W_{io}$. The diamond labeled $z^{-1}$ depicts the unit delay operator, and the polygon illustrates the non-linear transformation performed by the neurons of the network.

Fig. 1. Echo State Network architecture (nodes u, x, y; weight blocks labeled Wri, Wrr, Wor, Woi; unit delay z^-1)

The system and output equations of the discrete-time ESN are as follows:

$$x[n] = f(W_{rr}\, x[n-1] + W_{ir}\, u[n]) + \nu(n-1) \tag{1}$$

$$y[n] = W_{ro}\, x[n] + W_{io}\, u[n] \tag{2}$$

where $f$ is a sigmoid function (usually the logistic sigmoid or the tanh function) and $\nu$ is a noise vector added to the reservoir states. By definition, the two weight matrices $W_{ir}$ and $W_{rr}$ are randomly initialized and not trained, while $W_{ro}$ and $W_{io}$ of the readout are optimized for specific tasks. Suppose we have a training set of length $T_{tr}$, from which the following input-output pairs can be formed:

$$(u[1], y[1]), \ldots, (u[T_{tr}], y[T_{tr}]) \tag{3}$$

In the training phase, also called state harvesting, the reservoir states $x[1], \ldots, x[T_{tr}]$ are harvested using Eq. (1), and the target outputs are used in Eq. (2). The reservoir states and inputs are stacked into a matrix $S$, and the target outputs are stacked into a vector $Y$, in order to train the ESN readout layer:

$$
S = \begin{bmatrix} x^T[1], & u^T[1] \\ \vdots & \vdots \\ x^T[T_{tr}], & u^T[T_{tr}] \end{bmatrix},
\quad
Y = \begin{bmatrix} y[1] \\ \vdots \\ y[T_{tr}] \end{bmatrix}
\tag{4}
$$
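State harvesting as described by Eqs. (1) and (4) can be sketched in a few lines of NumPy. The reservoir size, the sine-wave input, and the rescaling of $W_{rr}$ to a spectral radius of 0.9 are assumptions chosen for this illustration (the rescaling is a common ESN heuristic that the text above does not mention), and `harvest_states` is a hypothetical helper name.

```python
import numpy as np


def harvest_states(inputs, w_ir, w_rr, noise_std=0.0, seed=0):
    """Run Eq. (1) over an input sequence and stack [x[n]^T, u[n]^T] rows as in Eq. (4)."""
    rng = np.random.default_rng(seed)
    n_reservoir = w_rr.shape[0]
    state = np.zeros(n_reservoir)
    rows = []
    for u in inputs:
        noise = noise_std * rng.standard_normal(n_reservoir)
        state = np.tanh(w_rr @ state + w_ir @ u) + noise
        rows.append(np.concatenate([state, u]))
    return np.vstack(rows)


rng = np.random.default_rng(1)
n_inputs, n_reservoir, t_tr = 1, 20, 100
# Random, untrained weights sampled uniformly from [-1, 1].
w_ir = rng.uniform(-1.0, 1.0, size=(n_reservoir, n_inputs))
w_rr = rng.uniform(-1.0, 1.0, size=(n_reservoir, n_reservoir))
# Assumed heuristic: rescale the reservoir matrix to spectral radius 0.9.
w_rr *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_rr)))

u_seq = np.sin(0.2 * np.arange(t_tr)).reshape(-1, 1)
S = harvest_states(u_seq, w_ir, w_rr)
print(S.shape)  # (100, 21): 20 reservoir states plus 1 input per time step
```

Only the readout weights acting on the rows of `S` are trained afterwards; `w_ir` and `w_rr` stay fixed.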

4 Bayesian Ridge Echo State Network (BRESN)

The overall architecture of our Bayesian Ridge Echo State Network (BRESN) model is shown in Fig. 2.

Fig. 2. Bayesian Ridge Echo State Network (BRESN) architecture (blocks: Inputs; Time Series Reconstruction by time-delayed embedding; Reservoir; Dimensionality Reduction by Independent Component Analysis; Readout by Bayesian Ridge Regression; Genetic Algorithm loop that generates the next individual and evaluates fitness)

The input time series is first reconstructed using the time-delay reconstruction method and fed into BRESN. As BRESN has a variety of hyperparameters that need to be optimized, a genetic algorithm is used for this purpose. Individuals are generated by the genetic algorithm, and the individual with the best cross-validation fitness is chosen for testing the model.

4.1 Time Series Reconstruction

In order to provide quality predictions, it is necessary to identify the original phase space of the observed time series. This can be done by converting observations into state vectors, a process known as phase space reconstruction. Takens' Embedding Theorem [14] shows that, with a suitable dimension, we can obtain a structure that is topologically equivalent to the original 'attractor' of the time series. Given a time series $[X_1, X_2, \ldots, X_N]$, where $X_i = [x_1(i), x_2(i), \ldots, x_d(i)]^T$ ($d$ dimensions) and $i = 1, 2, \ldots, N$, suppose we set the embedding dimension vector as $M = [m_1, m_2, \ldots, m_d]$, where $m_j$ ($j = 1, 2, \ldots, d$) is the embedding dimension of the $j$th variable, and $\tau_j$ ($j = 1, 2, \ldots, d$) is its time delay. Then a time-delayed phase space reconstruction can be created as follows:

$$
\begin{aligned}
V(k) = [\, & x_1(k),\, x_1(k-\tau_1),\, \ldots,\, x_1(k-(m_1-1)\tau_1), \\
& x_2(k),\, x_2(k-\tau_2),\, \ldots,\, x_2(k-(m_2-1)\tau_2), \\
& \ldots \\
& x_d(k),\, x_d(k-\tau_d),\, \ldots,\, x_d(k-(m_d-1)\tau_d) \,]
\end{aligned}
\tag{5}
$$

The dimension of the time series after phase space reconstruction is $m_1 + m_2 + \ldots + m_d$. Suppose that $\rho$ is the prediction horizon. Then, according to Takens' Embedding Theorem, with suitable time delay and embedding dimension parameters there generally exist functions $F_1, \ldots, F_d$ such that:

$$
\begin{aligned}
x_1(k+\rho) &= F_1(V(k)) \\
x_2(k+\rho) &= F_2(V(k)) \\
&\;\;\vdots \\
x_d(k+\rho) &= F_d(V(k))
\end{aligned}
\tag{6}
$$
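The reconstruction of Eq. (5) and the input-target pairs implied by Eq. (6) can be sketched in plain Python. The helper names `delay_embed` and `embedding_pairs` are hypothetical, and indices are 0-based here whereas the equations count from 1.

```python
def delay_embed(series, dims, delays, k):
    """Build V(k) of Eq. (5).

    series: list of d lists, one per variable x_j
    dims:   embedding dimensions m_j
    delays: time delays tau_j
    """
    needed = max((m - 1) * tau for m, tau in zip(dims, delays))
    if k < needed:
        raise ValueError("k is too small for the requested delays")
    vector = []
    for x_j, m_j, tau_j in zip(series, dims, delays):
        for lag in range(m_j):
            vector.append(x_j[k - lag * tau_j])
    return vector


def embedding_pairs(series, dims, delays, horizon):
    """Pair each V(k) with the values rho = horizon steps ahead, as in Eq. (6)."""
    start = max((m - 1) * tau for m, tau in zip(dims, delays))
    end = len(series[0]) - horizon
    return [
        (delay_embed(series, dims, delays, k), [x_j[k + horizon] for x_j in series])
        for k in range(start, end)
    ]


# Two toy variables with m = [3, 2] and tau = [2, 1].
series = [list(range(10)), list(range(10, 20))]
v5 = delay_embed(series, dims=[3, 2], delays=[2, 1], k=5)
pairs = embedding_pairs(series, dims=[3, 2], delays=[2, 1], horizon=1)
print(v5)        # [5, 3, 1, 15, 14]
print(pairs[0])  # ([4, 2, 0, 14, 13], [5, 15])
```

Each vector has $m_1 + m_2 = 5$ entries, matching the reconstructed dimension stated above.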

To determine suitable values of the time delay $\tau_j$ and the embedding dimension $m_j$, methods such as autocorrelation or mutual information can be used.

4.2 Dimensionality Reduction

To deal with the very high-dimensional data in the readout layer (Eq. 4), we employ dimensionality reduction to overcome the multicollinearity problem. Independent Component Analysis (ICA) is chosen for this purpose. Applied to the matrix $S$ (Eq. 4), ICA treats the source $S$ as a linear combination of independent non-Gaussian components. ICA attempts to 'un-mix' the source into $S = S_i A$, where $S_i$ contains the independent components and $A$ is the mixing matrix. In other words, ICA searches for the $A$ that maximizes the non-Gaussianity of the sources; as a result, it can also be used for dimensionality reduction, with the resulting $S_i$ being the reduced version of $S$.

At the center of the ICA algorithm, neg-entropy ($J$) can be used to measure non-Gaussianity; it is fast to compute and more robust than kurtosis-based methods. With a single non-quadratic function $G$, the approximation of the neg-entropy of a variable $s$ has the form:

$$J(s) \propto \left[ E\{G(s)\} - E\{G(v)\} \right]^2 \tag{7}$$

where $E$ denotes expectation, $s$ is assumed to have zero mean and unit variance, and $v$ is a random variable following a normal distribution with zero mean and unit variance. In this work, we choose $G$ to be the log cosh function; more specifically, for a variable $u$, $G(u) = \frac{1}{c} \log(\cosh(cu))$, where $c$ is a suitable constant ($c$ is set to 1 in this work).

4.3 Bayesian Ridge Regression (BayeRidge)

To train the ESN readout layer, linear regression can be used; however, this method easily overfits. This is because both reservoir states and inputs (with increased dimensions from reconstruction) are stacked into the matrix $S$ of Eq. 4, making it very high-dimensional. One common approach to this problem is ridge regression, which penalizes large coefficient values by solving the following regularized least-squares problem:

$$
W_{ls}^{*} = \arg\min_{W} \left( \|Y - SW\|^2 + \lambda \|W\|^2 \right)
\tag{8}
$$

where $\lambda$ is the L2 regularization coefficient and $W = [W_{io}\ W_{ro}]$. Larger values of $\lambda$ cause the components of $W$ to shrink more towards zero. In matrix terms, the objective on the right-hand side of Eq. 8 is the same as:

$$
(Y - SW)^T (Y - SW) + \lambda W^T W
\tag{9}
$$

where $W^T$ denotes the transpose of the matrix form of $W$. Minimizing Eq. 9 gives the Ridge estimator:

$$
W = (S^T S + \lambda I)^{-1} S^T Y
\tag{10}
$$

where $I$ is the identity matrix. A Bayesian view of the Ridge Regression discussed above can be obtained by considering the standard regression model $Y = SW + \epsilon$ under the following two conditions: (i) the error $\epsilon$ has a normal distribution with mean 0 and known variance matrix $\sigma^2 I$, and (ii) $W$ has a prior normal distribution with known mean $\alpha$ and known variance matrix $Z$. The posterior probability of $W$ can be obtained using Bayes' theorem:

$$
p(W \mid Y) \sim N\!\left( \left(Z^{-1} + \tfrac{1}{\sigma^2} S^T S\right)^{-1} \left(Z^{-1}\alpha + \tfrac{1}{\sigma^2} S^T Y\right),\ \left(Z^{-1} + \tfrac{1}{\sigma^2} S^T S\right)^{-1} \right)
\tag{11}
$$

Using Eq. 11, if we set $\alpha = 0$ and $Z = (\sigma^2/\lambda) I$, the posterior mean of $W$ is equal to $(S^T S + \lambda I)^{-1} S^T Y$, which is the same as the Ridge estimator in Eq. 10. In other words, the L2 penalty weighted by $\lambda$ is equivalent to placing a Gaussian prior on the weights $W$. To complete the specification of the priors, the priors for the variances of $\epsilon$ and $W$ also need to be defined. Suppose $\varphi_\epsilon = 1/\sigma^2$ and $\varphi_w = \lambda/\sigma^2$ are the precisions of $\epsilon$ and $W$, respectively; then their priors can be suitably defined by the following Gamma distributions [2]:

$$
p(\varphi_\epsilon) \sim \mathrm{Gamma}(\alpha_1, \alpha_2)
\tag{12}
$$

$$
p(\varphi_w) \sim \mathrm{Gamma}(\lambda_1, \lambda_2)
\tag{13}
$$
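The equivalence between the Ridge estimator (Eq. 10) and the posterior mean of Eq. 11 with $\alpha = 0$ and $Z = (\sigma^2/\lambda) I$ can be checked numerically. The sketch below is illustrative only: it uses a single output column, a small hand-written linear solver in place of a linear algebra library, and made-up data; the function names are hypothetical and are not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gram(s: Matrix) -> Matrix:
    """Return S^T S."""
    cols = list(zip(*s))
    return [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]


def moment(s: Matrix, y: List[float]) -> List[float]:
    """Return S^T Y for a single output column Y."""
    return [sum(p * q for p, q in zip(col, y)) for col in zip(*s)]


def ridge(s: Matrix, y: List[float], lam: float) -> List[float]:
    """Eq. 10: W = (S^T S + lam * I)^{-1} S^T Y."""
    g = gram(s)
    for i in range(len(g)):
        g[i][i] += lam
    return solve(g, moment(s, y))


def posterior_mean(s: Matrix, y: List[float], lam: float, sigma2: float) -> List[float]:
    """Mean of Eq. 11 with alpha = 0 and Z = (sigma2 / lam) * I."""
    g = [[v / sigma2 for v in row] for row in gram(s)]
    for i in range(len(g)):
        g[i][i] += lam / sigma2  # Z^{-1} = (lam / sigma2) * I
    return solve(g, [v / sigma2 for v in moment(s, y)])


# Made-up design matrix (bias column plus one feature) and targets.
S = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
Y = [1.1, 2.9, 5.2, 6.8]
w_ridge = ridge(S, Y, lam=0.5)
w_bayes = posterior_mean(S, Y, lam=0.5, sigma2=0.3)
print(w_ridge, w_bayes)
```

The two weight vectors agree up to rounding for any positive noise variance, because the factor $1/\sigma^2$ cancels between the two bracketed terms of the posterior mean.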

As a result, the hyperprior parameters $\alpha_1$, $\alpha_2$, $\lambda_1$, and $\lambda_2$ are the hyperparameters that need to be estimated for Bayesian Ridge Regression. The key difference between Ridge and BayeRidge is that the Bayesian approach makes predictions by integrating over the distribution of the model parameter ($W$) instead of using a specific estimated value. This property allows Bayesian Ridge Regression to reduce overfitting (as its predictions are essentially averaged over many possible solutions) and, as a result, to improve predictive capability compared to classical Ridge Regression.

4.4 Hyperparameters Optimization Using Genetic Algorithm

To optimize the set of hyperparameters for BRESN, a genetic algorithm is used with Gaussian mutation, random crossover, tournament selection, and elitism. The genetic algorithm is run for 20 generations with a population size of 50, 30 offspring per generation, a mutation probability of 0.2, and a crossover probability of 0.5. To select the best individual, the genetic algorithm attempts to minimize the following fitness function:

$$
Fit(\theta) = (1 - r)\,Err(Y) + r \cdot d / N_r
\tag{14}
$$

where $\theta$ is an individual, and the ratio $r$ is set to 0.9 in this work. The fitness function not only minimizes prediction errors (for targeted outputs $Y$) but also penalizes models with high complexity, measured by the dimension $d$.
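As a small illustration of how Eq. 14 trades accuracy against model size, the sketch below scores two hypothetical individuals. The candidate values and the function name `fitness` are assumptions made for this example; only the formula and r = 0.9 come from the text.

```python
def fitness(err: float, d: int, n_r: int, r: float = 0.9) -> float:
    """Eq. 14: weigh prediction error against the reduction ratio d / N_r."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return (1.0 - r) * err + r * d / n_r


# Hypothetical individuals: (validation error, kept dimensions d, reservoir size N_r).
candidates = {"small": (0.050, 20, 200), "large": (0.045, 120, 200)}
scores = {name: fitness(*values) for name, values in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

With r = 0.9 the complexity term dominates, so the individual with the slightly higher error but far fewer dimensions receives the lower (better) fitness.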

5 Results and Discussion

5.1 Experimental Setup

Benchmark Models: In this work, we compare BRESN against 12 combinations of 4 other regression models and 3 other dimensionality reduction choices. The 4 benchmark regression models are Ridge Regression (Ridge), Linear Support Vector Regression (SVR), ν-Support Vector Regression (ν-SVR), and Bayesian Regression (Bayesian). The 3 benchmark dimensionality reduction choices are no reduction (Identity), Principal Component Analysis (PCA), and kernel Principal Component Analysis (kPCA).

Datasets: To evaluate the accuracy of the prediction models, we have used 2 synthetic and 2 real-world datasets. The 2 synthetic datasets are Lorenz (Lorenz) and Rossler (Rossler) chaotic time series generated for 4000 time steps, while the 2 real-world datasets are the daily closing prices between January 1st, 2000 and December 31st, 2017 of Standard and Poor's 500 stock data¹ (SP500) and the 13-month smoothed monthly total international sunspot number between July 1749 and September 2017 (Sunspots) [13]. In each dataset, the first 50% of the data is used for training, the next 20% for cross-validation, and the last 30% for testing. For time series reconstruction, the time delay and embedding dimension $(\tau, m)$ for Lorenz, Rossler, SP500, and Sunspots are (10, 1), (3, 13), (10, 1), and (10, 1), respectively.

Accuracy Metric: To evaluate the prediction accuracy of the models, and for the error function Err in Eq. 14, we use the Normalized Root Mean Squared Error (NRMSE). For a testing set with real values $Y = \{y[1], \ldots, y[T_{te}]\}$ and prediction results $P = \{p[1], \ldots, p[T_{te}]\}$, NRMSE is defined as follows:

$$
NRMSE(P, Y) = \frac{\sqrt{\frac{1}{T_{te}} \sum_{i=1}^{T_{te}} \left(p[i] - y[i]\right)^2}}{std(Y)}
\tag{15}
$$

where $std(Y)$ is the standard deviation of $Y$.

¹ https://finance.yahoo.com/quote/SPY/history/.
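A direct transcription of Eq. 15 into Python might look as follows. One detail is an assumption: the text does not say whether std(Y) is the population or the sample standard deviation, and the population form is used here. The example values are made up.

```python
import math
import statistics
from typing import Sequence


def nrmse(pred: Sequence[float], real: Sequence[float]) -> float:
    """Eq. 15: RMSE between predictions and real values, divided by std(Y)."""
    if len(pred) != len(real) or len(real) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mse = sum((p - y) ** 2 for p, y in zip(pred, real)) / len(real)
    # Assumption: population standard deviation of the real values.
    return math.sqrt(mse) / statistics.pstdev(real)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(nrmse(y_pred, y_true))
```

Because the error is divided by the spread of the real values, an NRMSE of 1 corresponds roughly to always predicting the mean, which makes scores comparable across datasets with different scales.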


Hyperparameter Optimization: The hyperparameters optimized by the genetic algorithm are shown in Table 1. Each hyperparameter is searched within the interval [min, max] with resolution Δ, except that reservoir sparsity/connectivity is fixed at 0.25 to maintain sparse ESN weights.

Table 1. Hyperparameter intervals and resolutions. For general ESN: number of neurons in reservoir (N_r), state update noise (ξ), input scaling (w_i), teacher/output scaling (w_o), spectral radius (ρ); for Ridge Regression: regularization (λ); for Linear- and Nu-SVR: error term penalty (C), epsilon-insensitive loss function hyperparameter (ε), nu hyperparameter (ν), kernel coefficient (γ); for Bayesian Ridge: shape and inverse scale parameters for Gamma distribution (λ1, λ2, α1, α2); for PCA, kPCA, ICA: dimensionality reduction ratio (d/N_r)

Model            Hyperparameter   min     max     Δ
ESN              N_r              100     500     5
ESN              ξ                0.0     0.1     0.01
ESN              w_i              0.1     0.9     0.08
ESN              w_o              0.1     0.9     0.08
ESN              ρ                0.5     1.4     0.09
Ridge            λ                0.001   1.0     0.1
SVR, ν-SVR       C                0.001   10.0    1.0
SVR, ν-SVR       ε                0.001   2.0     0.1
SVR, ν-SVR       ν                0.001   1.0     0.1
SVR, ν-SVR       γ                0.001   1.0     0.1
BayeRidge        λ1               0.001   1.0     0.1
BayeRidge        λ2               0.001   1.0     0.1
BayeRidge        α1               0.001   1.0     0.1
BayeRidge        α2               0.001   1.0     0.1
PCA, kPCA, ICA   d/N_r            0.001   1.0     0.1

5.2 Dimensionality Reduction Technique

Experimental results have shown the effectiveness of dimensionality reduction techniques, including PCA and kPCA, for training the ESN readout layer [10]. To justify our choice of ICA for dimensionality reduction, Table 2 compares the accuracy of Bayesian Ridge Regression with different dimensionality reduction techniques across the 4 datasets.

Table 2. NRMSE of different dimensionality reduction techniques for Bayesian Ridge Regression (lowest NRMSE in each row marked with *)

Dataset    Identity       PCA            kPCA           ICA
Lorenz     6.54 × 10^-6   4.29 × 10^-8   3.78 × 10^-8   3.62 × 10^-8 *
Rossler    1.92 × 10^-5   3.13 × 10^-7   4.34 × 10^-5   2.80 × 10^-7 *
SP500      3.95 × 10^-1   1.65 × 10^-1   1.37 × 10^-1   4.19 × 10^-2 *
Sunspots   2.59 × 10^-2   2.28 × 10^-2   2.29 × 10^-2   2.19 × 10^-2 *

The effectiveness of dimensionality reduction can clearly be seen from Table 2: almost all dimensionality reduction techniques perform better than 'Identity' (no reduction), the only exception being kPCA on the Rossler dataset. Also, even though kPCA provides a kernel extension to PCA, it does not always perform better when used together with Bayesian Ridge Regression (PCA is better on Rossler, and the two give similar results on Sunspots). Finally, the results in Table 2 show that ICA outperforms all other dimensionality reduction techniques on all 4 datasets.

5.3 Accuracy Comparison

To evaluate the accuracy of our BRESN model, we have compared it to 4 other regression models (Ridge, SVR, ν-SVR, and Bayesian) and 3 other dimensionality reduction approaches (Identity, i.e. no dimensionality reduction, PCA, and kPCA). As the 2 synthetic datasets (Lorenz, Rossler) are generated from well-defined equations and thus have less noise than the 2 real-world datasets (SP500, Sunspots), all the models provide better prediction results for the former than for the latter. It is also worth noting that the Ridge and Bayesian models perform better than SVR and ν-SVR in most cases given the same dimensionality reduction technique.

Table 3. NRMSE of different models (lowest NRMSE for each dataset marked with *)

Model      Reduction   Lorenz           Rossler          SP500            Sunspots
Ridge      Identity    2.72 × 10^-5     1.35 × 10^-4     4.49 × 10^-1     2.22 × 10^-2
Ridge      PCA         9.90 × 10^-7     1.56 × 10^-4     1.03 × 10^-1     2.24 × 10^-2
Ridge      kPCA        1.02 × 10^-6     5.52 × 10^-5     6.04 × 10^-2     2.44 × 10^-2
SVR        Identity    2.90 × 10^-3     1.68 × 10^-3     6.35 × 10^-1     3.39 × 10^-2
SVR        PCA         5.50 × 10^-4     6.08 × 10^-3     4.33 × 10^-1     4.17 × 10^-2
SVR        kPCA        8.56 × 10^-4     1.27 × 10^-3     4.91 × 10^-2     3.11 × 10^-2
ν-SVR      Identity    3.76 × 10^-3     2.56 × 10^-4     7.00 × 10^-1     3.32 × 10^-2
ν-SVR      PCA         4.87 × 10^-3     3.04 × 10^-4     5.84 × 10^-1     3.74 × 10^-2
ν-SVR      kPCA        2.16 × 10^-3     2.51 × 10^-4     6.97 × 10^-1     3.37 × 10^-2
Bayesian   Identity    1.30 × 10^-4     2.09 × 10^-4     4.81 × 10^-1     2.21 × 10^-2
Bayesian   PCA         1.96 × 10^-4     8.12 × 10^-7     1.40 × 10^-1     2.27 × 10^-2
Bayesian   kPCA        2.34 × 10^-4     2.90 × 10^-5     8.67 × 10^-2     2.26 × 10^-2
BRESN      ICA         3.62 × 10^-8 *   2.80 × 10^-7 *   4.19 × 10^-2 *   2.19 × 10^-2 *

From the results of the 12 benchmark models, dimensionality reduction (PCA, kPCA) generally improves the prediction capability of ESN models. When either PCA or kPCA is applied, the NRMSE in most cases either stays at a similar level or decreases, in some cases significantly compared to Identity, for example Ridge on the Lorenz dataset or Bayesian on the Rossler dataset. Furthermore, except for Bayesian on the Rossler dataset, kPCA generally either improves on PCA or provides similar accuracy. The NRMSE results in Table 3 show that our BRESN model, with Bayesian Ridge Regression and Independent Component Analysis, outperforms all 12 other models on all 4 datasets. By combining the regularization and probabilistic aspects of Ridge and Bayesian, BRESN demonstrates both high accuracy and robustness in nonlinear time series prediction.

5.4 Running Time

We have also measured the running time of different regression models and dimensionality reduction techniques by varying the number of neurons in the ESN reservoir while keeping all other hyperparameters fixed. For each model, the running time has been obtained by averaging over 20 runs, with the number of neurons ranging from 100 to 1000 in steps of 100 (Fig. 3).

[Figure 3: two plots of running time in seconds (vertical axis) against the number of neurons in the reservoir, from 100 to 1000 (horizontal axis). (a) Different regression models with no dimensionality reduction: Ridge, SVR, ν-SVR, Bayesian, BayeRidge. (b) Different dimensionality reduction techniques for Bayesian Ridge Regression: Identity, PCA, kPCA, ICA.]

Fig. 3. Running time of different regression models and dimensionality reduction techniques. BayeRidge denotes Bayesian Ridge Regression (used in BRESN), and all hyperparameters except the number of neurons in the reservoir are fixed, including the spectral radius ρ at 0.9 and the dimensionality reduction ratio d/N_r at 0.1.

As can be seen from the figure, our BRESN model maintains a satisfactory running time. Even without ICA to reduce dimensionality, one run still takes a reasonable 5.31 s, while a 'full' BRESN with all components and 1000 neurons in the reservoir reduces this to 3.69 s per run. It is also worth noting that in this work we experimented with at most 500 neurons in the reservoir for all ESN models, and that we ran the models concurrently over multiple rounds of training and testing. These factors reduce the average running time even further, making BRESN's training and testing speed satisfactory for real-world use.

6 Conclusion

In this paper, we have proposed a novel Bayesian Ridge Echo State Network (BRESN), which introduces Bayesian Ridge Regression for regression and Independent Component Analysis (ICA) for dimensionality reduction in ESN readout training. We have shown that ICA provides larger accuracy improvements than other dimensionality reduction techniques. We have also tested BRESN on both synthetic and real-world datasets, compared it with 12 combinations of 4 other regression models and 3 other dimensionality reduction choices, and measured its running time. The results show that BRESN significantly outperforms other state-of-the-art models in terms of accuracy while still having a satisfactory running time.


Acknowledgments. This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2018-1-00877) supervised by the IITP (Institute for Information & communications Technology Promotion), and International Research & Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning of Korea (2016K1A3A7A03952054).

References

1. Bishop, C.M.: Training with noise is equivalent to Tikhonov regularization. Neural Comput. 7(1), 108–116 (1995)
2. Bishop, C.M., Tipping, M.E.: Bayesian regression and classification. Nato Science Series sub Series III Computer And Systems Sciences, vol. 190, pp. 267–288 (2003)
3. Han, M., Mu, D.: Multi-reservoir echo state network with sparse Bayesian learning. In: Zhang, L., Lu, B.-L., Kwok, J. (eds.) ISNN 2010. LNCS, vol. 6063, pp. 450–456. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13278-0_58
4. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)
5. Jaeger, H.: The "echo state" approach to analysing and training recurrent neural networks-with an erratum note. German National Research Center for Information Technology GMD, Bonn, Germany, Technical report, vol. 148(34), p. 13 (2001)
6. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
7. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis, vol. 7. Cambridge University Press, Cambridge (2004)
8. Koskela, T., Lehtokangas, M., Saarinen, J., Kaski, K.: Time series prediction with multilayer perceptron, FIR and Elman neural networks. In: Proceedings of the World Congress on Neural Networks, pp. 491–496. Citeseer (1996)
9. Leung, H., Lo, T., Wang, S.: Prediction of noisy chaotic time series using an optimal radial basis function neural network. IEEE Trans. Neural Netw. 12(5), 1163–1172 (2001)
10. Løkse, S., Bianchi, F.M., Jenssen, R.: Training echo state networks with regularization through dimensionality reduction. Cogn. Comput. 9(3), 364–378 (2017)
11. Reinsel, G.C.: Elements of Multivariate Time Series Analysis. Springer, New York (2003)
12. Shi, Z., Han, M.: Support vector echo-state machine for chaotic time-series prediction. IEEE Trans. Neural Netw. 18(2), 359–372 (2007)
13. SILSO World Data Center: The international sunspot number. International Sunspot Number Monthly Bulletin and online catalogue (1749-2017). http://www.sidc.be/silso/
14. Takens, F.: Detecting strange attractors in turbulence. In: Rand, D., Young, L.S. (eds.) Dynamical Systems and Turbulence, Warwick 1980. LNM, vol. 898, pp. 366–381. Springer, Heidelberg (1981). https://doi.org/10.1007/BFb0091924

Image Processing

A Model for Detection of Angular Velocity of Image Motion Based on the Temporal Tuning of the Drosophila

Huatian Wang1, Jigen Peng2, Paul Baxter1, Chun Zhang3, Zhihua Wang3, and Shigang Yue1(B)

1 The Computational Intelligence Lab (CIL), School of Computer Science, University of Lincoln, Lincoln LN6 7TS, UK
[emailprotected], {pbaxter,syue}@lincoln.ac.uk
2 School of Mathematics and Information Science, Guangzhou University, Guangzhou 510006, China
[emailprotected]
3 Institute of Microelectronics, Tsinghua University, Beijing 100084, China
{zhangchun,zhihua}@tsinghua.edu.cn

Abstract. We propose a new bio-plausible model based on the visual systems of Drosophila for estimating the angular velocity of image motion in insects' eyes. The model implements both preferred direction motion enhancement and non-preferred direction motion suppression, which were recently discovered in Drosophila's visual neural circuits, to give a stronger directional selectivity. In addition, the angular velocity detecting model (AVDM) produces a response largely independent of the spatial frequency in grating experiments, which enables insects to estimate flight speed in cluttered environments. This also coincides with behavioural experiments in which honeybees fly through tunnels with stripes of different spatial frequencies.

Keywords: Motion detection · Insect vision · Angular velocity · Spatial frequency

1 Introduction

Insects, though equipped with only a mini-brain, have very complex visual processing systems, which are the foundation of motion detection. How visual information is processed, and especially how insects estimate flight speed, has attracted strong interest for a long time. Here we use Drosophila as an example, since its visual processing pathways have been studied the most among insects using anatomy, two-photon imaging, and electron microscopy, to explain in general how signals are processed in insects' visual systems. This inspires us to build a new bio-plausible neural network for estimating the angular velocity of image motion.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 37–46, 2018. https://doi.org/10.1007/978-3-030-01421-6_4


Drosophila have tens of thousands of ommatidia, each of which has a small lens and contains 8 photoreceptors, R1-R8, which send their axons into the optic lobe to form a visual column. The optic lobe, the most important part of the visual system, consists of four retinotopically organized layers: lamina, medulla, lobula, and lobula plate. The number of columns in the optic lobe is the same as the number of ommatidia [1]. Each column contains roughly one hundred neurons and can process light intensity increment (ON) and decrement (OFF) signals in parallel [2]. In each column, visual signals of light change can be transformed into motion signals by these ON and OFF pathways [3] (see Fig. 1).

Fig. 1. Visual system of Drosophila with ON and OFF pathways. In each column of the visual system, motion information is mainly captured by photoreceptors R1-R6 and processed by lamina cells L1-L3, medulla neurons (Mi1, Mi9, Tm1, Tm2, Tm3, Tm9), and T4 and T5 neurons. The lobula plate functions as a map of visual motion, with four layers representing four cardinal directions (front to back, back to front, upward, and downward). T4 and T5 cells, which show both preferred direction enhancement and non-preferred direction suppression, are the first to give a strong directional selectivity [5]. This figure is based on the figures of Takemura and Arenz [3, 4].

How the visual system described above detects motion has been studied for a long time. Hassenstein and Reichardt proposed an elementary motion detector (EMD) model to describe how animals sense motion [6]. This HR detector uses two neighbouring viewpoints as a pair to form a detecting unit. The delayed signal from one input is multiplied by the undelayed signal from the other to give a directional response (Fig. 2a). This ensures that motion in the preferred direction produces a higher response than motion in the non-preferred direction. A competing model, the BL model proposed by Barlow and Levick, implements non-preferred direction suppression instead [7]. The BL detector divides the undelayed signal from one input by the delayed signal from the other arm, located on the preferred side, to give a directionally selective response (Fig. 2b). Both models can be implemented in Drosophila's visual system, since patch-clamp recordings showed a temporal delay for Mi1 with regard to Tm3 in the ON pathway and for Tm1 with regard to Tm2 in the OFF pathway [8]. This also provides the neural basis for the delay-and-correlation mechanism.

Fig. 2. Comparison of the motion detectors. (a) In the Hassenstein-Reichardt detector, a delayed signal from the left photoreceptor is multiplied by the signal from the right to give a preferred direction (PD) enhancement response. (b) In the Barlow-Levick detector, the signal from the left is divided by a delayed signal from the right to suppress the null direction (ND) response. (c) A recently proposed Full T4 detector combines both PD enhancement and ND suppression. (d) The proposed angular velocity detecting unit (AVDU) combines enhancement and suppression with a different structure.

Recently, an HR/BL hybrid model called the Full T4 model has been proposed, based on the finding that both preferred direction enhancement and non-preferred direction suppression operate in Drosophila's visual circuits [9]. The proposed motion detector consists of three input elements. The delayed signal from the left arm is multiplied by the undelayed signal from the middle arm, and the product is then divided by the delayed signal from the right arm to give the final response (Fig. 2c). Circuits connecting T4 or T5 cells that are anatomically able to implement both mechanisms also support this hybrid model [3]. According to their simulation, this structure can produce a stronger directional selectivity than the HR model and the BL model.

However, one problem of the models mentioned above is that they prefer a particular temporal frequency, which causes the ambiguity that one response can correspond to two different speeds. Although they give a directional response to motion, it is hard to estimate the motion speed from them. These models can therefore explain only part of motion detection, while some of the descending neurons, according to Ibbotson's records, show a response that grows monotonically as the angular velocity increases [10]. Moreover, the response is largely independent of the spatial frequency of the stimulus, which is consistent with the corridor behaviour experiments on honeybees [11].

To solve this problem, Riabinina presents an angular velocity detector based mainly on the HR model [12]. The key point of this model is that it uses the sum of the absolute values of the excitation caused by differentiating the signal intensity over time, which is strongly related to the temporal frequency and independent of the angular velocity, as the denominator to eliminate the temporal dependence of the final output. Cope argues that this model simulates a circuit separate from the optomotor circuit, which requires additional neurons and costs more energy. Instead, Cope proposes a more bio-plausible model as an extension of the optomotor circuit, which uses the ratio of two HR models with different delays [13]. The main idea is that the ratio of two bell-shaped response curves with different optimal temporal frequencies can give a monotonic response and so eliminate the ambiguity. The problem is that the delays are chosen by the method of undetermined coefficients and need to be finely tuned, which may weaken the robustness of the model.

The neural structures revealed by recent research inspire us to build a new angular velocity detection model. We agree that visual motion detection systems are complex and should have three or more input elements, like the Full T4 model, as recent research indicates. However, the structure of a model that implements both enhancement and suppression can be very different from the Full T4 model. Here we give an example, the AVDU (Fig. 2d). The AVDU (angular velocity detecting unit) divides the product of the delayed signal from the left arm and the undelayed signal from the middle arm by the product of the delayed signal from the middle arm and the undelayed signal from the right arm. This structure combines the HR and BL models to give a directional motion response. Moreover, according to our simulation, the AVDU is suitable as a fundamental unit for an angular velocity detection model that is largely independent of the spatial frequency of the grating pattern.

2 Results

Based on the proposed AVDU detector, we build the angular velocity detecting model (AVDM) to estimate visual motion velocity in insects' eyes. AVDM consists of an ommatidial pattern with 27 horizontal by 36 vertical ommatidia per eye, covering a field of view of 270° horizontally by 180° vertically. Every 3 adjacent ommatidia in the horizontal direction form a detector for horizontal progressive image motion. Each detector consists of two AVDUs with different sampling rates and produces a directional response for preferred progressive motion (i.e. image motion on left eye when flying backward). The ratio of the two AVDUs with different sampling rates then produces a response that is largely independent of the spatial frequency of the sinusoidal grating. The outputs of all detectors are then summed and averaged to give a response representing the velocity of the visual image motion (see Fig. 3).

Fig. 3. Angular velocity detecting model. The model uses three neighbouring photoreceptors as a unit, and each unit contains two AVDUs with different sampling rates. The output is then averaged over the whole visual field to give the final response.

We simulated the OFF pathway of Drosophila's visual neural circuits with the sinusoidal grating moving in the preferred direction. The normalized responses of AVDM over different velocities and spatial periods, compared with experimental results [14], can be seen in Fig. 4. The response curves of AVDM are generally in accordance with the experimental data. In particular, when the spatial period is 14°, the curve shows a notably lower response than for other spatial periods. This might be caused by the suppression of high temporal frequencies by T4/T5 cells [4], since the descending neurons are located downstream of the optomotor circuit. It can also be explained by Jonathan's research on the spatial frequency tuning of the bumblebee Bombus impatiens, which indicates that high spatial frequency affects speed estimation [15]. This will be discussed in later research.

Fig. 4. Comparison of AVDM and experimental records under different angular velocities. (a) The responses of AVDM over different spatial periods. (b) The responses of one type of descending neuron (DNIII_4) over different spatial periods, based on Ibbotson's records [14].

To obtain more general results, the spatial period of the grating and the angular velocity of the image motion are varied over a wide range (Fig. 5). All response curves under different periods increase nearly monotonically, and the responses depend only weakly on the spatial period of the grating. This coincides with the responses of the descending neurons according to Ibbotson's records [10,14]. It is important for insects estimating flight speed or gauging the distance of a foraging journey in a cluttered environment.

Although the results of Riabinina's model use different velocity and spatial frequency metrics and Cope's model uses spikes as the final output, the trends of the curves still show the performance of the models, so we give their results here for reference (Fig. 6). In general, AVDM performs better than Riabinina's model, whose response curves for 4 different spatial frequencies are separated from each other [12]. Cope's model, which is based on the optomotor circuit, is more bio-plausible than Riabinina's model. However, it only performs well when the speed is around 100 deg/s, a range that the semilog coordinates emphasize, while honeybees mainly maintain a constant angular velocity of 200–300 deg/s in open flight [16]. Another problem of Cope's model is that the response to gratings with very high frequency should be lower, rather than remaining spatially independent, according to Ibbotson's records on descending neurons [10,14]. Our model AVDM uses a band-pass temporal frequency filter derived from experimental data [4] to deal with this problem. As can be seen, AVDM produces a lower response when the spatial period is 14° and a response largely independent of the spatial period in the range from 36° to 72° (Fig. 5).

[Figure 5: normalized average response (vertical axis, 0 to 1) against angular velocity in deg/s (horizontal axis, logarithmic, 10^2 to 10^3), with one curve for each of the values 36, 45, 54, 63, and 72 (legend title: Responses by Spatial Frequencies).]

Fig. 5. Responses of AVDM over different spatial periods under different angular velocities.

[Figure 6: two panels. Panel (a) has angular speed in rad/s on the horizontal axis and panel (b) has angular speed in deg/s on the horizontal axis; the vertical axis labels are "Response, arbitrary units" and "average response".]

Fig. 6. Comparison of the responses of two other models. (a) Riabinina's model uses rad/s as the velocity metric, and the spatial frequencies are 10 m^-1 (solid), 20 m^-1 (dashed), 30 m^-1 (dotted), and 40 m^-1 (dot-dashed) [12]. (b) Cope's model uses spikes to represent the model response and uses the method of undetermined coefficients to decide the two delays of the correlation system [13].

3 Methods

All simulations were carried out in Matlab (© The MathWorks, Inc.), and the layout of the AVDM neural layers is given below.

3.1 Input Signals Simulation

The input signal is simulated using two-dimensional image frames with a sinusoidal grating moving across the visual field. AVDU1 processes all input images, while AVDU2 samples only half of them. The spatial period $\lambda$ (deg) of the grating and the moving speed $V$ (deg/s) are treated as variables. This naturally induces a temporal frequency of $V/\lambda$ (Hz) and an angular frequency $\omega = 2\pi V/\lambda$. Consider the sinusoidal grating moving in the visual field of a detecting unit with three receptors A, B, and C. Let $I_0$ be the mean light intensity and $m$ the modulation amplitude; then the signal in receptor A can be expressed as $I_0 + m \cdot \sin(\omega t)$. Let $\Delta\phi$ denote the angular separation between neighbouring receptors; then the signal of receptor B is $I_0 + m \cdot \sin(\omega(t - \Delta\phi/V))$, and the signal of receptor C is $I_0 + m \cdot \sin(\omega(t - 2\Delta\phi/V))$. The input signal of one eye can therefore be expressed as:

$$
I_{x,y}(t) = I_0 + m \cdot \sin(\omega(t - y\Delta\phi/V)),
\tag{1}
$$

where $(x, y)$ denotes the location of the ommatidium.

3.2 AVDM Neural Layers

(1) Photoreceptor. The first layer of the AVDM neural network receives the input signals and computes the change in light intensity to obtain the primary information of visual motion:

$$
P_{x,y}(t) = I_{x,y}(t) - I_{x,y}(t - 1).
\tag{2}
$$

(2) ON & OFF Pathways. The luminance changes are separated into two pathways according to the neural structure of the Drosophila visual system, with ON representing light increments and OFF representing light decrements:

$$
\begin{aligned}
P^{ON}_{x,y}(t) &= \left(P_{x,y}(t) + |P_{x,y}(t)|\right)/2, \\
P^{OFF}_{x,y}(t) &= \left|\left(P_{x,y}(t) - |P_{x,y}(t)|\right)\right|/2.
\end{aligned}
\tag{3}
$$
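Equations 1 to 3 can be traced on a single ommatidium with a few lines of Python. This is a sketch, not the authors' Matlab code: the function names, the 1 ms frame interval, the grating speed and period, and the amplitudes are illustrative assumptions, while the 10° receptor spacing follows from 27 ommatidia covering 270° horizontally.

```python
import math
from typing import List, Tuple


def receptor_signal(t: float, y: int, speed: float, period: float,
                    dphi: float, i0: float = 1.0, m: float = 0.5) -> float:
    """Eq. 1: luminance at column y for a grating moving at `speed` deg/s."""
    omega = 2.0 * math.pi * speed / period
    return i0 + m * math.sin(omega * (t - y * dphi / speed))


def photoreceptor(frames: List[float]) -> List[float]:
    """Eq. 2: frame-to-frame luminance change P(t) = I(t) - I(t - 1)."""
    return [cur - prev for prev, cur in zip(frames, frames[1:])]


def on_off(p: float) -> Tuple[float, float]:
    """Eq. 3: half-wave rectify a luminance change into ON and OFF parts."""
    on = (p + abs(p)) / 2.0
    off = abs((p - abs(p)) / 2.0)
    return on, off


# Assumed settings: 1 ms frames, 100 deg/s, 36 deg period, 10 deg spacing.
dt = 0.001
frames = [receptor_signal(k * dt, y=1, speed=100.0, period=36.0, dphi=10.0)
          for k in range(400)]
changes = photoreceptor(frames)
pairs = [on_off(p) for p in changes]
print(pairs[:3])
```

At every frame exactly one of the two channels is non-zero, and ON minus OFF recovers the signed luminance change, which is why the two pathways can be processed in parallel without losing information.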

(3) Delay and Correlation. The signals are delayed and correlated following the structure of the AVDU. Taking one AVDU as an example, let $S_1$, $S_2$, $S_3$ denote the input signals of photoreceptors A (left), B (middle) and C (right), and let $S_1^D$, $S_2^D$ denote the temporally delayed signals of A and B. Then we have the following expression:

$$S_1^D(t) = m \cdot [\sin(\omega(t + \Delta T)) - \sin(\omega(t - 1 + \Delta T))] \approx M \cdot \cos[\omega(t + \Delta T)], \tag{4}$$

and similarly $S_2 \approx M \cdot \cos[\omega(t - \Delta\varphi/V)]$, $S_2^D \approx M \cdot \cos[\omega(t - \Delta\varphi/V + \Delta T)]$ and $S_3 \approx M \cdot \cos[\omega(t - 2\Delta\varphi/V)]$, where $\Delta T$ is the temporal delay of the model. According to the structure of the AVDU, the response of the detector can be expressed as $\overline{S_1^D \cdot S_2} \,/\, \overline{S_2^D \cdot S_3}$, where the bar means that the response is averaged over a time period to remove the fluctuation caused by the oscillatory input. In addition, we set a lower bound of 0.01 on the denominator to keep the output from becoming too large. This bound can also be explained by the tonic firing rate of neurons.

(4) Ratio and Average. If we set the temporal delay to 6 ms and take two sampling rates of 1 ms per frame and 2 ms per frame, we obtain the responses of the AVDU under different angular velocities, spatial periods and sampling rates. According to our simulation, although the response curves for the different sampling rates have different values, their shapes are very similar. This means that taking the ratio of the responses under different sampling rates can largely remove the influence of spatial frequency. The outputs of the detectors, each composed of three neighboring photoreceptors, are then summed and averaged over the whole visual field.

(5) Band-Pass Temporal Frequency Filter. We use the recorded temporal tuning of Drosophila to simulate the band-pass temporal frequency filter [4]. According to Arenz's experiments, the tuning optimum of the temporal frequency shifts from 1 Hz to 5 Hz with application of the octopamine agonist CDM (simulating the shift of the Drosophila from rest to flight). We therefore set the temporal frequency filter as a bell-shaped response curve which reaches its optimum at 5 Hz in semilog coordinates. In fact, the complete HR model can naturally serve as a temporal frequency filter with little modification, since its response follows a bell-shaped curve with a preferred temporal frequency.
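The ON and OFF split of step (2) is a pair of half-wave rectifications, which can be illustrated with a minimal Python sketch. The function name `split_on_off` is hypothetical and a single scalar luminance change is assumed; this is not the authors' implementation.

```python
def split_on_off(p):
    """Split one luminance-change value into its ON and OFF components.

    Follows Eq. (3): ON keeps light increments, OFF keeps the size of
    light decrements. The name and scalar interface are illustrative.
    """
    on = (p + abs(p)) / 2        # equals p when p > 0, otherwise 0
    off = abs(p - abs(p)) / 2    # equals |p| when p < 0, otherwise 0
    return on, off


# A brightening pixel only drives ON; a darkening pixel only drives OFF.
print(split_on_off(3.0))
print(split_on_off(-2.0))
```

Under this reading, exactly one of the two pathways is non-zero for any non-zero luminance change.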

4 Discussion

We proposed a bio-plausible model, the angular velocity detecting model (AVDM), for estimating image motion velocity using the latest discoveries about the neural circuits of the Drosophila visual system. We presented a new structure, the AVDU, as part of the model to implement both preferred-direction motion enhancement and non-preferred-direction motion suppression, which is found in Drosophila's neural circuits and produces a stronger directional selectivity. We use the ratio of two AVDUs with different sampling rates to give spatial-frequency-independent responses for estimating the angular velocity. In addition, the model can serve as the fundamental part of a visual odometer by integrating the output of the AVDM. It also provides a possible explanation of how visual motion detection circuits connect to the descending neurons in the ventral nerve cord.

The reason for using the ratio of two AVDUs with different sampling rates is twofold. First, it can be realized naturally in neural circuits, since one AVDU only needs to process part of the visual information while the structure and even the delay of the two AVDUs are the same. This is easier than using the ratio of two HR detectors with different delays, as in Cope's model [13], because passing signals with two different delays means that there must be either two neurotransmitters in one circuit or two separate circuits. Second, the response of an individual AVDU depends strongly on the spatial frequency of the grating, and the ratio of the responses at different sampling rates, according to our simulation, can remove the influence of the spatial frequency.

Here we only simulate the ON pathway of the visual system with T4 cells. The OFF pathway, which deals with brightness decrements, is similar. Further, detectors for forward, upward and downward motion can be constructed using the same structure, since they can be processed in parallel.

Acknowledgments. This research is supported by EU FP7-IRSES Project LIVCODE (295151), HAZCEPT (318907) and HORIZON project STEP2DYNA (691154).

References
1. Fischbach, K.F., Dittrich, A.P.M.: The optic lobe of Drosophila melanogaster. I. A Golgi analysis of wild-type structure. Cell Tissue Res. 258(3), 441–475 (1989). https://doi.org/10.1007/BF00218858
2. Joesch, M., Weber, F., Raghu, S.V., Reiff, D.F., Borst, A.: ON and OFF pathways in Drosophila motion vision. Nature 17(1), 300–304 (2011). https://doi.org/10.1038/nature09545
3. Takemura, S.Y., Nern, A., Chklovskii, D.B., Scheffer, L.K., Rubin, G.M., Meinertzhagen, I.A.: The comprehensive connectome of a neural substrate for 'ON' motion detection in Drosophila. eLife 6, e24394 (2017). https://doi.org/10.7554/eLife.24394
4. Arenz, A., Drews, M.S., Richter, F.G., Ammer, G., Borst, A.: The temporal tuning of the Drosophila motion detectors is determined by the dynamics of their input elements. Curr. Biol. 27, 929–944 (2017). https://doi.org/10.1016/j.cub.2017.01.051


5. Haag, J., Mishra, A., Borst, A.: A common directional tuning mechanism of Drosophila motion-sensing neurons in the ON and in the OFF pathway. eLife 6, e29044 (2017). https://doi.org/10.7554/eLife.29044
6. Hassenstein, B., Reichardt, W.: Systemtheoretische analyse der zeit-, reihenfolgen- und vorzeichenauswertung bei der bewegungsperzeption des rüsselkäfers chlorophanus. Zeitschrift für Naturforschung B 11(9–10), 513–524 (1956). https://doi.org/10.1515/znb-1956-9-1004
7. Barlow, H.B., Levick, W.R.: The mechanism of directionally selective units in rabbit's retina. J. Physiol. 178, 477–504 (1965). https://doi.org/10.1113/jphysiol.1965.sp007638
8. Behnia, R., Clark, D.A., Carter, A.G., Clandinin, T.R., Desplan, C.: Processing properties of ON and OFF pathways for Drosophila motion detection. Nature 512, 427–430 (2014). https://doi.org/10.1038/nature13427
9. Haag, J., Arenz, A., Serbe, E., Gabbiani, F., Borst, A.: Complementary mechanisms create direction selectivity in the fly. eLife 5, e17421 (2016). https://doi.org/10.7554/eLife.17421
10. Ibbotson, M.R.: Evidence for velocity-tuned motion-sensitive descending neurons in the honeybee. Proc. Biol. Sci. 268(1482), 2195 (2001). https://doi.org/10.1098/rspb.2001.1770
11. Srinivasan, M.V., Lehrer, M., Kirchner, W.H., Zhang, S.W.: Range perception through apparent image speed in freely flying honeybees. Vis. Neurosci. 6(5), 519–535 (1991). https://www.ncbi.nlm.nih.gov/pubmed/2069903
12. Riabinina, O., Philippides, A.O.: A model of visual detection of angular speed for bees. J. Theor. Biol. 257(1), 61–72 (2009). https://doi.org/10.1016/j.jtbi.2008.11.002
13. Cope, A., Sabo, C., Gurney, K.N., Vasislaki, E., Marshall, J.A.R.: A model for an angular velocity-tuned motion detector accounting for deviations in the corridor-centering response of the Bee. PLoS Comput. Biol. 12(5), e1004887 (2016). https://doi.org/10.1371/journal.pcbi.1004887
14. Ibbotson, M.R., Hung, Y.S., Meffin, H., Boeddeker, N., Srinivasan, M.V.: Neural basis of forward flight control and landing in honeybees. Sci. Rep. 7(1), 14591 (2017). https://doi.org/10.1038/s41598-017-14954-0
15. Dyhr, J.P., Higgins, C.M.: The spatial frequency tuning of optic-flow-dependent behaviors in the bumblebee Bombus impatiens. J. Exp. Biol. 213(Pt 10), 1643–50 (2010). https://doi.org/10.1242/jeb.041426
16. Baird, E., Srinivasan, M.V., Zhang, S., Cowling, A.: Visual control of flight speed in honeybees. J. Exp. Biol. 208(20), 3895–905 (2005). https://doi.org/10.1242/jeb.01818

Local Decimal Pattern for Pollen Image Recognition

Liping Han (✉) and Yonghua Xie

School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China

Abstract. In this paper, we propose the local decimal pattern (LDP) for pollen image recognition. Considering that the gradient image of pollen grains has more prominent textural features, we quantize by comparing the gradient magnitude of pixel blocks rather than single pixel values. Unlike the local binary pattern (LBP) and its variants, we encode by counting the pixel blocks in different quantization intervals, which makes our descriptor robust to the rotation of pollen images. In order to capture the subtle textural features of pollen images, we increase the number of quantization intervals. The average correct recognition rate of LDP on the Pollenmonitor dataset is 90.95%, which is much higher than that of the other compared pollen recognition methods. The experimental results show that our method is more suitable for the practical classification and identification of pollen images than the compared methods.

Keywords: Local decimal pattern · Pollen recognition · Textural feature · Gradient magnitude

1 Introduction

The classification of pollen particles has been widely applied in allergic pollen index forecasting, drug research, paleoclimatic reconstruction, criminal investigation, oil exploration and other fields [1]. Traditionally, pollen grains are identified by manual inspection under a microscope, which requires the operator to have rich knowledge of pollen morphology and a high level of training to obtain accurate recognition results. The commonly used discrimination criteria are the visual morphological features of the pollen grain, such as shape, polarity, aperture, size, exine stratification and thickness [2]. Observing the appearance of pollen grains costs the operator much time and effort, and often leads to misrecognition. With the development of image processing and pattern recognition [3–5], using computers to extract and classify pollen features has become an effective approach to pollen recognition.

Early pollen recognition algorithms focused on extracting shape features, among which the contour shape is a prominent feature for pollen grains with a slender oval or rounded triangular shape. However, most pollen grains have similar contour shapes, so it is difficult to distinguish different categories of pollen images by shape features alone. Considering that pollen images from different categories differ greatly in texture, more and more texture-based feature extraction methods have been proposed for the automatic classification of pollen images. For example, Punyasena et al. [6] extracted the texture and shape features of pollen images using dictionary learning and sparse coding (DLSC) and obtained a recognition rate of 86.13%; however, the recognition performance largely depends on the selection and quantity of sample blocks. Daood et al. [7] decomposed the pollen image into multiple feature layers using clustering, and then extracted the texture and geometric features (TGF) of each layer using LBP and fractal dimension, respectively. Finally, an SVM classifier was used to classify the pollen images, and a recognition rate of 86.94% was obtained. However, the method has little robustness to the rotation of pollen grains, and the decomposition of pollen images increases the dimension of the features. Boochs et al. [8] proposed a pollen recognition method combining shape, texture and aperture features (STAF), which extracted 18 shape features, 5 texture features (Gabor filters, fast Fourier transform, local binary pattern, histogram of oriented gradients, and Haralick features) and surface aperture features of pollen images. The method used a random forest classifier to identify pollen images and obtained a recognition rate of nearly 87%. Guru et al. [9] proposed a pollen classification model based on surface texture, which combined local binary pattern (LBP), Gabor wavelet, gray-level difference matrix (GLDM), and gray-level co-occurrence matrix (GLCM) for pollen recognition (LGGG), and obtained a 91.66% recognition rate. However, the computational cost of these two methods is large due to the high dimension of the combined features, which makes them impractical for real applications. Marcos et al. [10] extracted texture features using the Log-Gabor filter (LGF), discrete Tchebichef moments (DTM), local binary patterns (LBP) and the gray-level co-occurrence matrix (GLCM), and obtained a recognition rate of 94.83%; however, the fused texture feature (LDLG) contains a large amount of redundant information, and the computational process is complex.

The local binary pattern is an effective method for representing texture features and has been widely used in face recognition and texture classification [11]. The traditional local binary pattern and its variants usually use wide quantization intervals to quantize the neighboring pixels, which enhances the descriptor's robustness to illumination changes but also loses some detailed textural information. Unlike general texture images, pollen images have a relatively small range of textural variation, so it is difficult to capture the subtle textural differences between pollen images of different categories with wide quantization intervals. To solve this problem, we propose the local decimal pattern (LDP). The advantages of our method are as follows: (1) quantizing the gradient magnitude of pixel blocks instead of single pixel values reduces the effects of image noise; (2) encoding by the number of pixel blocks in each quantization interval makes the descriptor invariant to the rotation of pollen grains; (3) combining LDP features in multiple directions increases the descriptor's discrimination. Experimental results on the Pollenmonitor dataset show that the recognition rate and computation speed of our method are higher than those of most pollen recognition methods.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 47–55, 2018. https://doi.org/10.1007/978-3-030-01421-6_5

2 Local Decimal Pattern (LDP)

Most current methods for extracting pollen features combine different single features: LBP and fractal dimension in [7]; LBP, GLCM, LGF and DTM in [10]; and LBP, Gabor, etc. in [8, 9]. All of these exploit the advantages of different features to construct an optimal representation of pollen images, but the use of multiple features leads to higher computational costs. In order to build a pollen feature descriptor with high computational efficiency and high robustness to rotation and noise, we propose the Local Decimal Pattern (LDP). Figure 1 shows the implementation of the LDP feature for representing pollen images, and Fig. 2 presents the steps of the LDP-based algorithm for pollen recognition. The specific calculation process of LDP is as follows:

Fig. 1. Implementation of LDP feature for representing pollen.

Fig. 2. The steps of the algorithm based on LDP for pollen recognition.


Fig. 3. Calculation of gradient histogram of an image block. The lengths and directions of arrows represent the gradient magnitude and gradient direction of pixels respectively.

First, we calculate the image gradient; the gradient information of each pixel includes the gradient magnitude and the gradient angle. The gradient angle ranges from $-\pi$ to $\pi$, and we divide $[-\pi, \pi)$ into 8 equal-sized direction intervals. Then, a gradient histogram is calculated by accumulating the gradient magnitude of every pixel into its corresponding gradient direction, and the directions with the maximum, minimum and median gradient are marked as D1, D2 and D3, respectively (as shown in Fig. 3). The gradient magnitude of a pixel block under a given gradient direction is calculated as follows:

$$B_m^D = \sum_{k=1}^{m^2} \left( m_{P_k} \times E_D(\theta_{P_k}) \right) \tag{1}$$

$$E_D(\theta_{P_k}) = \begin{cases} 1, & \theta_{P_k} \in D \\ 0, & \theta_{P_k} \notin D \end{cases} \tag{2}$$

where $D$ is the gradient direction, $m$ is the pixel block size, and $m_{P_k}$ and $\theta_{P_k}$ are the gradient magnitude and gradient angle of the pixel $P_k$.

Second, the number of pixel blocks in the $i$-th quantization interval under gradient direction $D$ is counted as follows:

$$N_i^D = \sum_{j=1}^{n} S_i\left( B_{r,n,m,j}^D - B_{m,c}^D \right) \tag{3}$$

$$S_i(x) = \begin{cases} 1, & |x| \in Q_i \\ 0, & |x| \notin Q_i \end{cases} \tag{4}$$

$$Q_i = [\, l_i, l_{i+1} ) \tag{5}$$

where $n$ is the number of neighboring pixel blocks; $B_{r,n,m,j}^D$ is the gradient magnitude of the $m \times m$ pixel block in the square neighborhood with sampling radius $r$; $j$ is the serial number of the pixel block; $B_{m,c}^D$ is the gradient magnitude of the central pixel block under gradient direction $D$; $Q_i$ is the $i$-th quantization interval; and $l_i$ is the threshold of $Q_i$.
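The two steps above can be sketched in Python under a few stated assumptions: each pixel is given as a (magnitude, angle) pair, direction intervals are indexed 0 to 7 starting at $-\pi$, and the last quantization interval is treated as open-ended because the text gives no upper threshold for it. All function names are hypothetical and this is not the authors' MATLAB code.

```python
import math


def direction_bin(theta, num_bins=8):
    """Index of the direction interval of [-pi, pi) that contains theta."""
    width = 2 * math.pi / num_bins
    # min() folds the boundary value theta == pi into the last interval.
    return min(int((theta + math.pi) / width), num_bins - 1)


def block_magnitude(pixels, direction):
    """Eqs. (1)-(2): sum the magnitudes of pixels whose angle lies in `direction`.

    `pixels` is a list of (magnitude, angle) pairs for one m x m block.
    """
    return sum(mag for mag, theta in pixels if direction_bin(theta) == direction)


def interval_counts(neighbor_mags, center_mag, thresholds):
    """Eqs. (3)-(5): count neighbors whose |difference| falls in [l_i, l_{i+1})."""
    bounds = list(thresholds) + [math.inf]  # assumption: last interval is open-ended
    counts = [0] * len(thresholds)
    for mag in neighbor_mags:
        diff = abs(mag - center_mag)
        for i in range(len(thresholds)):
            if bounds[i] <= diff < bounds[i + 1]:
                counts[i] += 1
    return counts
```

Because only the number of neighbors per interval is kept, reordering the neighboring blocks (as a rotation of the image would) leaves the counts unchanged.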


After counting the number of neighboring pixel blocks located in the different quantization intervals, we can define the Local Decimal Pattern (LDP) as follows:

$$LDP^D = \sum_{i=1}^{L} N_i^D \times 10^{i-1} \tag{6}$$

where $L$ is the total number of quantization intervals. Finally, we calculate the LDP feature histograms under the three gradient directions, and the final representation of a pollen image is the concatenation of these LDP histograms:

$$LDPH = \left\{ LDPH^{D1}, LDPH^{D2}, LDPH^{D3} \right\} \tag{7}$$

Figure 4 shows the calculation process of the LDP of an image block under direction D1. The color of each square in the figure represents the gradient magnitude difference between a neighboring pixel block and the central pixel block under gradient direction D1, and the same color indicates that the difference belongs to the same quantization interval. In Fig. 4, the gradient magnitude differences under gradient direction D1 are quantized into 4 intervals, and the numbers of pixel blocks in the 4 quantization intervals are 4, 2, 1 and 1, respectively. This gives the local decimal pattern 1124.
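The encoding of Eq. (6) for this example can be checked with a short Python sketch. The helper name `ldp_code` is hypothetical; the counts are those read from Fig. 4.

```python
def ldp_code(counts):
    """Eq. (6): combine per-interval block counts into one decimal pattern.

    counts[0] is the count for the first quantization interval and becomes
    the units digit; each further interval moves one decimal place up.
    """
    return sum(count * 10 ** i for i, count in enumerate(counts))


# Counts for the four quantization intervals in the Fig. 4 example.
print(ldp_code([4, 2, 1, 1]))  # 1124
```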

Fig. 4. Calculation of LDP of an image block under direction D1.

3 Pollen Recognition Experiments

To evaluate our method, we performed experiments on the Pollenmonitor dataset using a computer with an Intel(R) Core(TM) i5-3210M @ 2.50 GHz processor and 6 GB of memory, running MATLAB R2014a. We randomly selected 60% of the pollen


images of each category in the Pollenmonitor dataset as training images, and the rest were used as test images. An SVM classifier [12, 13] was used for the classification and recognition of pollen images, and the correct recognition rate (CRR), recall rate (RR), F1-measure and recognition time (RT) were used to measure the experimental performance, where the F1-measure is the harmonic mean of CRR and RR.

3.1 Parameter Selection

(1) Neighbor number, sampling radius and block size: We use a sampling strategy with a fixed number of neighboring pixel blocks (n = 8) and varying block size and sampling radius (m = {2, 3, 4, 5, 6, 7, 8, 9, 10}, r = {2, 3, 4, 5, 6, 7, 8, 9, 10}).

(2) Quantization interval number: We performed experiments with different numbers of quantization intervals and found that 2 quantization intervals are not enough to represent the pollen texture feature, while too many (more than 4) lead to a higher dimension of the LDP histogram. When the number of quantization intervals is 3 or 4, the corresponding dimensions of the LDP histogram are $8 \times 10^2$ and $8 \times 10^3$, respectively. In fact, many decimal patterns never occur, so a large number of columns of the LDP histogram are empty. This is because the total number of quantized pixel blocks in the neighborhood is fixed (n = 8). Take 3 quantization intervals as an example: if the number of pixel blocks located in the first quantization interval is 7, the decimal pattern can only be 107 or 017, and other patterns such as 117 or 127 can never appear. We therefore delete the nonexistent decimal patterns from the LDP histogram, and the dimension of the LDP histogram becomes 45 and 165 when the number of quantization intervals is 3 and 4, respectively.

(3) Quantization thresholds: The quantization thresholds for L = 3 and L = 4, which depend on the pixel block size (m), are presented in Table 1.

Table 1. The quantization thresholds of different quantization levels

          l1    l2       l3           l4
L = 3     0     2m       m^3          –
L = 4     0     m + 1    (m + 1)^2    2(m + 1)^2
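The reduced histogram dimensions quoted in item (2) can be reproduced by enumerating every combination of interval counts that sums to n = 8. The sketch below is illustrative only; the function name `valid_ldp_codes` is an assumption, not part of the original method description.

```python
from itertools import product


def valid_ldp_codes(num_intervals, num_blocks=8):
    """List every LDP code whose interval counts sum to num_blocks."""
    codes = []
    for counts in product(range(num_blocks + 1), repeat=num_intervals):
        if sum(counts) == num_blocks:
            # Eq. (6): the count of interval i becomes the i-th decimal digit.
            codes.append(sum(c * 10 ** i for i, c in enumerate(counts)))
    return sorted(codes)


print(len(valid_ldp_codes(3)), len(valid_ldp_codes(4)))
# Codes with 7 blocks in the first interval (units digit 7) for L = 3.
print([code for code in valid_ldp_codes(3) if code % 10 == 7])
```

These counts are the numbers of ways to distribute 8 blocks over 3 or 4 intervals, which is why the histogram shrinks from hundreds or thousands of bins to 45 or 165.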

3.2 Experimental Results on Pollenmonitor Dataset

The Pollenmonitor dataset comprises air pollen samples from 33 different taxa collected in Freiburg and Zurich in 2006. The number of pollen images in this dataset is about 22700. Affected by the micro-sensors and irregular collection methods, some pollen images show some degree of deformation and contamination, and the image quality is generally not high.


By varying the pixel block size and sampling radius from 2 to 10, we obtain the correct recognition rates presented in Fig. 5. Using 4 quantization intervals (L = 4) performs better, and the best recognition rate is obtained with block size 5.

Fig. 5. Recognition results (%) on Pollenmonitor dataset with different block size and quantization intervals. (Plot: recognition accuracy (%) versus size of blocks and radius, from 2 to 10, for L = 3 and L = 4.)

Figure 6 presents recognition instances of 6 representative pollen categories from the Pollenmonitor dataset. Most pollen images that have clear texture and are not contaminated or deformed are correctly identified. The specific recognition results are shown in Table 2: the correct recognition rates of most pollen categories exceed 90%, and the recall rates of all categories exceed 73%. For the Corylus category, with varying degrees of rotation, our method achieved a 94.02% correct recognition rate. For the Fagus category, with severe noise, our method still obtained an 83.18% correct recognition rate.

Fig. 6. Recognition instances of 6 classic pollen taxa from the Pollenmonitor dataset.

Table 2. Recognition results of 6 classic pollen taxa in Pollenmonitor dataset

Pollen category   CRR/%   RR/%    F1-measure   RT/s
Poaceae           90.33   79.10   84.34        6.5
Corylus           94.02   85.27   89.43        6.4
Rumex             92.15   73.64   81.86        6.4
Carpinus          88.62   78.13   83.05        6.5
Fagus             83.18   76.65   79.78        6.3
Alnus             92.50   88.74   90.58        6.9
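As a quick consistency check, the F1-measure column of Table 2 can be recomputed from the CRR and RR columns using the harmonic mean defined earlier. The snippet is a sketch; the function name `f1_measure` and the choice of rows are illustrative.

```python
def f1_measure(crr, rr):
    """Harmonic mean of correct recognition rate and recall rate (both in %)."""
    return 2 * crr * rr / (crr + rr)


# (CRR, RR, reported F1) for three rows of Table 2.
rows = {
    "Poaceae": (90.33, 79.10, 84.34),
    "Corylus": (94.02, 85.27, 89.43),
    "Alnus": (92.50, 88.74, 90.58),
}
for taxon, (crr, rr, reported) in rows.items():
    print(taxon, round(f1_measure(crr, rr), 2), reported)
```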

3.3 Experimental Comparison and Analysis

We compared the best recognition rates achieved by our method using different block sizes with state-of-the-art pollen recognition methods; the experimental results on the Pollenmonitor dataset are listed in Table 3. The average correct recognition rate of our method on the Pollenmonitor dataset is 90.95%, which is on average 6.81 percentage points higher than that of the compared pollen recognition methods. The experimental results show that our proposed method has better recognition performance, and its computational efficiency is higher than that of most of the compared methods.

Table 3. Comparison of the average recognition results of our method and 5 pollen recognition methods on Pollenmonitor dataset

Method   CRR/%   RR/%    F1-measure   ART/s
DLSC     74.83   82.97   78.69        4.1
TGF      85.50   69.62   76.75        7.2
STAF     83.29   80.53   81.89        23.9
LGGG     87.21   70.15   77.76        19.2
LDLG     89.87   75.46   82.04        20.9
LDP      90.95   78.25   84.12        6.8

4 Conclusions

In this paper, we presented an LDP descriptor for pollen image recognition. Unlike most recent pollen recognition methods, which fuse different kinds of features, our method extracts a single texture feature in three directions, which decreases the dimensionality of the pollen features and increases their discrimination at the same time. Experimental results show that our method outperforms the 5 compared pollen recognition methods in extracting pollen texture features and is robust to the noise and rotation of pollen images.


Acknowledgments. This work was partially supported by the grant of the National Natural Science Foundation of China 61375030.

References
1. Treloar, W.J., Taylor, G.E., Flenley, J.R.: Towards automation of palynology 1: analysis of pollen shape and ornamentation using simple geometric measures, derived from scanning electron microscope images. J. Quat. Sci. 19, 745–754 (2004)
2. Tian, H., Cui, W., Wan, T., Chen, M.: A computational approach for recognition of electronic microscope plant pollen images. In: Congress on Image and Signal Processing, CISP 2008, pp. 259–263 (2008)
3. Chen, B., Yang, J., Jeon, B., Zhang, X.: Kernel quaternion principal component analysis and its application in RGB-D object recognition. Neurocomputing 266, 293–303 (2017)
4. Zhou, Z., Wang, Y., Wu, Q.M.J., Yang, C.N., Sun, X.: Effective and efficient global context verification for image copy detection. IEEE Trans. Inf. Forensics Secur. 12, 48–63 (2017)
5. Yuan, C., Sun, X., Lv, R.: Fingerprint liveness detection based on multi-scale LPQ and PCA. China Commun. 13, 60–65 (2016)
6. Kong, S., Punyasena, S., Fowlkes, C.: Spatially aware dictionary learning and coding for fossil pollen identification. In: Computer Vision and Pattern Recognition Workshops, pp. 1305–1314 (2016)
7. Daood, A., Ribeiro, E., Bush, M.: Pollen recognition using a multi-layer hierarchical classifier. In: International Conference on Pattern Recognition, pp. 3091–3096 (2017)
8. Boochs, F., Chudyk, C.: Development of an automatic pollen classification system using shape, texture and aperture features. In: LWA 2015 Workshops: KDML, FGWM, IR, and FGDB (2015)
9. Guru, D.S., Siddesha, S.: Texture in Classification of Pollen Grain Images. Springer India (2013)
10. Marcos, J.V., et al.: Automated pollen identification using microscopic imaging and texture analysis. Micron 68, 36–46 (2015)
11. Wolf, L., Hassner, T., Taigman, Y.: Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1978–1990 (2011)
12. Gu, B., Sheng, V.S., Sheng, S.: A robust regularization path algorithm for v-support vector classification. IEEE Trans. Neural Netw. Learn. Syst. 28, 1241 (2016)
13. Gu, B., Sheng, V.S., Tay, K.Y., Romano, W., Li, S.: Incremental support vector learning for ordinal regression. IEEE Trans. Neural Netw. Learn. Syst. 26, 1403–1416 (2017)

New Architecture of Correlated Weights Neural Network for Global Image Transformations

Sławomir Golak, Anna Jama (✉), Marcin Blachnik, and Tadeusz Wieczorek

Department of Industrial Informatics, Silesian University of Technology, Krasinskiego 8, 40-019 Katowice, Poland

Abstract. The paper describes a new extension of the convolutional neural network concept. Like the CNN, the developed network uses related weights instead of independent weights for each neuron in the network. This results in a small number of parameters optimized in the learning process and in high resistance to overtraining. However, unlike the CNN, instead of sharing weights the network takes advantage of weights that are correlated with the coordinates of a neuron and its inputs and are calculated by a dedicated subnet. This solution allows the neural layer of the network to perform a global transformation of patterns, which was unachievable for convolutional layers. The new network concept has been confirmed by verifying its ability to perform typical affine image transformations such as translation, scaling and rotation.

Keywords: Network architecture · Spatial transformation · CNN

1 Introduction

Recent approaches to object recognition make essential use of machine learning methods [1, 2]. To increase their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting [3]. To create a network capable of learning to recognize thousands of objects from millions of images, we need to build a model with a large learning capacity. Convolutional neural networks (CNNs) constitute one such class of models. They are powerful visual models which have recently enjoyed great success in large-scale image and video recognition [4–6]; this has become possible thanks to large public image repositories, such as ImageNet, and high-performance computing systems, such as GPUs or large-scale distributed clusters [7]. CNNs combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal subsampling [8, 9]. The main benefit of using CNNs is the reduced number of parameters that have to be determined during the learning process. A CNN can be regarded as a variant of the standard neural network which, instead of using fully connected hidden layers, introduces a special network structure consisting of alternating so-called convolution and pooling layers [10].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 56–65, 2018. https://doi.org/10.1007/978-3-030-01421-6_6


One of the limitations of CNNs is the problem of global pattern transformations, such as rotation or scaling of an image. Convolutional neurons, which have a spatially limited input field, are unable to identify such transformations. They are only capable of local pattern transformations, such as the detection of local features in the image. Enlarging the area of the spatial transformation would require enlarging the neurons' input field and hence the weight vector, which would forfeit the primary advantage of the CNN, namely its small number of parameters.

In order to overcome these constraints of CNNs, in this paper we introduce a new network which can replace the convolutional layer or be used as an independent network able to learn any global and/or local transformation. The proposed network consists of two networks. The main network is a single fully connected layer of neurons that directly performs the image transformation. However, the weights of this network are not determined directly; instead, to limit the number of trainable parameters, the weights of the neurons are produced by a sub-network, which is a relatively small network. The sub-network takes advantage of the observation that, for global transformations, the particular weights of individual neurons are strongly correlated with one another; hence the entire network is called the Correlated Weights Neural Network (CWNN).

The proposed network continues the earlier idea of using weights calculated by a subnet in a radial basis function (RBF) network for pattern classification [11], called the Induced Weights Artificial Neural Network (IWANN). However, in the IWANN the values of the weights were determined only from the input coordinates; the coordinates of the neurons were ignored. This approach was related to the different structure and application of the IWANN.

In this paper we describe and explain the structure of the CWNN (Sects. 2 and 3); Sect. 4 then describes the training algorithm, and in Sect. 5 we demonstrate its application to learning global transformations such as scaling, translation and rotation. We end with a summary of the obtained results and outline further research perspectives.

2 Problem Definition

The neural network presented in this paper is dedicated to global pattern transformations. Figure 1 shows a linearized representation of the main network's neurons, which are responsible for the input-output transformation; grayscale images are assumed. This single-layer network consists of linear neurons, which, without a bias, are sufficient to implement the network. Each neuron determines its output value (the gray level of a pixel of the output image) as a linear combination of all pixels in the input image. For a pattern of size N, this operation can be performed by a network with N output neurons, each connected to all N inputs. The total number of weights of this network is $N^2$, which for the relatively small (32 × 32 pixels) images presented later in the article amounts to about 1 million parameters describing the network. The number of parameters is directly reflected in the complexity of the network learning process and in the size requirements for the training dataset.
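The parameter count quoted above follows from simple arithmetic, shown here as a short Python sketch. The helper name `dense_weight_count` is hypothetical.

```python
def dense_weight_count(height, width):
    """Weights of a bias-free layer linking every input pixel to every output pixel."""
    n = height * width   # N: number of pixels in the pattern
    return n * n         # N^2: one weight per (output, input) pair


print(dense_weight_count(32, 32))  # 1048576, i.e. about 1 million
```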


Fig. 1. Neural network for a global transformation of a pattern. (Diagram: inputs I1, I2, I3, I4, ..., IN and outputs O1, O2, O3, O4, ..., ON.)

However, in the vast majority of global transformations, the network weights are correlated with the positions of the neurons and their inputs. Table 1 presents the values of the weights for the network from Fig. 1 when it performs a translation of a one-dimensional pattern by 2 elements to the right (for simplicity we consider N = 5). The columns of the table correspond to the inputs and the rows to the outputs, so a single row represents the weights of a single linear neuron. The final value of each output is determined as the weighted sum of the inputs, with the weights given in the corresponding row.

Table 1. Weights of the neural network translating a linear pattern by two elements.

      I1   I2   I3   I4   I5
O1    0    0    0    0    0
O2    0    0    0    0    0
O3    1    0    0    0    0
O4    0    1    0    0    0
O5    0    0    1    0    0

The table content shows a clear regularity in the set of weights. The value of the weight which connects the i-th output neuron with the j-th input in this network can be described by a simple equation:

$$w_{i,j} = \begin{cases} 1, & j + t = i \\ 0, & j + t \neq i \end{cases} \tag{1}$$

where $t$ is the size of the translation, $i$ is the output position and $j$ is the input position.

This formula replaces 25 network weights, and its degree of complexity is independent of the size of the transformed pattern. The essential idea of the CWNN can be explained using this example: the weights of the network are not stored as static values, but are calculated from the mutual position of the neurons and their inputs. Making practical use of the relation between neuron weights requires a subsystem that is able to learn the dependences between the neurons and their weights. To address this issue, we utilize an additional subnetwork, presented in the following section.
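The translation example can be reproduced with a few lines of Python that generate the weights from positions instead of storing them. The function names are hypothetical and positions are taken as 1-based to match Table 1.

```python
def translation_weights(n, t):
    """Eq. (1): w[i][j] = 1 when input j shifted by t lands on output i (1-based)."""
    return [[1 if j + t == i else 0 for j in range(1, n + 1)]
            for i in range(1, n + 1)]


def apply_layer(weights, pattern):
    """Each output is the weighted sum of all inputs (linear neuron, no bias)."""
    return [sum(w * x for w, x in zip(row, pattern)) for row in weights]


weights = translation_weights(5, 2)       # same values as Table 1
print(apply_layer(weights, [1, 2, 3, 4, 5]))  # [0, 0, 1, 2, 3]
```

The rule takes the same two lines for any pattern length, whereas the explicit weight table grows quadratically with it.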


3 Network Model

Figure 2 shows the structure of a single-layer network with the Correlated Weights Neural Layer (CWNL). The network has one input layer, whose topology results from the size and dimensionality of the processed input pattern. The network inputs are described by a coordinate vector whose size matches the dimensionality of the input data. The topology of the main, active CWNL layer, whose signals are transmitted directly to the network output, matches the dimensionality and the size of the output pattern. Note that both the dimensionality and the size of the input and the output can be completely different.

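[Figure 2: block diagram. The subnetwork receives the coordinates P(I) and P(O) (in the example, P(O) = {1,1} and P(I) = {1,3}) and outputs the weight ω(P(I), P(O)) used by the main network, which maps the inputs y(I) to the outputs y(O).]

Fig. 2. Structure of the neural network with correlated weights.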

It has been assumed that, for a neural network performing affine transformations on an image, it is sufficient to use neurons with a linear transition function that takes as its argument a weighted sum of the inputs, without a bias value:

$$y_i^{(O)} = \sum_{j=1}^{N^{(I)}} y_j^{(I)} \, x_{i,j,1}^{(M)} \tag{2}$$

where: $y_i^{(O)}$ – i-th output of the layer with correlated weights, $y_j^{(I)}$ – j-th input of the layer with correlated weights, $N^{(I)}$ – number of inputs of the CWNL layer, $x_{i,j,1}^{(M)}$ – output of the subnet calculating weights, i.e. the weight of the connection between the i-th output and the j-th input of the CWNL layer, M – the number of the last subnet layer.

The values of the connection weights are calculated by the subnetwork based on the coordinates of the neuron in the CWNL layer and the coordinates of the neuron's inputs. These values are determined many times by the subnet, for every combination of input image


S. Golak et al.

pixels and output image pixels. The subnetwork inputs are represented by a vector that consists of the coordinates of the CWNL neurons and their inputs:

$$P_{i,j} = P_i^{(I)} \cup P_j^{(O)} \tag{3}$$

where: $P_i^{(I)}$ – coordinates of the i-th input of the main network, $P_j^{(O)}$ – coordinates of the j-th output (neuron) of the main network.

$$x_{i,j,k}^{(0)} = P_{i,j}[k] \tag{4}$$

where: $x_{i,j,k}^{(0)}$ – the k-th input of the subnetwork calculating the weight of the connection between the i-th input and the j-th neuron of the CWNL.

The signal is processed by the subsequent layers of the subnetwork:

$$x_{i,j,k}^{(m)} = f^{(m)}\left( \sum_{l=1}^{O^{(m-1)}} x_{i,j,l}^{(m-1)} \, w_{k,l}^{(m)} + b_k^{(m)} \right) \tag{5}$$

where: $x_{i,j,k}^{(m)}$ – output of the k-th neuron of the m-th layer of the subnetwork calculating the weight of the connection between the j-th neuron and the i-th input of the CWNL layer, $f^{(m)}$ – the transition function of the neurons in the m-th layer, $O^{(m)}$ – the number of neurons in the m-th subnetwork layer, $w_{k,l}^{(m)}$, $b_k^{(m)}$ – standard weight and bias of the subnet neuron.

In the presented network structure, a multilayer perceptron is used as the subnetwork. However, this task can be performed by any other approximator.
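The forward pass described by Eqs. (2) to (5) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`make_subnet`, `subnet_weight`, `cwnl_forward`) are hypothetical, the subnetwork parameters are drawn uniformly at random rather than with the initialization used in the experiments, every layer uses a sigmoid, and pixel coordinates are taken as raw (row, column) pairs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_subnet(layer_sizes, n_coords, seed=0):
    """Random MLP parameters (assumption: uniform init); layer_sizes like [8, 4, 1]."""
    rng = random.Random(seed)
    layers, fan_in = [], n_coords
    for size in layer_sizes:
        weights = [[rng.uniform(-1, 1) for _ in range(fan_in)] for _ in range(size)]
        biases = [rng.uniform(-1, 1) for _ in range(size)]
        layers.append((weights, biases))
        fan_in = size
    return layers


def subnet_weight(layers, p_in, p_out):
    """Eqs. (3)-(5): join the coordinates and propagate them through the MLP."""
    x = list(p_in) + list(p_out)        # Eqs. (3)/(4): subnetwork input vector
    for weights, biases in layers:      # Eq. (5), one layer at a time
        x = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x[0]                         # single output neuron = connection weight


def cwnl_forward(layers, y_in, in_coords, out_coords):
    """Eq. (2): linear neurons without bias whose weights come from the subnet."""
    return [
        sum(y * subnet_weight(layers, p_i, p_o) for y, p_i in zip(y_in, in_coords))
        for p_o in out_coords
    ]


# A 2 x 2 "image": every pixel is described by its (row, column) coordinates.
coords = [(r, c) for r in range(2) for c in range(2)]
layers = make_subnet([8, 4, 1], n_coords=4)
print(cwnl_forward(layers, [0.1, 0.5, 0.9, 0.3], coords, coords))
```

In this sketch the number of trainable parameters depends only on the subnetwork layout and on the number of coordinates, not on the number of pixels, because the main layer stores no weights of its own.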

4 Learning Method

The learning algorithm of the CWNN is based on the classical minimization of the squared error function, defined as:

$$SSE = \frac{1}{2} \sum_{j=1}^{N^{(O)}} \left( y_j^{(O)} - d_j \right)^2 \tag{6}$$

where: $N^{(O)}$ – number of neurons in the layer with correlated weights, $y_j^{(O)}$ – j-th output of the layer with correlated weights, $d_j$ – desired value of the j-th output of the CWNN.

Effective learning of the CWNN requires the use of one of the gradient methods to minimize the error. In the presented network, the only parameters optimized in the learning process are the weights and biases of the subnetwork that calculates the main layer weights. This required a modification of the classical backpropagation method to allow the error to be transferred to the subnet. The error value for the output


neuron of the subnetwork, which calculates the weight of the connection between the j-th output and the i-th input of the main network, is determined by the equation:

$$r_{i,j,k}^{(M)} = \left( y_j^{(O)} - d_j \right) y_i^{(I)} \tag{7}$$

where: $y_i^{(I)}$ – i-th input of the layer with correlated weights, $y_j^{(O)}$ – j-th output of the layer with correlated weights, $d_j$ – desired value of the j-th output of the CWNN.

In the next stage, the error is backpropagated from the output through the preceding subnetwork layers. The aim of the backpropagation is to update each of the weights in the subnetwork so that the actual output moves closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole:

$$r_{i,j,k}^{(m)} = \left( \sum_{l=1}^{O^{(m+1)}} r_{i,j,l}^{(m+1)} \, w_{l,k}^{(m+1)} \right) f^{(m)\prime}\left( \sum_{l=1}^{O^{(m-1)}} x_{i,j,l}^{(m-1)} \, w_{k,l}^{(m)} + b_k^{(m)} \right) \tag{8}$$

where: $x_{i,j,k}^{(m)}$ – output of the k-th neuron of the m-th layer of the subnetwork calculating the weight of the connection between the j-th neuron and the i-th input of the CWNL layer, $O^{(m)}$ – the number of neurons in the m-th subnetwork layer, $w_{k,l}^{(m)}$, $b_k^{(m)}$ – standard weight and bias of the subnet neuron.

Based on the error values, it is possible to determine the partial derivatives of the error with respect to all parameters (weights and biases) of the subnet. Because the same subnet provides every weight of the CWNL, each partial derivative is the sum of the derivatives calculated for all weights provided by the subnet:

$$\frac{\partial E}{\partial w_{k,l}^{(m)}} = \sum_{i=1}^{N^{(I)}} \sum_{j=1}^{N^{(O)}} r_{i,j,k}^{(m)} \, x_{i,j,l}^{(m-1)}, \qquad \frac{\partial E}{\partial b_k^{(m)}} = \sum_{i=1}^{N^{(I)}} \sum_{j=1}^{N^{(O)}} r_{i,j,k}^{(m)} \tag{9}$$

where: $x_{i,j,k}^{(m)}$ – output of the k-th neuron of the m-th layer of the subnetwork calculating the weight of the connection between the j-th neuron and the i-th input of the CWNL layer, $O^{(m)}$ – the number of neurons in the m-th subnetwork layer, $N^{(O)}$ – number of neurons in the layer with correlated weights, $N^{(I)}$ – number of inputs of the CWNL layer.

For the presented network, the partial derivatives were computed using the above equations.
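The accumulation of gradients over all input/output pairs can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one-dimensional positions scaled to [0, 1], a subnetwork with a single sigmoid hidden layer, and a linear output neuron, so that the output error of Eq. (7) is the exact derivative of the SSE with respect to the generated weight. With a sigmoidal output neuron, the output error would additionally be multiplied by the derivative of that neuron's transition function. All names (`subnet`, `forward`, `sse`, `gradients`) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def position(index, count):
    """Position scaled to [0, 1]; assumes count > 1."""
    return index / (count - 1)


def subnet(params, p_in, p_out):
    """Hidden activations and generated weight for one (input, output) pair."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(W1[k][0] * p_in + W1[k][1] * p_out + b1[k])
              for k in range(len(b1))]
    return hidden, sum(W2[k] * hidden[k] for k in range(len(hidden))) + b2


def forward(params, y_in, n_out):
    n_in = len(y_in)
    return [
        sum(y_in[i] * subnet(params, position(i, n_in), position(j, n_out))[1]
            for i in range(n_in))
        for j in range(n_out)
    ]


def sse(params, y_in, target):
    y_out = forward(params, y_in, len(target))
    return 0.5 * sum((y - d) ** 2 for y, d in zip(y_out, target))   # Eq. (6)


def gradients(params, y_in, target):
    """Sum the per-pair derivatives over all (i, j), as in Eq. (9)."""
    W1, b1, W2, b2 = params
    n_in, n_out, n_hid = len(y_in), len(target), len(b1)
    y_out = forward(params, y_in, n_out)
    gW1 = [[0.0, 0.0] for _ in range(n_hid)]
    gb1, gW2, gb2 = [0.0] * n_hid, [0.0] * n_hid, 0.0
    for j in range(n_out):
        for i in range(n_in):
            p = (position(i, n_in), position(j, n_out))
            hidden, _ = subnet(params, p[0], p[1])
            r_out = (y_out[j] - target[j]) * y_in[i]      # Eq. (7), linear output
            gb2 += r_out
            for k in range(n_hid):
                gW2[k] += r_out * hidden[k]
                # Eq. (8): propagate to the hidden layer (sigmoid derivative).
                r_hid = r_out * W2[k] * hidden[k] * (1.0 - hidden[k])
                gb1[k] += r_hid
                gW1[k][0] += r_hid * p[0]
                gW1[k][1] += r_hid * p[1]
    return gW1, gb1, gW2, gb2
```

A sketch like this is easiest to trust when its result is compared against a central finite difference of `sse` for a few individual parameters.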

5 Results

The effectiveness of the developed neural network was examined by implementing and analyzing global image transformations: scaling, rotation and translation. The popular CIFAR-10 collection was used in the experiments [12]. This dataset was primarily intended for testing classification models, but it is also a useful source of images for other applications. The collection contains various types of scenes and consists of 60000 color images of 32 × 32 pixels.


Due to the large computational complexity of the learning process, calculations were performed on the PLGRID cluster, which was created as part of the PL-Grid (Polish Science Infrastructure for Scientific Research in the European Research Area) project. The PLGRID infrastructure provides computing power of over 588 teraflops and disk storage above 5.8 petabytes.

For each variant of the experiment, both the training set and the testing set contained only 50 images. Global image transformations were used to verify the performance of the designed system. The input of the network was the original image (1024 pixels). The expected response (the output of the network) was the image after the selected transformation.

The network was trained in batch mode using the RPROP method (resilient backpropagation) [13], because of the resistance of this method to the vanishing gradient problem. This phenomenon occurs in networks with a complex, multilayer structure, such as the developed network. For small training sets this method is more effective than the popular group of stochastic gradient descent methods because the learning process is more stable. The weights of the subnetwork were initialized randomly using the Nguyen-Widrow method [14]. Because the new network has very low susceptibility to overtraining, the stopping procedure based on a validation set was omitted from the learning procedure. The correctness of this decision was confirmed by the obtained results. The quality of the image transformations after 50, 100, 1000 and 5000 epochs was monitored during the learning process, as well as the changes in the MSE for both the training and the testing set. The same network parameters were applied to each variant of the experiment.

In the first stage of the research, the network was trained to perform vertical scaling of the image. The goal was to resize the image to 50% of its height.
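To illustrate why a sign-based method such as RPROP is insensitive to the magnitude of the gradient, the following sketch shows one common variant of the update rule (often called iRprop−). It is not taken from the paper: the function names are hypothetical, and the factors 1.2 and 0.5 and the step limits are commonly used defaults assumed here for illustration.

```python
def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One in-place update; only the sign of each batch gradient is used."""
    for n, g in enumerate(grads):
        agreement = g * prev_grads[n]
        if agreement > 0:        # same sign as before: enlarge the step
            steps[n] = min(steps[n] * eta_plus, step_max)
        elif agreement < 0:      # sign change: the minimum was overshot
            steps[n] = max(steps[n] * eta_minus, step_min)
            g = 0.0              # skip the move right after a sign change
        if g > 0:
            weights[n] -= steps[n]
        elif g < 0:
            weights[n] += steps[n]
        prev_grads[n] = g


def minimise_quadratic(iterations=200):
    """Toy usage: minimise (w0 - 3)^2 + (w1 + 1)^2 with full-batch gradients."""
    w, prev, steps = [0.0, 0.0], [0.0, 0.0], [0.1, 0.1]
    for _ in range(iterations):
        grads = [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
        rprop_step(w, grads, prev, steps)
    return w
```

Because the size of each move is controlled by a per-parameter step rather than by the gradient value, very small gradients in deep parts of the subnetwork still produce usable updates, which is the property referred to above.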
The main layer with correlated weights matched the dimensions of the images in the training and test sets: it had 1024 inputs and 1024 outputs, so that each neuron corresponded to one pixel in the 32 × 32 grid. The subnet calculating the weights of the main layer consisted of 3 layers containing 8, 4 and 1 neurons respectively, with a sigmoidal transition function. Thus a single sigmoid neuron, providing a value in the range [0, 1], is present at the output of the subnetwork. This is in line with the nature of the mapped transformations, in which there are no negative weight values and no values greater than 1. The convergence of the learning algorithm was measured using the mean squared error (per pixel) for each training epoch:

$$MSE = \frac{1}{S \, N^{(O)}} \sum_{k=1}^{S} \sum_{j=1}^{N^{(O)}} \left( y_{kj}^{(O)} - d_{kj} \right)^2 \tag{10}$$

where: S – number of examples in the set, $N^{(O)}$ – number of network outputs, $y_{kj}^{(O)}$ – j-th output of the network for the k-th example, $d_{kj}$ – desired value of the j-th output (pixel brightness) for the k-th example.

Figure 3 shows the decrease of the MSE during the learning process and a sample of images from the training set with the scaling results at successive stages of learning. It was observed that, after 1000 learning epochs, the outline of the correct transformation appeared. After 5000 epochs, the quality of the transformed image was


satisfactory, although there were still disturbances in the case of some images (see the fourth image, the plane). The black stains are areas where pixel values exceeded the limit value of 1. This is the result of a large proportion of bright pixels in the plane image in comparison with the other images. Based on the MSE graph, it can be concluded that the learning process could be continued further, which should result in further improvement of the quality of the transformation.
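For reference, the per-pixel MSE of Eq. (10) that is plotted in the figures can be computed as in the following minimal sketch. The function name `mse` and the nested-list data layout (one list of pixel values per example) are assumptions made for this illustration.

```python
def mse(outputs, targets):
    """Eq. (10): squared error averaged over S examples and N^(O) outputs."""
    n_examples = len(outputs)
    n_outputs = len(outputs[0])
    total = sum(
        (y - d) ** 2
        for out_row, tgt_row in zip(outputs, targets)
        for y, d in zip(out_row, tgt_row)
    )
    return total / (n_examples * n_outputs)


# Two examples with two "pixels" each: (0 + 1 + 0 + 0.25) / 4
print(mse([[0.0, 1.0], [0.5, 0.5]], [[0.0, 0.0], [0.5, 1.0]]))  # 0.3125
```

Dividing by both the number of examples and the number of outputs makes the value comparable between the training and testing sets and between images of different sizes.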

[Figure 3: source images and the network outputs after 50, 100, 1000 and 5000 epochs (EP), together with a plot titled "The MSE error for vertical scaling" showing the MSE (logarithmic axis) against the number of epochs (up to 5000) for the training set and the test set.]

Fig. 3. The MSE error and quality of transformed image for vertical scaling.

[Figure 4: source images and the network outputs after 50, 100, 1000 and 5000 epochs (EP), together with a plot titled "The MSE error for translation" showing the MSE (logarithmic axis) against the number of epochs (up to 5000) for the training set and the test set.]

Fig. 4. The MSE error and quality of transformed image for translation.


[Figure 5: source images and the network outputs after 50, 100, 1000 and 5000 epochs (EP), together with a plot titled "The MSE error for rotation" showing the MSE (logarithmic axis) against the number of epochs (up to 5000) for the training set and the test set.]

Fig. 5. The MSE error and quality of transformed image for rotation.

The same network was trained to perform an image translation. The quality of the obtained results was acceptable after 5000 epochs, so the learning process was terminated. During the learning process, a continuous decrease of the MSE was observed (Fig. 4) for both the training and the testing set. After 1000 learning epochs, the outline of the transformation appeared. The figure also shows the stages of the image transformation during the learning process.

Image rotation by 30° was the most difficult challenge for the network because of the necessity of mapping trigonometric relations. During the learning process, the decrease of the MSE was similar to the previous cases (Fig. 5). In the analyzed range there was no overtraining of the network. After 100 epochs an outline of the rotated image can be observed, but the picture itself is blurry. After 1000 epochs the image becomes clearer, and it is possible to recognize the shape and details of the source image in the transformed picture. As in the previous experiments, the quality of the obtained results was acceptable after 5000 epochs.

6 Conclusion

The proposed neural network represents a significant extension of the concept of a network with convolutional layers. It uses the CNN idea of similarity between the weights of individual neurons in a layer, but breaks with the concept of sharing them directly. Based on the observed correlation between the values of the weights, the coordinates of the neurons' inputs and the coordinates of the neurons themselves, it can be stated that the CWNN can implement transformations not available to a CNN. At the same time, the network retains the main advantage of the CNN, which is the small number of parameters that must be optimized in the learning process. This


paper proposes a new structure of the neural network and its learning method. The concept of the network has been verified by checking its ability to implement typical global pattern transformations. The results confirm the ability of the CWNN to perform the tested global transformations (scaling, translation and rotation). The presented research was conducted on a single-layer CWNN. Further research will focus on the creation of networks with multiple layers and on the ability to combine these layers with convolutional layers, as well as with standard fully connected layers. This should make it possible to develop new solutions in the area of deep networks and to obtain competitive results in more complex tasks.

Acknowledgments. This research was supported in part by PL-Grid Infrastructure under Grant PLGJAMA2017.

References

1. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS 2015 (Spotlight), vol. 2, pp. 2017–2025 (2015)
2. Ferreira, A., Giraldib, G.: Convolutional Neural Network approaches to granite tiles classification. Expert Syst. Appl. 84, 1–11 (2017)
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS-2012, pp. 1097–1105 (2012)
4. Zhang, Y., Zhao, D., Sun, J., Zou, G., Li, W.: Adaptive convolutional neural network and its application in face recognition. Neural Process. Lett. 43(2), 389–399 (2016)
5. Radwan, M.A., Khalil, M.I., Abbas, H.M.: Neural networks pipeline for offline machine printed Arabic OCR. Neural Process. Lett. (2017). https://doi.org/10.1007/s11063-017-9727-y
6. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556. http://arxiv.org/abs/1409.1556. Accessed 19 May 2018
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
8. Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional Neural Networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014)
9. Wang, Y., Zu, C., Hu, G., et al.: Automatic tumor segmentation with Deep Convolutional Neural Networks for radiotherapy applications. Neural Process. Lett. (2018). https://doi.org/10.1007/s11063-017-9759-3
10. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature 323(6088), 533–536 (1986)
11. Golak, S.: Induced weights artificial neural network. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 295–300. Springer, Heidelberg (2005). https://doi.org/10.1007/11550907_47
12. Cireşan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745 (2012)
13. Igel, C., Hüsken, M.: Empirical evaluation of the improved Rprop learning algorithms. Neurocomputing 50, 105–123 (2003)
14. Nguyen, D., Widrow, B.: Improving the learning speed of 2-layer neural networks by choosing initial values of adaptive weights. In: IJCNN, pp. III-21–26 (1989)

Compression-Based Clustering of Video Human Activity Using an ASCII Encoding

Guillermo Sarasa1(B), Aaron Montero1, Ana Granados2, and Francisco B. Rodriguez1

1 Grupo de Neurocomputación Biológica, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
2 CES Felipe II, Universidad Complutense de Madrid, Aranjuez, Madrid, Spain
http://arantxa.ii.uam.es/~gnb/

Abstract. Human Activity Recognition (HAR) from videos is an important area of computer vision research with several applications. There is a wide range of methods to classify human activities in video, though not without certain disadvantages such as computational cost, dataset specificity or low resistance to noise, among others. In this paper, we propose the use of the Normalized Compression Distance (NCD) as a complementary approach to video-based HAR. We have developed a novel ASCII video data format as a suitable format for applying the NCD to video. For our experiments, we have used the Activities of Daily Living Dataset to discriminate between several human activities performed by different subjects. The experimental results presented in this paper show that the NCD can be used as an alternative to classical analysis of video HAR.

Keywords: Data mining · Normalized Compression Distance · Clustering · Dendrogram · Image processing · Human Activity Recognition · Silhouette Coefficient · Similarity

1 Introduction

Human Activity Recognition (HAR) [4,6,31] from videos represents a relevant area of computer vision research. Its utility in many areas has increased the demand for broader analysis in the field, producing an increase in publications related to computer vision in HAR [4,26,31]. Some of its applications are human health care [17], video labeling [27,28], surveillance [21,26] and human-computer interaction [1,24], among others. There are many approaches in the literature that identify human activities from video with remarkable results. However, dealing with video implies solving certain issues that eventually lead to drawbacks in the final HAR video processing systems. Some examples are high computational costs, dataset specificity and dependency on the temporal sequence of movements.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 66–75, 2018. https://doi.org/10.1007/978-3-030-01421-6_7

Compression-Based Clustering of Video HAR in ASCII Encoding


Vision-based HAR can be summarized as a combination of extracting features from a sequence and discriminating between activities by means of a classification system. The most important difficulties of feature extraction in video processing are: (i) overlap and variability between and within classes, (ii) temporal differences between samples, (iii) impact and complexity of the environment and (iv) quality of the data. As an example of the first problem, a video may contain activities that include similar movements (e.g. reading and using a tablet) but can also include activities that are carried out differently by different people (e.g. cooking). This last case also provides examples of the second problem: the duration, repetition or even order of execution of an activity can differ greatly, causing variations in the temporal structure, or sequence, of the activity. Finally, the capability to identify the background depends on many factors such as color difference, movement of the camera, or even the quality of the recorded video. There is a considerable variety of methods in the literature that aim to solve these problems [4,6,31]. However, as mentioned above, the inherent drawbacks of these methods require additional adjustments before they can be used in a real-world application.

In this work, we aim to use compression algorithms as a parameter-free dissimilarity approach (among other reasons, see Sect. 2.1) to identify human activities in video files. The idea behind using a parameter-free method is to identify the relevant information without performing any low-level analysis of the data. This increases the applicability of the method (due to the lack of specificity and parameters) while decreasing its computational cost (which sometimes makes a system prohibitive for real-world implementations). Also, the use of compression distances over video data represents a novel application with considerable potential for video analysis. In this work, we have developed a video-to-ASCII processing method to locate the activity in the video files and convert it into objects suitable for a compression algorithm. In order to test the capabilities of this methodology, we have performed experiments over the Activities of Daily Living Dataset [22] (see Sect. 3). This dataset is composed of different videos of human activities, performed by different subjects. Each video is recorded from a fixed point of view and stored in Audio Video Interleave (AVI) format, using the Motion JPEG video codec. In our experiments we try to discriminate between each pair of activities, parsing each video into our ASCII video format and using a widely used compression distance (the so-called Normalized Compression Distance or NCD) together with hierarchical clustering. The results obtained using our methodology show good separability between most of the pairs of activities. These results suggest that this measure could be used as an alternative methodology for video HAR.

2 Methodology

As mentioned before, we have used the Activities of Daily Living Dataset [22]. This data set has been used in several studies on human activity recognition in


G. Sarasa et al.

the literature [2,20]. In this section we introduce compression distances (the methodology used in this work), the method for converting video streams into ASCII objects, and the clustering procedure used to measure the identification capabilities of the NCD.

Fig. 1. Examples of video activities, obtained from the Activities of Daily Living Dataset [22]. The five upper pictures belong to the activities labeled as: “answer phone”, “chop banana”, “dial phone”, “drink water” and “eat banana”. The five lower pictures belong to the activities labeled as: “eat snack”, “look up in phonebook”, “peel banana”, “use silverware” and “write on whiteboard”.

2.1 Normalized Compression Distances

Compression distances are dissimilarity measures that make use of compression algorithms to identify common properties between objects. These measures search for the information shared between files and use it to define how different, in general terms, two objects are. The Normalized Compression Distance (NCD) is a generalization, defined in [8,19], that expresses the distance between two objects x and y as the relation between the size of each object compressed alone (C(x) and C(y)) and the size of their concatenation compressed (C(xy)). Hence, if the concatenation of two objects can be compressed better than each object alone, the objects share some information. The NCD is defined as:

$$\mathrm{NCD}(x, y) = \frac{\max\{C(xy) - C(x),\; C(yx) - C(y)\}}{\max\{C(x),\; C(y)\}},$$

where C is a compression algorithm and C(x) and C(xy) are the sizes of the C-compressed versions of x and of the concatenation of x and y, respectively. The NCD has been used in different areas of knowledge, with remarkable results, due to its high noise tolerance and its wide applicability to different types of data (audio, images, text, etc.). Among many other applications, compression distances have been used for document clustering [13,14], spyware and phishing detection [7,18], image analysis [10,11,16,29], earth observation [5,15] and music clustering [12,25].

Because compression distances are based on the ability of a compressor to identify similar features in large amounts of data, one would expect that video data should not be an exception. However, the video codecs used to store


video streams (sequences of images) in video files already compress the information. In contrast to a text book or a bitmap picture, where the information is fully accessible, a video file contains the information in compressed form, which makes its identification by a compression algorithm almost impossible. The way in which the information is compressed depends on the codec used for the video file. In the data used in this paper, each video sequence is stored using the Motion JPEG codec (an intra-frame codec), which compresses each frame individually as a separate image. This, however, is not the only issue that the NCD has with video objects. The high percentage of noise and the large heterogeneity of sizes, among others, are further drawbacks of applying the NCD directly to the video format. For all these reasons, we propose a novel ASCII video representation in order to mitigate some of these drawbacks.

2.2 Data Format: From Video to ASCII

In order to transform the activity videos into a format appropriate for compression algorithms, we have developed a video preprocessing method. The aim of this process is to extract the optical flow [3] of the video objects and to obtain the motion signature of each task that takes place in them. This motion signature is then encoded in ASCII format to be analyzed by the compressor. This encoding reduces the size of the original video files from 14.4–211 MB to a fixed 17 KB for the ASCII format (which also solves the size problem mentioned before). The video preprocessing consists of the following steps:

1. We extract 10 video frames from the video, equally separated in time, and convert them to grayscale, see panel (a) in Fig. 2.
2. We calculate the optical flow (using the Horn–Schunck method [30]) of the selected frames and apply a threshold to obtain the image points with greater activity, see panel (b) in Fig. 2.
3. We divide the image into boxes of binary values (1 = movement, 0 otherwise) and calculate the total activity produced in each of them. This generates an activity map, see panel (c) of Fig. 2. The dimensions of the boxes are 16 × 16 pixels.
4. We obtain the motion signature by adding the different activity maps into a single one, see panel (d) in Fig. 2.
5. We assign identifiers to each of the image boxes using a diagonal zigzag order (as used in image encoding such as MPEG [23]), see panel (e) in Fig. 2.
6. Once the boxes are organized by means of the identifiers, we sort them according to the total activity (given by the optical flow) of each box. This is the information that is stored in an ASCII file and later analyzed by the NCD, see panel (f) in Fig. 2.

2.3 Clustering of ASCII Objects Using String Compression

Once the video objects have been parsed into our proposed ASCII video objects, it is necessary to define a methodology to measure the effect of the NCD on


Fig. 2. Video preprocessing for conversion from the original AVI to the ASCII format. First, we extract a certain number of video frames and convert them to grayscale, panel (a). Second, we calculate the optical flow of these frames and apply a threshold to them to obtain the image points with greater activity, panel (b). Third, we divide the image into boxes and calculate the total activity produced in each of them in order to generate an activity map, panel (c). Subsequently, we sum the different activity maps to obtain the motion signature, panel (d). Finally, we read the image matrix (motion signature) in zigzag order, panel (e), and sort the information as a function of the total activity of each of the boxes, panel (f).

these new ASCII objects. Because the NCD only reports a distance between two objects, we make use of a hierarchical clustering algorithm (based on the MQTC algorithm [8] from the CompLearn toolkit [9]) to turn the NCDs between objects into a dendrogram. For instance, given a set of ASCII video objects belonging to two of the classes of Fig. 1, we can measure the NCD between every pair of files and transform the resulting distances into a hierarchical dendrogram. Finally, in order to measure how well each class is separated, we use the Silhouette Coefficient (SC) (detailed in [14]) as an unbiased clustering quality measure.
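The last two preprocessing steps (zigzag identifiers and sorting by activity) and the NCD of Sect. 2.1 can be sketched together in Python. This is an illustration under several assumptions: the exact layout of the ASCII file is not specified in the text, so the sketch simply writes the box identifiers, space separated, in descending order of activity with ties broken by identifier; `zlib` stands in for whichever compressor was actually used; and all function names are hypothetical.

```python
import zlib


def zigzag_order(rows, cols):
    """Box coordinates in diagonal zigzag order, starting at the top-left box."""
    order = []
    for s in range(rows + cols - 1):
        diagonal = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        # Alternate the traversal direction on every anti-diagonal.
        order.extend(diagonal if s % 2 else diagonal[::-1])
    return order


def encode_signature(activity):
    """Motion signature (2-D list of box activities) -> ASCII bytes."""
    rows, cols = len(activity), len(activity[0])
    ids = {box: n for n, box in enumerate(zigzag_order(rows, cols))}
    ranked = sorted(ids, key=lambda box: (-activity[box[0]][box[1]], ids[box]))
    return " ".join(str(ids[box]) for box in ranked).encode("ascii")


def compressed_size(data):
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance as defined in Sect. 2.1."""
    cx, cy = compressed_size(x), compressed_size(y)
    return max(compressed_size(x + y) - cx, compressed_size(y + x) - cy) / max(cx, cy)


def distance_matrix(objects):
    """All pairwise NCDs, the input expected by a hierarchical clustering."""
    return [[ncd(a, b) for b in objects] for a in objects]


print(encode_signature([[0, 5], [3, 0]]))  # b'1 2 0 3'
```

The resulting matrix can be passed to any hierarchical clustering routine. Note that on very short inputs the compressor overhead dominates, so NCD values are only informative for objects of reasonable size.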

3 Experiments and Results

For our experiments we have used all the data provided by the Activities of Daily Living Dataset [22] to measure the capabilities of our methodology. This dataset includes 10 different tasks, each performed 3 times by 5 different subjects. The objective of these experiments is to discriminate between two sets of 15 objects each, from two classes of the videos of Fig. 1. In that figure, we show a representative frame of each class along with the names of the different tasks to classify. As an example to illustrate the complexity of this problem, in Fig. 3 we


show 10 samples of processed activity maps before the zigzag sort (described in Sect. 2.2) and the dendrogram produced by the NCD-driven clustering over the 30 video objects (15 of each class). The left figures are obtained from videos of two activities performed by 5 different subjects. In this figure, one can see that the activity classes have different signatures, but are not easily differentiable at first sight. In order to identify these signatures we made use of an NCD-driven clustering (described in Sect. 2.3) which, as the dendrogram on the right of the figure shows, identifies the two classes perfectly.

[Figure 3: activity heatmaps for the two classes and the dendrogram of the 30 video objects, whose leaves are labelled chopBanana and useSilverware. The tree score reported in the figure is S(T) = 0.984799.]

Fig. 3. Sample activity maps for Chop banana and Use silverware, for different subjects. Each heatmap is produced by the process described in Sect. 2.2 up to, but not including, the zigzag sort, which is equivalent to panel (d) of Fig. 2. The right heatmaps, A and B, belong to Use silverware and Chop banana, respectively. As we can see, the classes are not easily differentiable at first sight. The dendrogram in the figure shows how well our method identifies each activity for all the subjects' samples. The Silhouette Coefficient in this case is 0.51.

In Fig. 4 one can see that the proposed format, together with the NCD, yields remarkable task identification results for the majority of task pairs. However, some tasks are more difficult to identify than others. For example, while “chopBanana” and “eatSnack” are very well separated, “peelBanana” and “eatSnack” are not. For the first case (“chopBanana” and “eatSnack”), Fig. 5 shows the dendrograms corresponding to the field marked with an X in Fig. 4, with and without our video-to-ASCII process (right and left dendrograms, respectively). One can see that the clustering is only achieved in the right dendrogram, where all the video objects have been converted into activity ASCII objects. Thus, the conversion of the video objects proves to be essential to the analysis.
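The separation quality discussed above is expressed as a Silhouette Coefficient. The following sketch computes the standard silhouette from a full distance matrix and the class labels. It is given only as an illustration: the paper evaluates the coefficient on dendrograms as detailed in [14], which may differ from this pairwise-distance formulation, and the function name is hypothetical.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette over all objects, from pairwise distances and labels."""
    scores = []
    for i, own in enumerate(labels):
        same = [dist[i][j] for j, lab in enumerate(labels) if lab == own and j != i]
        if not same:                 # a singleton class gets a silhouette of 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)    # mean distance to the object's own class
        b = min(                     # mean distance to the nearest other class
            sum(dist[i][j] for j, lab in enumerate(labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two classes of two objects: intra-class distance 0.2, inter-class distance 0.8.
dist = [
    [0.0, 0.2, 0.8, 0.8],
    [0.2, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
]
print(silhouette_coefficient(dist, ["chop", "chop", "eat", "eat"]))  # close to 0.75
```

Values near 1 indicate that objects are much closer to their own class than to the other class, values near 0 indicate overlapping classes, and negative values indicate that objects are on average closer to the other class.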


Fig. 4. Color map comparison of the clustering quality obtained in the different experiments. Each point of the map corresponds to the Silhouette Coefficient obtained by parsing the videos into our video format (described in Sect. 2.2) and applying an NCD-driven clustering (described in Sect. 2.3). The diagonal of the matrix is not defined because a task cannot be compared with itself. The dendrogram of the fields marked with an X is depicted in the right panel of Fig. 5.

[Figure 5: two dendrograms whose leaves are labelled chopBanana and eatSnack. Left panel: "Original videos (AVI) (before our method)". Right panel: "Parsed videos (ASCII) (after our method)". The tree score shown in the figure is S(T) = 0.991937.]

Fig. 5. Sample dendrograms produced by the clustering of the activities Chop banana and Eat snack, for the original files and the processed files. One can easily see that both activities are well separated in the right dendrogram (where the videos are transformed into our proposed format), while the left dendrogram (obtained from the original videos) shows almost no separability. Additionally, the Silhouette Coefficient for these dendrograms is 0.546 and 0.123, respectively. The right dendrogram corresponds to the fields marked with an X in Fig. 4.

4 Conclusions

The approach presented in this work aims to identify different human activities from video sequences while addressing some of the drawbacks of classical systems. To do so, we adapt a generic, low-cost and parameter-free methodology, compression distances, to our specific case by means of a video ASCII format. In particular, we have used the well-known Normalized Compression Distance (NCD). In order to use the NCD over video streams, we defined a video-to-ASCII conversion methodology, which allows us to apply compression distances to video objects with successful results. In this manner, the activity of the video samples is located and cast into text files based on its location in the video frames. Our assumption is that each activity should be expressed with a particular movement signature which, on average, should be shared among various subjects. To corroborate this assumption, we have tested this methodology over different video samples using the Activities of Daily Living Dataset [22]. The results presented in this paper show that applying our methodology produces a remarkable clustering across the dataset, which suggests that the NCD can be applied successfully to video HAR. In the same vein, Fig. 4 shows that the majority of the activities, for this specific database, are well identified while only a minority are not. This means that some pairs of activities are too similar for this analysis to discriminate which videos belong to each activity. With this approach, we achieved reasonable results without taking into consideration the particularities of the dataset.
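As a reminder of how little machinery a compression distance needs, the NCD of [8] can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used in the experiments: zlib stands in for whichever compressor was actually employed, and the two byte strings are hypothetical stand-ins for parsed ASCII activity files.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for two ASCII-encoded activity files.
a = b"chop " * 200
b = bytes(range(256)) * 4

print(ncd(a, a))  # small: a file shares all of its structure with itself
print(ncd(a, b))  # much larger: the two files share little structure
```

A matrix of such pairwise distances is what a hierarchical clustering method would then consume to build dendrograms like those of Fig. 5.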


As future work we plan to test and improve our new format over different data sets. In the same vein, we intend to produce alternative video-to-ASCII formats to measure different characteristics of the video activity and thereby add robustness to the system (redundancy). Measuring the vector movement (instead of the activity index) or segmenting the video into multiple ASCII files are examples of possible alternatives to our method. In summary, we expect to improve the capabilities of the methodology presented in this work by exploring different compression algorithms, conversion methodologies and video representations.

Acknowledgment. This work was funded by the Spanish MINECO/FEDER projects TIN2014-54580-R and TIN2017-84452-R (http://www.mineco.gob.es/).

References

1. Akkaladevi, S.C., Heindl, C.: Action recognition for human robot interaction in industrial applications. In: 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), pp. 94–99, November 2015
2. Avgerinakis, K., Briassouli, A., Kompatsiaris, I.: Recognition of activities of daily living for smart home environments. In: 2013 9th International Conference on Intelligent Environments, pp. 173–180, July 2013
3. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Comput. Surv. 27(3), 433–466 (1995)
4. Bux, A., Angelov, P., Habib, Z.: Vision based human activity recognition: a review. In: Angelov, P., Gegov, A., Jayne, C., Shen, Q. (eds.) Advances in Computational Intelligence Systems. AISC, vol. 513, pp. 341–371. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-46562-3_23
5. Cerra, D., Datcu, M.: Expanding the algorithmic information theory frame for applications to earth observation. Entropy 15(1), 407–415 (2013)
6. Chaaraoui, A.A., Climent-Pérez, P., Flórez-Revuelta, F.: A review on vision techniques applied to Human Behaviour Analysis for Ambient-Assisted Living. Expert. Syst. Appl. 39(12), 10873–10888 (2012)
7. Chen, T.C., Dick, S., Miller, J.: Detecting visually similar web pages: application to phishing detection. ACM Trans. Internet Technol. 10(2), 5:1–5:38 (2010)
8. Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)
9. Cilibrasi, R., Cruz, A.L., de Rooij, S., Keijzer, M.: CompLearn Home. CompLearn Toolkit. http://www.complearn.org/
10. Cohen, A.R.: Extracting meaning from biological imaging data. Mol. Biol. Cell 25(22), 3470–3473 (2014)
11. Cohen, A., Bjornsson, C., Temple, S., Banker, G., Roysam, B.: Automatic summarization of changes in biological image sequences using algorithmic information theory. IEEE Trans. Pattern Anal. Mach. Intell. 31(8), 1386–1403 (2009)
12. González-Pardo, A., Granados, A., Camacho, D., de Borja Rodríguez, F.: Influence of music representation on compression-based clustering. In: IEEE World Congress on Evolutionary Computation, pp. 2988–2995 (2010)
13. Granados, A., Cebrian, M., Camacho, D., de Borja Rodriguez, F.: Reducing the loss of information through annealing text distortion. IEEE Trans. Knowl. Data Eng. 23(7), 1090–1102 (2011)
14. Granados, A., Koroutchev, K., de Borja Rodríguez, F.: Discovering data set nature through algorithmic clustering based on string compression. IEEE Trans. Knowl. Data Eng. 27(3), 699–711 (2015)
15. Gueguen, L., Datcu, M.: A similarity metric for retrieval of compressed objects: application for mining satellite image time series. IEEE Trans. Knowl. Data Eng. 20(4), 562–575 (2008)
16. Guha, T., Ward, R.K.: Image similarity using sparse representation and compression distance. IEEE Trans. Multimed. 16(4), 980–987 (2014)
17. Khan, Z.A., Sohn, W.: Abnormal human activity recognition system based on R-transform and kernel discriminant technique for elderly home care. IEEE Trans. Consum. Electron. 57(4), 1843–1850 (2011)
18. Lavesson, N., Axelsson, S.: Similarity assessment for removal of noisy end user license agreements. Knowl. Inf. Syst. 32(1), 167–189 (2012)
19. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)
20. Liu, M., Chen, C., Liu, H.: Time-ordered spatial-temporal interest points for human action classification. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 655–660, July 2017
21. Maddalena, L., Petrosino, A.: A self-organizing approach to background subtraction for visual surveillance applications. IEEE Trans. Image Process. 17(7), 1168–1177 (2008)
22. Messing, R., Pal, C., Kautz, H.: Activity recognition using the velocity histories of tracked keypoints. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 104–111, September 2009
23. Qiao, L., Nahrstedt, K.: Comparison of MPEG encryption algorithms. Comput. Graph. 22(4), 437–448 (1998)
24. Roitberg, A., Perzylo, A., Somani, N., Giuliani, M., Rickert, M., Knoll, A.: Human activity recognition in the context of industrial human-robot interaction. In: 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–10, December 2014
25. Sarasa, G., Granados, A., Rodriguez, F.B.: An approach of algorithmic clustering based on string compression to identify bird songs species in xeno-canto database. In: 2017 3rd International Conference on Frontiers of Signal Processing (ICFSP), pp. 101–104, September 2017
26. Wang, X.: Intelligent multi-camera video surveillance: a review. Pattern Recognit. Lett. 34(1), 3–19 (2013)
27. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In: 2011 International Conference on Computer Vision, pp. 1419–1426, November 2011
28. Yan, Y., Ricci, E., Liu, G., Sebe, N.: Egocentric daily activity recognition via multitask clustering. IEEE Trans. Image Process. 24(10), 2984–2995 (2015)
29. Yu, T., Wang, Z., Yuan, J.: Compressive quantization for fast object instance search in videos. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 726–735, October 2017
30. Zhang, G., Chanson, H.: Application of local optical flow methods to high-velocity free-surface flows: validation and application to stepped chutes. Exp. Therm. Fluid Sci. 90, 186–199 (2018)
31. Zhang, S., Wei, Z., Nie, J., Huang, L., Wang, S., Li, Z.: A review on human activity recognition using vision-based method. J. Healthc. Eng. 2017 (2017)

Medical/Bioinformatics

Deep Autoencoders for Additional Insight into Protein Dynamics

Mihai Teletin1(✉), Gabriela Czibula1, Maria-Iuliana Bocicor1, Silvana Albert1, and Alessandro Pandini2

1 Babes-Bolyai University, Cluj-Napoca, Romania
[emailprotected], {gabis,iuliana,albert.silvana}@cs.ubbcluj.ro
2 Institute of Environment, Health and Societies, Brunel University London, London, UK
[emailprotected]

Abstract. The study of protein dynamics through analysis of conformational transitions represents a significant stage in understanding protein function. Using molecular simulations, large samples of protein transitions can be recorded. However, extracting functional motions from these samples is still not automated and is extremely time-consuming. In this paper we investigate the usefulness of unsupervised machine learning methods for uncovering relevant information about protein functional dynamics. Autoencoders are explored in order to highlight their ability to learn relevant biological patterns, such as structural characteristics. This study aims to provide a better comprehension of how protein conformational transitions evolve in time, within the larger framework of automatically detecting functional motions.

Keywords: Protein molecular dynamics · Autoencoders · Unsupervised learning

1 Introduction

Proteins are large biomolecules that play crucial roles in the proper functioning of organisms. They are synthesized using information contained within ribonucleic acid (RNA): in the process known as translation, building blocks, the amino acids, are chained together in a sequence. Although this sequence is linear, the protein acquires a complex arrangement in its physiological state, as intramolecular forces between the amino acids and the hydrophobic effect lead to a folding of the protein into its three-dimensional shape, which determines the protein’s function [27]. The stable three-dimensional structure of a protein is unique; however, this shape undergoes significant changes to deliver its biological function, according to various external factors from the protein’s environment (e.g. temperature, interaction with other molecules). Thus, a protein will acquire a limited number of conformations during its lifetime, having the ability to transition between alternative conformations [26]. The study and prediction of conformational transitions represents a significant stage in understanding protein function [21].

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 79–89, 2018. https://doi.org/10.1007/978-3-030-01421-6_8

In this paper we investigate protein molecular motions and conformational transitions starting from the structural alphabet devised by Pandini et al., a representation which provides a highly informative encoding of proteins [22]. In this description, each fragment consists of 4 residues and is defined by three internal angles: two pseudo-bond angles between the Cα atoms (Cα is the first carbon atom that attaches to a functional group) of residues 1-2-3 and 2-3-4, and one pseudo-torsion angle formed by atoms 1-2-3-4 [22]. These internal angles entirely define each structural fragment, which can also be encoded as a letter from a Structural Alphabet (SA) [22]. In addition to the previously mentioned representation based on angles, we investigate whether enhancing the structural alphabet states (represented by the three angles) with relative solvent accessibility information might bring further insight into the matter at hand. The relative solvent accessibility (RSA) of an amino acid residue is a value indicating the degree to which the residue is exposed [20], and it can characterize the spatial distribution of amino acids in a folded protein. RSA is significant for predicting protein-interaction sites [20] and it is used in protein family classification [1]. The intuition is that, even if RSA values alone do not offer a unique characterization of a protein, being individually non-specific, new structural states defined by the three angles together with RSA values could bring additional information.

Using molecular simulations, large samples of protein transitions can be recorded. However, extracting functional motions from these samples is still not automated and is extremely time-consuming. Therefore, we consider that computational methods such as unsupervised learning could be a well-suited solution for better understanding protein dynamics.
We are investigating the usefulness of deep autoencoder neural networks for acquiring a clearer sense of proteins’ structure, with the long-term goal of learning to predict proteins’ conformational transitions. Several approaches have been proposed in the literature for analyzing and modeling protein structural conformations using both supervised and unsupervised machine learning techniques. The performance of support vector machines was tested in [14] by classifying gene function from heterogeneous protein data sets and comparing the results with various kernel methods. In [28], a Radial Basis Function Network (RBFN) is proposed for classifying protein sequences. Fifteen supervised learning algorithms were evaluated in [9] by automating protein structural classification from pairs of protein domains, and Random Forests were shown to outperform the others. Additional insight into protein molecular dynamics (MD) is gained in [16] by employing L1-regularized reversible Hidden Markov Models. Self-organizing maps have also been used alongside hierarchical clustering in [6], for the purpose of clustering molecular dynamics trajectories. A methodology for detecting similarity between three-dimensional structures of proteins was introduced by Iakavidou et al. in [8].

The contribution of the paper is twofold. Our first main goal is to investigate the capability of unsupervised learning models, more specifically of autoencoders, to capture the internal structure of proteins represented by their conformational transitions. Secondly, we propose two internal representations for a protein (one using the structural alphabet states defined by three angle values, as introduced in [22], and one in which these states are extended with RSA information) with the aim of analyzing which of them is more informative and would drive an autoencoder to better learn structural relationships between proteins.

The experiments performed are aimed at evaluating the extent to which the combination of a reduced representation and an autoencoder is suitable to compress the complex MD data into a more interpretable representation. With this aim we propose a proof of concept that considers only two similar but unrelated proteins, where learning on one can be used on the other. The literature regarding protein data analysis reveals that a study similar to ours has not hitherto been performed. The study can be further extended on a large scale where evolutionary relationships are considered, with the goal of answering how much the “closeness” of proteins in evolutionary space can affect the efficiency of the encoding.

To sum up, in this paper we seek answers to the following research questions:

RQ1 What is the potential of autoencoders to learn the structure of proteins in an unsupervised manner, and how does the internal representation for a protein impact the learning process?
RQ2 Are autoencoders able to capture biologically relevant patterns? More specifically, are our computational findings obtained by answering RQ1 and RQ2 correlated with the biological perspective?

The remainder of the paper is organized as follows. The autoencoder model used in our experiments is described in Sect. 2. Section 3 provides our methodology and Sect. 4 contains the results of our experiments, as well as a discussion regarding the obtained results, both from a computational and biological perspective. The conclusions of our paper and directions for future work are summarized in Sect. 5.

2 Autoencoders

Autoencoders have been successfully applied in different complex scenarios such as image analysis [13] and speech processing [5]. An autoencoder [7] is a feed-forward neural network. The input of the network is a real-valued vector $x \in \mathbb{R}^n$. An autoencoder is composed of two main components: (1) an encoder $g: \mathbb{R}^n \to \mathbb{R}^m$, $g(x) = h$, and (2) a decoder $f: \mathbb{R}^m \to \mathbb{R}^n$, $f(h) = \hat{x}$. The two components are stacked together, hence the goal of the autoencoder is to model a function $f(g(x)) \approx x$. We notice that the input and the label of the model are the same vector. Thus autoencoders may be considered self-supervised learning techniques. If $m < n$, the autoencoder is called undercomplete. We consider the learning process of autoencoders as minimizing a loss function $L(\hat{x}, x) = \frac{1}{n} \sum_{i=1}^{n} (\hat{x}_i - x_i)^2$. The optimization is performed using stochastic gradient descent with backpropagation.

One may notice that, taken literally, this objective asks the autoencoder to copy the input $x$ to the output. However, such a model would not be useful at all. In fact, the goal of the autoencoder is to come up with a useful representation of the data in the hidden state $h$. Good encoded values may be useful for various tasks such as information retrieval and data representation. A sparse autoencoder is a technique used to help the model avoid simply copying the input to the output by introducing a sparsifying penalty into the loss function. Usually this penalty is the L1 regularization on the encoded state. The penalty term is scaled using a small real number denoted as $\lambda$. Thus the employed loss becomes $L(\hat{x}, x) = \frac{1}{n} \sum_{i=1}^{n} (\hat{x}_i - x_i)^2 + \lambda \sum_{i=1}^{m} |h_i|$, where the second sum runs over the $m$ components of the encoded state $h$.
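The two loss functions of this section can be written out directly. The following is a minimal pure-Python sketch for a single input vector; the vectors x, x_hat and h are hypothetical values chosen only to make the arithmetic easy to follow, and the value of lam is exaggerated compared with the small λ used in practice.

```python
def reconstruction_loss(x_hat, x):
    """Mean squared error between the reconstruction x_hat and the input x."""
    return sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x)) / len(x)


def sparse_loss(x_hat, x, h, lam):
    """Reconstruction error plus an L1 penalty on the encoded state h."""
    return reconstruction_loss(x_hat, x) + lam * sum(abs(hi) for hi in h)


x = [1.0, 2.0, 3.0, 4.0]       # input vector (n = 4)
x_hat = [1.0, 2.0, 3.0, 2.0]   # hypothetical reconstruction
h = [0.5, -1.5]                # hypothetical encoded state (m = 2)

print(reconstruction_loss(x_hat, x))    # 1.0  (only the last entry differs: 2**2 / 4)
print(sparse_loss(x_hat, x, h, 0.5))    # 2.0  (1.0 + 0.5 * (0.5 + 1.5))
```

With a very small λ the penalty barely changes the loss value, but it still nudges the encoded state towards small magnitudes during training.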


Denoising autoencoders represent another technique to avoid the mere copying of the input data to the output layer, forcing the hidden layers to learn the most defining and most robust features of the input. To achieve this, a denoising autoencoder is fed stochastically corrupted input data and tries to reconstruct the original input data. Thus, in the case of denoising autoencoders the loss function to be minimized is $L(f(g(\tilde{x})), x)$, where the input given to the autoencoder is $\tilde{x}$, the input data corrupted by some form of noise [7]. Therefore, the autoencoder will not simply reproduce the input data, but will learn a significant representation of it. Various experiments have shown that autoencoders can outperform Principal Component Analysis (PCA) [7]. This is mainly because autoencoders are not restricted to linear mappings. One can consider that a single-layer autoencoder with a linear activation function has the same capacity as PCA. However, the capacity of autoencoders can be improved by tuning the complexity of the encoder and decoder functions.
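The denoising objective differs from the plain one only in what is fed to the network: the corrupted vector goes in, while the clean vector remains the target. A small pure-Python sketch makes this explicit; the Gaussian corruption, the helper names and the identity "network" are illustrative assumptions rather than the authors' implementation.

```python
import random


def corrupt(x, sigma, rng):
    """Return a copy of x with zero-mean Gaussian noise (std sigma) added."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def denoising_loss(encode, decode, x_clean, x_noisy):
    """MSE between the reconstruction of the noisy input and the clean input."""
    x_hat = decode(encode(x_noisy))
    return sum((a - b) ** 2 for a, b in zip(x_hat, x_clean)) / len(x_clean)


def identity(v):
    """A degenerate 'network' that merely copies its input."""
    return list(v)


rng = random.Random(0)
x = [0.2, -1.0, 0.7, 1.5]
x_noisy = corrupt(x, sigma=1.0, rng=rng)

# Copying is no longer a perfect solution: it is charged for the noise it lets through.
copy_loss = denoising_loss(identity, identity, x, x_noisy)
clean_loss = denoising_loss(identity, identity, x, x)
print(copy_loss > clean_loss)  # True
```

The comparison shows why the corruption discourages the trivial copy: the identity mapping has zero loss on clean input but a strictly positive loss once the input is corrupted.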

3 Methodology

In this section we present the experimental methodology used to support our assumption that autoencoders can capture, from a computational viewpoint, biologically relevant patterns regarding structural conformational changes of proteins. With the goal of answering the first research question formulated in Sect. 1, the experiments investigate the ability of an autoencoder to preserve the structure of a protein. Two types of representations are considered in order to identify the one that is best suited for the analysis we are conducting. These representations are detailed in Sect. 3.1.

3.1 Protein Representations

A protein is a macromolecule with a very flexible and dynamic innate structure [18] that changes shape due to both external changes from its environment and internal molecular forces. The resulting shape is a different conformation. For each conformation of a protein, two different representations of the local geometry of the molecule are used in our study.

The first representation of a protein’s conformation, which we call the representation based on angles (Angles), consists of conformational states given by the three types of angles mentioned in Sect. 1 [22]. In this representation, a conformation of k fragments (letters from the structural alphabet [22]) is represented as a 3k-dimensional numerical sequence. This sequence contains three angles for each fragment of the conformation.

The second way to represent a protein conformation, named in the following the combined representation (Combined), is based on enhancing the conformational states given by angles with the RSA values of the amino acid residues (see Sect. 1). In this second representation, a conformation of k states is visualized as a 4k-dimensional numerical vector. The first 3k positions of this vector contain the conformation’s representation based on angles, whereas the following k positions contain the RSA values.
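To make the two vector layouts concrete, the sketch below builds them from a list of Cα coordinates. Several details are assumptions made for illustration and are not stated in the text: the angles are computed from raw coordinates (the data used in the paper are already encoded as fragments), they are expressed in radians and stored fragment by fragment, the sign convention of the torsion is the common atan2 form, and one RSA value per fragment is supplied by the caller. All function names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def bond_angle(a, b, c):
    """Pseudo-bond angle at b (radians) formed by the points a-b-c."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against rounding


def torsion_angle(a, b, c, d):
    """Signed pseudo-torsion angle (radians) defined by the points a-b-c-d."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2))


def angles_representation(ca):
    """3k-dimensional vector: three angles for each 4-residue fragment."""
    vec = []
    for i in range(len(ca) - 3):
        a, b, c, d = ca[i:i + 4]
        vec += [bond_angle(a, b, c), bond_angle(b, c, d), torsion_angle(a, b, c, d)]
    return vec


def combined_representation(ca, rsa):
    """4k-dimensional vector: the 3k angles followed by k RSA values."""
    angles = angles_representation(ca)
    if 3 * len(rsa) != len(angles):
        raise ValueError("expected one RSA value per fragment")
    return angles + list(rsa)


# Hypothetical helix-like trace of 6 residues, giving k = 3 fragments.
ca = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(6)]
rsa = [0.10, 0.45, 0.80]
print(len(angles_representation(ca)), len(combined_representation(ca, rsa)))  # 9 12
```

With this sliding-window construction a chain of N residues yields k = N - 3 fragments, which is consistent with the 102-residue proteins and 99 fragments reported in Sect. 4.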


3.2 Autoencoder Architecture

In the current study we use sparse denoising autoencoders to learn meaningful, lower-dimensional representations of proteins’ structures, considering their conformational transitions. Hence, the loss function is computed as shown in Sect. 2, where $\hat{x} = f(g(\tilde{x}))$ and $\tilde{x}$ represents the corrupted input data. We chose a denoising autoencoder for our experiments because experimental measurements of biological processes and information generated by particle methods (e.g. MD simulations) can be noisy or subject to statistical errors.

We use such an autoencoder in order to reduce the dimensionality of our data. Considering that one of our purposes is to be able to visualize our data sets, all the techniques employed encode the protein representations into 2-dimensional vectors. The sparse denoising autoencoder learns a mapping function from an n-dimensional space (where n can have different values, according to the employed representation) to a 2-dimensional hidden state. We performed several experiments, with variable numbers of hidden layers and using various activation functions, in order to reduce dimensionality. More specifically, the activation functions we employed for the hidden layers are: rectified linear unit (ReLU), exponential linear unit (ELU) [4] and scaled exponential linear unit (SELU) [12]. As a regularization strategy, we use the dropout technique [24], with dropout rates in {0.1, 0.2, 0.3}. Since we have only 2 values in the encoded state, we use a small value for the hyperparameter $\lambda$: $10^{-6}$. The encoded values are then reconstructed using a similar decoding architecture.

Optimization of the autoencoder is achieved via stochastic gradient descent enhanced with the Adam optimizer [11]. We employ the algorithm in a minibatch setting with a batch size of 16. The batch size affects the performance of the model: large batch sizes are usually not recommended since they may reduce the capacity of the model to generalize. Adam is a good optimizer since it also deals with the adjustment of the learning rate. The data set is shuffled and 10% is retained for validation. We keep the best performing model on the validation phase by measuring the validation loss. The loss obtained on the validation set was 0.555 for 1P1L and 0.378 for 1JT8 for the ReLU activation function, with a 0.2 dropout rate.

Regarding the encoding architecture, we experimented with 2 and 3 hidden layers, containing different numbers of neurons (depending on the size of the input data), and each of the hidden layers benefits from batch normalization. The decoding architecture is similar, having the same dimensions for the hidden layers, but in reverse.

3.3 Evaluation Measures

In order to determine whether the representation learned by the autoencoder preserves the similarities found in the original protein data, we define the intra-protein similarity measure, IntraPS, which evaluates the degree of similarity between conformations within a protein. We use it as an indication of how well the intra-protein conformational relations are maintained in the lower-dimensional representation learned by the autoencoder. IntraPS is based on the cosine similarity measure, which is employed to evaluate the likeness between two conformations of a protein.


Cosine similarity (COS) is widely used as a measure for computing the similarity between gene expression profiles. It measures the similarity in direction between two vectors and is defined as the cosine of the angle between the high-dimensional vectors. To define the intra-protein similarity measure, we consider that a protein p is represented as a sequence of n conformations, i.e. $p = (c^p_1, c^p_2, \ldots, c^p_n)$. Each conformation $c^p_i$ of the protein is visualized as an m-dimensional numerical vector (i.e. the representation based on angles or the combined representation previously described). The intra-protein similarity of a protein $p = (c^p_1, c^p_2, \ldots, c^p_n)$, denoted as IntraPS(p), is defined as the average of the absolute cosine similarities between two consecutive conformations, i.e. $IntraPS(p) = \frac{1}{n-1} \sum_{i=1}^{n-1} |COS(c^p_i, c^p_{i+1})|$.

In computing the IntraPS measure, we decided to use the absolute values of the cosine between two conformations, since our assumption was that for protein data the relative strengths of positive and negative cosine values between RSA vectors are the same. This was confirmed in our experiments. For computing the similarity/dissimilarity between two protein conformational transitions, different methods were investigated (Euclidean distance, Pearson correlation, biweight midcorrelation) and the cosine similarity proved to be the most appropriate. Since the dimensionality of the original protein conformations is significantly reduced by the autoencoder (i.e. to two dimensions), the Euclidean distance, Pearson correlation and biweight midcorrelation are not good options for measuring the similarity: the Euclidean distance is larger between points in a high-dimensional space than in a two-dimensional one, and Pearson and biweight midcorrelation are not suitable in 2D (the correlation between two two-dimensional points is always ±1 whenever it is defined).
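The IntraPS definition translates almost literally into code. The sketch below is a pure-Python illustration; the function names are hypothetical and the three 2-dimensional points are made-up encoded conformations.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_ps(conformations):
    """Average absolute cosine similarity between consecutive conformations."""
    pairs = zip(conformations, conformations[1:])
    sims = [abs(cosine_similarity(a, b)) for a, b in pairs]
    return sum(sims) / len(sims)


# Three hypothetical 2-D encoded conformations (n = 3, so n - 1 = 2 pairs).
p = [(1.0, 0.0), (1.0, 1.0), (0.0, -2.0)]
print(intra_ps(p))  # about 0.7071: both pairs have |COS| = 1/sqrt(2)
```

The second pair has a negative cosine, so the example also shows the effect of taking absolute values before averaging.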

4 Results and Discussion

The experiments we performed to highlight the potential of deep autoencoders to capture the proteins’ structure are presented below, following the experimental methodology of Sect. 3. The proteins used in our study are described in Table 1, which gives a brief description of each protein together with its superfamily and sequence length. The proteins in Table 1 were chosen based on data availability (conformational transitions and RSA values) and on the fact that they have the same sequence length, which enables us to carry out our investigations related to RQ2 from Sect. 1.

Table 1. Proteins selected for analysis [2].

Protein | Description | Superfamily | Sequence length
1P1L | Component of sulphur-metabolizing organisms | 3.30.70.120 | 102
1JT8 | Protein involved in translation | 2.40.50.140 | 102

For both proteins, 10000 conformational transitions were recovered from the MoDEL database [17] (i.e. n = 10000), where each transition consists of a sequence of 99 fragments of the structural alphabet [22]. Thus, as described in Sect. 3.1, in the representation based on angles a conformation has a length of 297 (3 × 99), whereas in the combined representation a conformation is visualized as a 396-dimensional point (4 × 99). For both proteins, the two representations proposed in Sect. 3.1 are used. Before applying the autoencoder, the protein data sets are standardized, i.e. transformed to mean 0 and standard deviation 1. Furthermore, considering that the employed technique is a denoising autoencoder, the input data is corrupted by adding noise (random samples from a standard normal distribution).

4.1 Results

The experiment described below is conducted with the aim of answering our first research question, RQ1, and of investigating if and how the internal representation of a protein impacts the learning process. For each protein data set, we trained a number of denoising sparse autoencoders (Sect. 3.2). For the autoencoder we employed the Keras implementation available at [3]. The autoencoders presented in Sect. 3.2 are used to reduce the dimensionality of our data and to visualize the protein data sets. Figures 1 and 2 depict the visualization of the proteins from our data set using trained sparse denoising autoencoders. The axes in Figs. 1 and 2 represent the range of values obtained within the 2-dimensional encoding of the input data set (the values of the two hidden nodes representing the encoder output). Colours were added to better emphasize the representations of successive conformations.
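The preprocessing and the network described so far can be sketched as follows. This is an illustration under several assumptions, not the authors' code: standardization is taken to be per feature, the hidden-layer widths (128, 64, 32) are hypothetical, the corruption is expressed with a `GaussianNoise` layer, the sparsity term is an L1 activity regularizer on the 2-dimensional code, and `EarlyStopping` with `restore_best_weights=True` stands in for "keeping the best model on validation". Keras is imported lazily inside the functions, so the preprocessing helper runs without TensorFlow installed.

```python
import statistics


def standardize(rows):
    """Scale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; divide by 1 instead to avoid 0/0.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def build_autoencoder(input_dim, hidden=(128, 64, 32), code_dim=2,
                      dropout=0.2, lam=1e-6):
    """Sparse denoising autoencoder; the layer widths are hypothetical."""
    from tensorflow import keras  # lazy import: only needed to build a model
    layers = keras.layers

    inputs = keras.Input(shape=(input_dim,))
    # Corrupt the standardized input with N(0, 1) noise (active in training only).
    x = layers.GaussianNoise(1.0)(inputs)
    for units in hidden:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout)(x)
    # 2-D code with an L1 activity penalty scaled by lam (the sparsity term).
    code = layers.Dense(code_dim,
                        activity_regularizer=keras.regularizers.L1(lam))(x)
    x = code
    for units in reversed(hidden):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(input_dim)(x)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, code)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


def train(autoencoder, x, epochs=100):
    """Fit on a NumPy array x of standardized rows; the clean x is the target."""
    from tensorflow import keras
    stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)
    # validation_split takes the last 10% of x, so shuffle x beforehand.
    return autoencoder.fit(x, x, batch_size=16, epochs=epochs, shuffle=True,
                           validation_split=0.1, callbacks=[stop])
```

Under these assumptions, `build_autoencoder(396)` would correspond to the combined representation and `build_autoencoder(297)` to the representation based on angles; after training, calling `encoder.predict` on the standardized data would give the 2-dimensional points plotted in figures such as Figs. 1 and 2.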

Fig. 1. Visualization of protein 1JT8.

Fig. 2. Visualization of protein 1P1L.

The original data fed to the autoencoder for each protein represents a timely evolution of the protein’s structure (albeit for an extremely small interval of time - nanoseconds), considering its transitional conformations. From one conformation to another, the protein might remain unchanged, or certain parts of it might incur minor modiﬁcations. The autoencoders used to obtain these representations were trained on original data in its combined representation, they employ 6 hidden layers (3 for encoding and 3 for decoding), with ReLU as activation function, batch normalization and a dropout rate of 0:2. Nevertheless, we experimented with the representation based on angles, as well as with various combinations of parameters (number of neurons, layers, dropout rate, acti‐ vation functions), as described in Sect. 3.2 and all resulting plots denote an evolution of the data output by the autoencoder (henceforth referred to as encoded data), thus

86

M. Teletin et al.

suggesting that autoencoders are able to identify the most relevant characteristics of the original representations. The two dimensional representations of the proteins as captured by the autoencoders, illustrated in Figs. 1 and 2, reﬂect the autoencoder’s ability to accurately learn biological transitions. Successive conformations in the original data are progressively chained together in the autoencoder’s output data thus denoting a visual evolution. Figures 1 and 2 also show that the considered protein data are relevant for machine learning models, as it correctly captures biological chained events, by encoding successive conformations into points that are close in a 2-dimensional space. Further, to decide whether the autoencoder maintains the relationships found within the original data, we use the IntraPS measure. Thus, ﬁrst we compute these similarities for the original data and then for the two-dimensional data output by the autoencoder, for both considered representations. The results are shown in Table 2. For each protein, in addition to the values for the IntraPS measure, we also present the minimum (Min), maximum (Max) and standard deviation (Stdev) of the absolute values of cosine simi‐ larities between two consecutive conformations, for both representations. We mention that Min, Max and Stdev were computed using batches of 100 successive conformations. These results are also illustrated in Figs. 3 and 4, which show the comparative evolution of average IntraPS values for each 100 conformations in the 10000 conformations that characterize each considered protein. We notice that for both proteins 1JT8 and 1P1L the results output by the autoencoder (denoted by “Encoded data” in the images) are slightly larger, but, on average, particularly similar to the values computed for the orig‐ inal data. 
All these results suggest that the original protein conformations have a high degree of cosine similarity (highlighted in Table 2), which is preserved in the data produced by the autoencoder. Fig. 3 shows a spike in the encoded data that is not visible in the original data. Analyzing protein 1JT8, we observed that there is an event in the protein structure, but it occurs about 100 conformations before the spike, so this needs further investigation.

Table 2. IntraPS for proteins 1JT8 and 1P1L, using the two considered representations.

Protein 1JT8 1P1L

Original Encoded Original Encoded

Angles

Combined

0.9960 0.9939 0.9779 0.9912

0.9913 0.9985 0.9573 0.9962

Min/Max/Stdev (COS) Angles Combined 0.9894/0.9995/0.0023 0.9213/0.9999/0.0161 0.9593/0.9896/0.0064 0.9315/0.9999/0.0119

0.9843/0.9962/0.0022 0.9573/0.9999/0.0044 0.9464/0.9695/0.0054 0.9661/0.9999/0.0052

With regard to the internal representations used, we conclude that these do not seriously influence the learning process. This may be due to the significant reduction of data dimensionality (two dimensions). Still, for the combined representation, which is richer in information than the representation based on angles, slightly better results were obtained. As highlighted in Table 2, for both proteins, IntraPS values are larger for the encoded data and the standard deviation of the cosine similarities between two consecutive conformations is smaller as well. If the data were reduced to a higher dimensional

Deep Autoencoders for Additional Insight into Protein Dynamics


space, the RSA values might bring additional improvements, which is an interesting matter for future investigation.

Fig. 3. Protein 1JT8 (combined representation).

Fig. 4. Protein 1P1L (combined representation).

With the aim of answering research question RQ2, we analyze in the following the biological relevance of the computational results presented above. The molecular dynamics sampled by the ensemble of structures in the two data sets is consistent with small consecutive changes in the protein structure occurring on the nanosecond time scale. These changes are typical of the first stages of the functional motions and they are generally dominated by local transitions and significant resampling of the conformational space. The autoencoder is able to capture both these features, as demonstrated by the obtained results: changes are encoded in chained events that resample the conformational space effectively. In addition, there is evidence that evolutionarily related proteins are also similar in their functional motions [23].

The study performed in this paper, which aims to highlight the ability of autoencoders to uncover relevant information about protein dynamics, is new. Autoencoders have previously been used in the literature for protein structure analysis, but from perspectives which differ from ours. Autoencoders were shown to be effective for the analysis of protein internal structure in [15], where the authors initialized weights, refined them by backpropagation and mapped each layer's input back to itself in order to predict backbone Cα angles and dihedrals. In [10], autoencoders were employed to improve structure class prediction by representing the protein as a "pseudo-amino acid composition", meaning that the model consisted of normalized occurrences of each of the 20 amino acids in a protein, combined with the order of the amino acid sequence. The algorithm called DL-Pro [19] is designed for classifying predicted protein models as good or bad by using a stacked sparse autoencoder which learns from the distances between the Cα atoms of two residues. Sequence-based protein-protein interaction was also predicted using a sparse autoencoder in [25].

5 Conclusions and Further Work

We have conducted in this paper a study towards applying deep autoencoders for a better comprehension of protein dynamics. The experiments conducted on two proteins


highlighted that autoencoders are effective unsupervised models able to learn the structure of proteins. Moreover, we obtained empirical evidence that autoencoders are able to encode hidden patterns that are relevant from a biological perspective. Based on the study performed in this paper and on previous investigations regarding protein data analysis, we aim to advance our research towards predicting protein conformational transitions using supervised learning models. Furthermore, we plan to continue our work using a two-pronged strategy: from a biological viewpoint, we will consider other proteins and examine how their evolutionary relationships are reflected within the resulting data; computationally, we will investigate different settings for the sparse autoencoder used in our experiments (e.g. the model's architecture, different optimizers for the gradient descent) and we will apply variational and contractive autoencoders instead of sparse ones.

References

1. Asgari, E., Mofrad, M.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One (2015). https://doi.org/10.1371/journal.pone.0141287
2. Berman, H., et al.: The protein data bank. Nucleic Acids Res. 28, 235–242 (2000)
3. Chollet, F., et al.: Deep learning for humans (2015). https://github.com/fchollet/keras
4. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
5. Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: ACII, pp. 511–516. IEEE (2013)
6. Fraccalvieri, D., Pandini, A., Stella, F., Bonati, L.: Conformational and functional analysis of molecular dynamics trajectories by self-organising maps. Bioinformatics 12, 1–18 (2011)
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
8. Iakovidou, N., Tiakas, E., Tsichlas, K., Manolopoulos, Y.: Going over the three dimensional protein structure similarity problem. Artif. Intell. Rev. 42(3), 445–459 (2014)
9. Jain, P., Garibaldi, J.M., Hirst, J.: Supervised machine learning algorithms for protein structure classification. Comput. Biol. Chem. 33, 216–223 (2009)
10. Liu, J., Chi, G., Liu, Z., Liu, Y., Li, H., Luo, X.-L.: Predicting protein structural classes with autoencoder neural networks. In: CCDC, pp. 1894–1899 (2013)
11. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: NIPS (2017)
13. Le, Q.: Building high-level features using large scale unsupervised learning. In: ICASSP, pp. 8595–8598. IEEE (2013)
14. Lewis, D., Jebara, T., Noble, W.S.: Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics 22(22), 2753–2760 (2006)
15. Lyons, J., et al.: Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J. Comput. Chem. 35(28), 2040–2046 (2014)
16. McGibbon, R., Ramsundar, B., Sultan, M., Kiss, G., Pande, V.: Understanding protein dynamics with L1-regularized reversible hidden Markov models. In: ICML, pp. 1197–1205 (2014)
17. Meyer, T., et al.: MoDEL: a database of atomistic molecular dynamics trajectories. Structure 18(11), 1399–1409 (2010)
18. Moon, K.K., Jernigan, R.L., Chirikjian, G.S.: Efficient generation of feasible pathways for protein conformational transitions. Biophys. J. 83(3), 1620–1630 (2002)
19. Nguyen, S., Shang, Y., Xu, D.: DL-PRO: a novel deep learning method for protein model quality assessment. In: IJCNN, pp. 2071–2078. IEEE (2014)
20. Palmieri, L., Federico, M., Leoncini, M., Montangero, M.: A high performing tool for residue solvent accessibility prediction. In: Böhm, C., Khuri, S., Lhotská, L., Pisanti, N. (eds.) ITBAM 2011. LNCS, vol. 6865, pp. 138–152. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23208-4_13
21. Pandini, A., Fornili, A.: Using local states to drive the sampling of global conformations in proteins. J. Chem. Theory Comput. 12, 1368–1379 (2016)
22. Pandini, A., Fornili, A., Kleinjung, J.: Structural alphabets derived from attractors in conformational space. BMC Bioinform. 11(97), 1–18 (2010)
23. Pandini, A., Mauri, G., Bordogna, A., Bonati, L.: Detecting similarities among distant homologous proteins by comparison of domain flexibilities. Protein Eng. Des. Sel. 20(6), 285–299 (2007)
24. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent ANNs from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
25. Sun, T., Zhou, B., Lai, L., Pei, J.: Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 18(1), 277 (2017)
26. Tokuriki, N., Tawfik, D.: Protein dynamism and evolvability. Science 324(9524), 203–207 (2009). https://doi.org/10.1126/science.1169375
27. Voet, D., Voet, J.: Biochemistry, 4th edn. Wiley, Hoboken (2011)
28. Wang, D., Lee, N., Dillon, T.: Extraction and optimization of fuzzy protein sequences classification rules using GRBF neural networks. Neural Inf. Process. Lett. Rev. 1(1), 53–57 (2003)

Pilot Design of a Rule-Based System and an Artificial Neural Network to Risk Evaluation of Atherosclerotic Plaques in Long-Range Clinical Research

Jiri Blahuta, Tomas Soukup, and Jakub Skacel

Silesian University in Opava, The Institute of Computer Science, Bezruc Sq. 13, 74601 Opava, Czech Republic
http://www.slu.cz/fpf/en/institutes/the-institute-of-computer-science

Abstract. Early diagnostics and knowledge of the progress of atherosclerotic plaques are key factors in starting the most efficient treatment. Reliable prediction of the growth of atherosclerotic plaques could be a very important part of early diagnostics, both to judge the potential impact of the plaque and to decide whether immediate artery recanalization is necessary. For this pilot study we have a large set of measured data from a total of 482 patients. Each patient was examined every 6 months for at least 5 years; at each examination the width of the plaque on the left and right side was measured using an ultrasound B-image and stored in a database. The first part is focused on a rule-based expert system designed to evaluate whether immediate recanalization should be suggested according to the progress of the plaque. These results will be verified by an experienced sonographer. This system could be a starting point for designing an artificial neural network with adaptive learning based on image processing of ultrasound B-images for classification of the plaques using feature analysis. The principle of the network is based on edge detection analysis of the plaques using a feed-forward network with the Error Back-Propagation algorithm. Training and learning of the ANN will be time-consuming processes in long-term research. The goal is to create an ANN which can recognize the border of the plaques and measure their width. The expert system and the ANN are two different approaches; however, they can cooperate.

Keywords: Atherosclerotic plaque · Ultrasound · Expert system · B-image recognition · Rule-based system · Image processing with ANN

1 Atherosclerotic Plaques, Their Risk and Measurement

In general, atherosclerosis is one of the most important causes of mortality. Early diagnostics and prediction of atherosclerosis is a key part of modern medicine.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 90–100, 2018. https://doi.org/10.1007/978-3-030-01421-6_9

Rule-Based System and ANN for Evaluation of Atherosclerotic Plaques


This paper has two parts. The first part is focused on the design of a rule-based expert system which can be used to decide what next steps are needed depending on the progress of the plaque. This decision-making system is based on defined rules. The designed expert system should be a valuable tool for evaluating the progress of the plaques over a series of examinations. Early diagnostics of the plaques and reliable evaluation of their progress are two different but closely related tasks which help to avoid needless deaths and to start the optimal treatment. The second part of the paper is devoted to the design of a model of an artificial neural network (ANN) which could recognize the border of the plaque. The ANN should be designed as a feed-forward model with the Error Back-Propagation algorithm. This paper discusses an idea of how to create an ANN with supervised learning, one of many types of neural network models designed for image processing.

2 Input Data

For this study, a set of measured plaque widths from a total of 482 patients is used. This is a long-term study; each patient was examined for 5 years at regular intervals of 6 months. The plaque widths used in this study were measured from B-images, see Fig. 1. A more detailed description of the principles of B-imaging of the plaques is available in [1] and a general view of image processing approaches in medicine is available in [2].

Fig. 1. Measured width of the plaque on B-image

According to the stored data, there are different progress models of the plaque during the long-term study:

– stable plaque with no significant changes
– stable plaque with regular increasing/decreasing, no peaks
– unstable progress of the plaques, peaks
– highly unstable plaques with many peaks and extreme changes between examinations


J. Blahuta et al.

These four progress models are a starting point for deﬁnition of exact rules for the expert system.

3 Design of Rules Used in the Expert System

Input data represent the width of the plaque measured from the left (L) and right (R) side at regular intervals of 6 months, see Table 1.

Table 1. An example of measured width of the plaque (mm) for 4 patients

Patient  Side   1    2    3    4    5    6    7    8    9    10   11   12
1        L      4.6  3.6  4.7  4.1  4.6  4.2  4.9  4.3  4.3  4.3  3.9  4.1
1        R      3.6  2.5  3.2  3.2  3.4  3.6  3.8  3.9  3.8  3.8  3.2  3.2
2        L      2.7  4.3  4.1  4.0  4.8  4.8  4.8  5.7  5.7  4.2  N/A  N/A
2        R      3.1  4.2  4.2  4.2  4.3  4.5  4.5  3.4  5.0  5.5  N/A  N/A
3        L      3.3  2.3  2.2  2.6  2.2  2.5  2.5  2.7  2.7  2.7  3.4  3.6
3        R      3.0  3.0  2.5  2.3  2.8  2.7  2.7  2.5  2.5  2.5  2.5  2.7
4        L      2.0  2.4  2.4  2.3  3.4  2.6  2.6  2.6  2.6  2.6  2.6  2.7
4        R      4.3  4.3  4.3  4.3  4.3  4.3  4.3  2.8  3.4  3.3  3.3  3.7

The highlighted measurement was visually judged as erroneous. Let t1, t2, t3, ..., tn, where n = 10, be a series of examinations carried out during 5 years at regular intervals of 6 months. The system uses IF-THEN rules from which the final consequent is decided; it is a rule-based decision system which can be briefly described as follows. The rules are based on the following four criteria:

– maximum and minimum value of all measured data
– difference Δt between consecutive measurements, not taken in absolute value
– number of occurrences of a difference above or below a threshold value
– trend of the progress over 4 consecutive measurements (increasing or decreasing)

Because the difference Δt is not taken in absolute value, Δt > 0 means that the plaque width is growing and Δt < 0 means that the width of the plaque has decreased. The expert system is designed using the following exact if-then rules:

– Rule A: IF max(Δt) > 2 mm THEN ModerateRisk
– Rule B: IF count of Δt > 2 mm is at least 2 THEN ModerateRisk
– Rule C: IF min(Δt) < −2 mm THEN ModerateRisk
– Rule D: IF count of Δt < −2 mm is at least 2 THEN ModerateRisk
– Rule E: IF at least 4 consecutive differences Δt < 0 THEN ModerateRisk
– Rule F: IF at least 4 consecutive differences Δt > 0 THEN HighRisk
– Rule G: IF min(Δt) < −0.8 mm ∧ max(Δt) < 0.8 mm THEN LowRisk
– Rule H: IF no previous rules are applied THEN LowRisk (the plaques with no peaks)

So, there are 3 options (output variables) for recommended steps:

– LowRisk: no immediate steps are recommended
– ModerateRisk: check the plaque progress
– HighRisk: check if the measurement is correct (no error); immediate recanalization is strongly recommended

Combinations of rules produce the following outputs:

– A ∧ B THEN HighRisk
– E ∧ G THEN LowRisk
– F ∧ G THEN ModerateRisk

The inference engine of the system is designed to produce a reasoning on the rules. Table 2 lists the reasoning for each output variable. In the past, we designed a similar expert system for the evaluation of substantia nigra hyperechogenicity; the results were published in technical papers [3–5] and also in clinical studies [6,7].

Table 2. Output variables and their reasoning

Variable       Comment
LowRisk        no immediate steps are needed, the plaques seem stable
ModerateRisk   check the progress which could be starting point of a problem
HighRisk       critical growing of the plaque, high risk of stenosis and rupture

However, a sonographer can set more rules, their relations and reasoning; the system is extensible and modular.
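The rules of this section lend themselves to a direct implementation. The following is a minimal Python sketch, not the authors' system: it encodes Rules A to F, the fallback Rule H and the combination A ∧ B. Taking the most severe fired rule as the output is an assumption, since the text does not state how conflicting rules are resolved; Rule G and the combinations involving it are left out for the same reason. The names `Risk`, `differences`, `longest_run` and `evaluate` are hypothetical.

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MODERATE = 1
    HIGH = 2

def differences(widths):
    """Signed differences between consecutive width measurements (mm)."""
    return [b - a for a, b in zip(widths, widths[1:])]

def longest_run(diffs, predicate):
    """Length of the longest run of consecutive differences satisfying predicate."""
    best = run = 0
    for d in diffs:
        run = run + 1 if predicate(d) else 0
        best = max(best, run)
    return best

def evaluate(widths, peak=2.0, run_length=4):
    """Return (risk, fired_rules) for one series of at least two widths in mm.

    Missing measurements (N/A) must be removed by the caller.
    """
    d = differences(widths)
    fired = {}
    if max(d) > peak:
        fired["A"] = Risk.MODERATE
    if sum(x > peak for x in d) >= 2:
        fired["B"] = Risk.MODERATE
    if min(d) < -peak:
        fired["C"] = Risk.MODERATE
    if sum(x < -peak for x in d) >= 2:
        fired["D"] = Risk.MODERATE
    if longest_run(d, lambda x: x < 0) >= run_length:
        fired["E"] = Risk.MODERATE
    if longest_run(d, lambda x: x > 0) >= run_length:
        fired["F"] = Risk.HIGH
    if not fired:
        return Risk.LOW, ["H"]          # Rule H: no other rule applied
    risk = max(fired.values())           # assumption: most severe rule wins
    if "A" in fired and "B" in fired:
        risk = Risk.HIGH                 # combination A AND B
    return risk, sorted(fired)
```

For the series 3; 2.9; 2.8; 2.8; 2.6; 2.6; 2.6; 5.2; 5.2; 5.3, this sketch fires only Rule A and returns ModerateRisk. Because the thresholds are parameters, a sonographer's adjusted rules could be tried without changing the code.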

4 Evaluation of the Outputs

Depending on the outputs of the expert system, immediate recanalization may be recommended. The next step is to verify the reliability of the designed expert system with an experienced sonographer. Consider the following example. Let 3.1; 3.8; 3.8; 3.8; 3.8; 3.0; 3.0; 3.8; 2.4; 4.1; 3.2; 3.8; 3.8; 3.8; 3.5; 3.6; 3.8 be the measured widths input into the system. Maximum difference is 1.7 mm, the minimum difference is -0.3 mm. The plaque does not have at least 4 consecutive differences higher than 0. Rule H is applied because there are no significant peaks and extreme differences.


In the second example, 3; 2.9; 2.8; 2.8; 2.6; 2.6; 2.6; 5.2; 5.2; 5.3, the maximum difference is clearly 2.6 mm. Rule A is applied and the plaque is evaluated as moderate risk (ModerateRisk).

4.1 Adaptable Rules to Quality Improvement of the Results

One of the main advantages of this system is its adaptability, which allows the quality of results to be improved for more reliable diagnostics. Existing rules can be modified and new rules can be added. Thus, adding and modifying rules with the support of an experienced sonographer can lead to an expert system with high accuracy. Another way is to create an adjustable expert system; a user can modify rules depending on the measurement, e.g. set for high resolution, low resolution, different gamma correction, etc.

5 Using Neural Network in Long-Range Research

The designed expert system should be a helpful software tool for evaluating the progress of atherosclerotic plaques using a set of IF-THEN rules to decide the next steps, i.e. treatment, immediate recanalization, etc. All results must be analyzed by an experienced sonographer. If the system is considered reliable, the next step should be to create a model of an artificial neural network (ANN) as a learning platform which will be adapted using a training set with many examples of outputs and desired outputs. This is the second phase of this study and a second approach, different from the decision-making expert system. In 2016, the authors published a paper focused on different approaches to detecting atherosclerotic plaques in B-images [8].

5.1 An Idea How to Design ANN to Classification of Risk of the Plaques

The idea of the ANN is different from the principle of the expert system. The input data are B-images showing atherosclerotic plaques at different stages of progress, instead of stored numerical values. The goal of the ANN is to learn how to classify the plaques according to their width and other features. Fig. 1 shows the width of the plaque. There are key questions:

– What features should be used?
– How to distinguish the plaque from the artery wall?
– What accuracy of the ANN is acceptable for clinical studies?
– How to define the training set for classification learning?


We have the following idea of the ANN architecture:

– a feedforward multi-layer network with supervised learning
– Error Back-Propagation algorithm for minimization of the global error
– developed in MATLAB (with NN Toolbox) [9] or similar software for ANN modeling

Figure 2 shows an example of the ANN which could be used.

Fig. 2. ANN model with Error Back-Propagation principle

The input layer receives the B-image; each input is multiplied by a weight w, the j-th input by the weight w_j. The idea is to modify the weights depending on the computed network error. The learning of the network is based on comparing the error for each input. The partial error is PE = y_j − d_j, where y_j is the real output and d_j is the desired output for the j-th input. The global error is the sum of all partial errors. Thus, when a large set of examples is available in the training set, the network could learn many cases of plaque types. The crucial problem is to determine features which could be used to compute the width of the plaque for the final output. The designed ANN has the following properties:

– input in the form of an m × n matrix of the digitized B-image with detected edges
– hidden layers compute the edge detection algorithm and the visibility of the plaque
– output layer has 4 neurons for classification of the plaque (no visible plaque, low risk plaque, medium risk plaque and high risk plaque), similarly to the output of the designed ES

Edge Detection. Images are preprocessed using edge detection. The network is designed to evaluate plaques using features from edge detection.
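To make the preprocessing step more concrete, the following Python sketch applies the Prewitt operator to a grayscale image stored as a NumPy array and thresholds the gradient magnitude. It is only an illustration under stated assumptions: the paper proposes MATLAB or similar software, the threshold value is arbitrary, and the function names are hypothetical. The Kirsch operator is not shown.

```python
import numpy as np

# Prewitt kernels for horizontal and vertical intensity changes
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)
PREWITT_Y = PREWITT_X.T

def filter3x3(image, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel (valid region only)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * image[i:i + h - 2, j:j + w - 2]
    return out

def prewitt_edges(image, threshold):
    """Boolean edge map: True where the Prewitt gradient magnitude exceeds threshold."""
    image = np.asarray(image, dtype=float)
    gx = filter3x3(image, PREWITT_X)
    gy = filter3x3(image, PREWITT_Y)
    return np.hypot(gx, gy) > threshold
```

The resulting edge map is two pixels smaller than the input in each dimension, because border pixels without a full 3 × 3 neighborhood are dropped. Isolated bright pixels also produce responses, which matches the isolated-pixel problem the authors mention for edge detection.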


Fig. 3. Input B-image, extracted edges (with inverted colors) and the output for risk classification

Fig. 4. Prewitt and Kirsch operator with bordered edges to training the network

Edge detection could be an efficient way to recognize the border of the plaque in order to measure its width and evaluate the risk. There are two major problems:

– isolated pixels can be considered a part of the plaque
– the artery wall can also be considered a part of the plaque

The Kirsch or Prewitt operator could produce a well-bordered shape for the training and learning process. In Fig. 4, Prewitt and Kirsch edge detection with a border is applied to the images from Fig. 3.

Training and Learning Process. The principle is based on a training set of input and desired-output pairs, i.e. for each input the desired output used to teach the network is known. The training set should be supplemented by new examples. The learning of the network is based on error minimization, which depends on improving the training set. A sonographer must determine the error threshold for acceptable accuracy. In many applications, MSE (Mean Squared Error) is used to measure the network error and evaluate the accuracy of learning. For simplification, the following preconditions are required:

– all images with the same resolution, zoom level and section
– all images from the same section, e.g. cross-sectional
– all images have the same zoom level


Nevertheless, the design of the ANN is a time-consuming process until the network gives results reliable enough for clinical studies. The most difficult part is the selection of the most appropriate features in B-images, because there are many different types of plaques caused by fibrosis, calcification, inflammation and other factors, see Fig. 5.

Fig. 5. Diﬀerent types of the plaques on B-image

The Goal of the Learning. The goal of the learning process is to minimize the network error. A desired output is assigned to each input, and the partial error PE and the global error are computed. The training set contains well-bordered plaques of many types. The goal is to teach the ANN to recognize the border and measure the width according to the scale axis. To reach better accuracy the training set is supplemented with new examples. When the global error is lower than a determined value, the learning process ends and the ANN will work with the required accuracy.

5.2 First Experimental Results with ANN

As the first step, we constructed a simple feed-forward ANN with an implemented border detection algorithm. We used a set of 20 images with a significant plaque; for each image, a well-bordered plaque shape determined by an experienced sonographer was used. In testing, we use these images to check whether the border output by the ANN is good enough for correct classification, see Table 3. The results represent the first run of the network without a training/learning process on a large set of patterns.

Table 3. Experimental first results using untrained ANN

Edge/image 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Prewitt

T T F F T T F F F F

T T F

T F

Kirsch

T F F F F T T F T F

F

T T F

T F

F

T T F

F

F

T

F

F

where T (true) is an acceptable result and F (false) is a result rejected by the sonographer. Correct edge detection is key for reliable risk classification of the plaque. The untrained ANN shows an error rate > 50 % for both edge detection operators, which is strongly unsatisfactory for a clinical study. The aim is to train


the network to reach reliability > 85 % with a determined partial error between each output and desired output. To train the network we need to use at least 482 B-images from which the data for the designed expert system is extracted. Edge detection must be trained to recognize the shape of the plaque and to separate the artery wall, see Fig. 6. Brief description of the functionality of the ANN model:

– input of m × n neurons according to the image resolution
– transfer function is logistic sigmoid
– initial uniform weight distribution
– 100 epochs of training
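The properties listed above can be sketched as a small feed-forward network trained by error back-propagation. The Python/NumPy code below is a minimal illustration, not the authors' MATLAB model: the single hidden layer, its size, the learning rate, the weight range and the absence of bias terms are all assumptions, and the function names are hypothetical. It uses a logistic sigmoid transfer function, uniformly distributed initial weights and 100 training epochs, and it records the mean squared error of the partial errors y_j − d_j in every epoch.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-z))

def train(X, D, hidden=8, epochs=100, lr=0.5, seed=0):
    """Train a one-hidden-layer network with back-propagation.

    X: (samples, m*n) flattened input images; D: (samples, classes) desired outputs.
    Returns ((W1, W2), history) where history holds the MSE of each epoch.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(X.shape[1], hidden))   # uniform initial weights
    W2 = rng.uniform(-0.5, 0.5, size=(hidden, D.shape[1]))
    history = []
    for _ in range(epochs):
        H = sigmoid(X @ W1)              # hidden layer activations
        Y = sigmoid(H @ W2)              # real outputs y_j
        E = Y - D                        # partial errors y_j - d_j
        history.append(float(np.mean(E ** 2)))
        # Backward pass: propagate the error through the sigmoid derivatives
        delta2 = E * Y * (1.0 - Y)
        delta1 = (delta2 @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ delta2) / len(X)
        W1 -= lr * (X.T @ delta1) / len(X)
    return (W1, W2), history

def predict(weights, X):
    """Index of the most active output neuron for each input row."""
    W1, W2 = weights
    return np.argmax(sigmoid(sigmoid(X @ W1) @ W2), axis=1)
```

With four output neurons, the index returned by `predict` would correspond to one of the four plaque classes. Training would stop early if the recorded error fell below the threshold chosen by the sonographer; that check is omitted here for brevity.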

Until this accuracy is reached, the ANN must be modified (weights, number of hidden layers) or another ANN architecture must be used [10], e.g. a convolutional neural network (CNN) based on deep learning using GPU acceleration [11]. A CNN could be a very promising way to recognize the shape of the plaque, but it is a time-consuming problem. It is also possible to use the fuzzy neural network FUZNET [12] as a fuzzy-neural system for classification of the plaques. However, after trying many ANN models, it may turn out that the plaques cannot be recognized with adequate accuracy.

5.3 Cooperation of the ANN with Expert System

Even though the expert system and the ANN are considered different approaches, these systems can be closely related:

– the designed ES is focused on evaluating the progress risk of the plaque from data measured over 5 years
– the designed ANN is focused on recognizing the plaque in a B-image and evaluating the risk based on edge detection (width of the plaque)

Once the risk level is decided by the expert system, the same plaque can be compared with the output of the ANN (concordance of measured width).

6 Conclusions and Future Work

This study is focused on the application of two different approaches in neurology for early diagnostics of atherosclerosis. The first part is the design of a rule-based expert system which decides the risk level of the progress of atherosclerotic plaques from a large set of measured data. This system can be modular, with the option to add and/or modify the rules for better decisions. All outputs must be validated by an experienced sonographer. The second part is the design of an artificial neural network based on the Error Back-Propagation algorithm. The goal of the network is to compute the width of the plaque from a B-image using image feature analysis from edge detection. The ANN can learn many cases of the plaques using a large training set with examples of "good" and "bad" plaques. A well-trained neural


network should be a useful tool for fast and reliable decisions depending on the width of the plaque. This long-range research is at its beginning. The design of the expert system is relatively fast; the rules are determined by a sonographer and will be adaptable in the future. The design of the neural network is time-consuming due to the complexity of image analysis of ultrasound B-images, i.e. the selection of a suitable architecture and of features for computing the width of the plaque. This research is a challenge for a large team of experts: how to create helpful software for early diagnostics of atherosclerosis from measured data and from ultrasound B-images.

Acknowledgments. This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project IT4Innovations excellence in science - LQ1602.

References

1. Saijo, Y., van der Steen, A.F.W.: Vascular Ultrasound. Springer, Japan (2012). https://doi.org/10.1007/978-4-431-67871-7. (softcover reprint from 2003)
2. Dougherty, G.: Digital Image Processing for Medical Applications, 1st edn. Cambridge University Press (2009). ISBN 978-0-521-86085-7
3. Blahuta, J., Soukup, T., Cermak, P., Rozsypal, J., Vecerek, M.: Ultrasound medical image recognition with artificial intelligence for Parkinson's disease classification. In: Proceedings of the 35th International Convention, MIPRO 2012 (2012)
4. Blahuta, J., Cermak, P., Soukup, T., Vecerek, M.: A reproducible application to BMODE transcranial ultrasound based on echogenicity evaluation analysis in defined area of interest. In: 6th International Conference on Soft Computing and Pattern Recognition (2014)
5. Blahuta, J., Soukup, T., Martinu, J.: An expert system based on using artificial neural network and region-based image processing to recognition substantia nigra and atherosclerotic plaques in b-images: a prospective study. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2017. LNCS, vol. 10305, pp. 236–245. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59153-7_21
6. Blahuta, J., et al.: A new program for highly reproducible automatic evaluation of the substantia nigra from transcranial sonographic images. Biomed. Papers 158(4), 621–627 (2014)
7. Skoloudik, D., et al.: Transcranial Sonography of the Insula: Digitized Image Analysis of Fusion Images with Magnetic Resonance. Ultraschall in der Medizin, Georg Thieme Verlag KG Stuttgart (2016)
8. Blahuta, J., Soukup, T., Cermak, P.: How to detect and analyze atherosclerotic plaques in B-MODE ultrasound images: a pilot study of reproducibility of computer analysis. In: Dichev, C., Agre, G. (eds.) AIMSA 2016. LNCS (LNAI), vol. 9883, pp. 360–363. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44748-3_37
9. Marvin, L.: Neural Networks with MATLAB. CreateSpace Independent Publishing Platform (2016). ISBN 978-1539701958
10. Herault, J.: Vision: Images, Signals and Neural Networks: Models of Neural Processing in Visual Perception (Progress in Neural Processing), 1st edn. World Scientific Publishing Company (2010). ISBN 978-9814273688
11. Hijazi, S., Kumar, R., Rowen, Ch.: Using Convolutional Neural Networks for Image Recognition. Cadence (2016)
12. Cermak, P., Pokorny, P.: The fuzzy-neuro development system FUZNET. In: 18th International Conference on Methods and Models in Automation and Robotics (MMAR), vol. 75, no. 80, pp. 26–29 (2013). ISBN 978-1-4673-5506-3

A Multi-channel Multi-classifier Method for Classifying Pancreatic Cystic Neoplasms Based on ResNet

Haigen Hu1, Kangjie Li1, Qiu Guan1, Feng Chen2, Shengyong Chen1, and Yicheng Ni3

1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, People's Republic of China
2 The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310006, People's Republic of China
3 Department of Imaging and Pathology, KU Leuven, Leuven, Belgium

Abstract. Pancreatic cystic neoplasm (PCN) is one of the most common tumors in the digestive tract. It is still a challenging task for doctors to diagnose the types of pancreatic cystic neoplasms by using Computed Tomography (CT) images. Especially for serous cystic neoplasms (SCNs) and mucinous cystic neoplasms (MCNs), doctors hardly distinguish one from the other by the naked eyes owing to the high similarities between them. In this work, a multi-channel multiple-classiﬁer (MCMC) model is proposed to distinguish the two pancreatic cystic neoplasms in CT images. At ﬁrst, multi-channel images are used to enhance the image edge of the tumor, then the residual network is adopted to extract features. Finally, the multiple classiﬁers are applied to classify the results. Experiments show that the proposed method can eﬀectively improve the classiﬁcation eﬀect, and the results can help doctors to utilize the CT images to achieve reliable non-invasive disease diagnosis. Keywords: Non-invasive disease diagnosis · Multi-channel images Multi-classiﬁer · ResNet · Pancreatic cystic neoplasms (PCNs)

1 Introduction

Pancreatic cystic neoplasm (PCN) [1–4], mainly characterized by the proliferation of pancreatic ductal or acinar epithelial cells and the secretion of cysts, is a type of pancreatic cystic lesion (PCL). According to histopathological criteria, the World Health Organization (WHO) in 2010 loosely grouped pancreatic cystic neoplasms (PCNs) into non-mucinous and mucinous tumors, which mainly comprise serous cystic neoplasms (SCNs) and mucinous cystic neoplasms (MCNs), respectively. In preoperative diagnosis, the levels of CEA and CA199 are generally first measured by fine-needle aspiration biopsy through endoscopic ultrasonography or biopsies, and the results are then used to distinguish benign from malignant neoplasms. However, these methods still have problems and limitations: the puncture technique may be improper, the cyst may be too small to be located accurately, and the puncture specimens may be contaminated. All of these problems make it difficult to identify the pathological type of a PCN accurately before an operation, and they have prevented the widespread use of these techniques. Therefore, accurate diagnosis of PCNs by image examination is of great importance and value for doctors in determining the treatment and the timing of the operation for a patient.

Figure 1 shows two different kinds of PCNs: SCNs and MCNs. According to clinical statistics, SCNs are benign neoplasms, and patients with SCNs do not need surgery immediately. In contrast, MCNs have a high probability of malignant transformation. However, as shown in Figs. 1(c) and (d), SCNs and MCNs are so similar that they can hardly be distinguished by the naked eye. Therefore, it is essential to explore computer-aided diagnosis methods that help clinicians achieve reliable non-invasive disease diagnosis and improve the objectivity and rationality of treatment.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 101–108, 2018. https://doi.org/10.1007/978-3-030-01421-6_10

Fig. 1. Two different kinds of PCNs: (a) SCN, (b) MCN, (c) SCN, (d) MCN. (c) and (d) are almost indistinguishable, but they belong to different kinds of PCNs.

With the development of computer vision technology, this issue has attracted more and more attention in the medical image processing community. For example, Li et al. [5] verify the effectiveness of additional information from spectral CT for distinguishing serous oligocystic adenomas from mucinous cystic neoplasms using machine-learning algorithms. In [7], a method based on a Bayesian combination of a random forest classifier and a CNN is proposed to make use of both clinical information about the patient and fine imaging information from CT scans. However, the above-mentioned methods require the images to be segmented in advance; the segmented cysts or pancreas are then classified by the classifiers. In recent years, deep-learning-based [6] methods have been proposed for classification [8–10], detection [11] and segmentation in medical image processing. In 2017, Esteva et al. [12] applied deep learning to skin cancer classification and reached the level of a dermatologist.

The rest of the paper is organized as follows. In Sect. 2, the MCMC method is proposed by integrating multiple channels and multiple classifiers based on ResNet. Section 3 describes the experimental results and discusses the proposed method on the PCN dataset. Finally, Sect. 4 presents conclusions and future work.

2 Methods

2.1 ResNet

ResNet [13] is built on "shortcut connections": as shown in Fig. 2, one or more layers can be skipped in the network. Each skip-connected computation is called a residual block, and its output $y_l$ is defined as

$$y_l = y_{l-1} + H(y_{l-1}) \qquad (1)$$

where $y_{l-1}$ is the input of the block and $H$ contains convolution, batch normalization (BN) [14] and rectified linear units (ReLU) [15].

Fig. 2. The structure of the residual block.

The convolutional layer is an important layer in CNNs, and it realizes local receptive fields. ReLU is the most popular activation function in DNNs owing to its simplicity and efficiency, and it partly alleviates the vanishing gradient problem. As the network deepens, the feature distribution gradually shifts, and convergence slows down during training. The essential cause is that the gradient of the low-level layers vanishes in backward propagation. Batch normalization layers are used to address this problem.
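The residual computation in Eq. (1) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the transform `H` is reduced here to a hypothetical linear map followed by ReLU (no convolution or batch normalization), and the names `residual_block` and `make_transform` are illustrative assumptions.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def linear(weights: List[Vector], v: Vector) -> Vector:
    """Matrix-vector product; stands in for the convolution inside H."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def make_transform(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Build a toy H(.) = ReLU(W y). Assumption: BN is omitted for brevity."""
    def transform(v: Vector) -> Vector:
        return relu(linear(weights, v))
    return transform


def residual_block(y_prev: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Compute y_l = y_{l-1} + H(y_{l-1}), the shortcut connection of Eq. (1)."""
    h = transform(y_prev)
    if len(h) != len(y_prev):
        raise ValueError("H must preserve the dimension of its input")
    return [a + b for a, b in zip(y_prev, h)]


if __name__ == "__main__":
    # With all-zero weights, H contributes nothing and the block is the identity.
    print(residual_block([1.0, 2.0], make_transform([[0.0, 0.0], [0.0, 0.0]])))
```

When `H` outputs zero the block passes its input through unchanged, which is the property that lets information (and gradients) bypass the skipped layers.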


In this work, a ResNet-50 model is used to extract features and to classify PCNs with a Softmax classifier through end-to-end training. The ResNet-50 model includes 16 residual blocks; feature information is passed through these blocks, which helps to avoid vanishing gradients.

2.2 MCMC

This section describes the proposed multi-channel multi-classifier method for the classification of PCNs; the corresponding framework is illustrated in Fig. 3. First, a single-channel image is converted into a multi-channel image whose channels are obtained by adjusting the window width and window level of the original image, by applying Canny edge detection, and by calculating the gradient magnitude, respectively. In this way, the image becomes clearer and carries enhanced edge information. Second, the residual network is trained end-to-end to classify the images and extract features. The 2048-dimensional features obtained from the residual network are then classified by a Bayesian classifier [16] and a k-Nearest Neighbor (KNN) classifier [17]. Notably, the outputs of the residual network and of the two classifiers are class probability values. Finally, these probability values are classified by a random forest [18].

Fig. 3. The structure consists of two parts: (i) Multi-channel and (ii) Multi-classiﬁer. The multi-channel is constructed by adjusting the window width and window level of the original image, by using the Canny edge detection, and by calculating the gradient magnitude, respectively. The multi-classiﬁer includes Softmax, Bayesian, KNN and random forest.
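The multi-channel construction can be sketched as follows, under stated assumptions. The window level and width values are hypothetical, the gradient is computed with simple central differences on the windowed image, and the edge channel is a thresholded gradient used only as a stand-in for Canny edge detection (which in practice would come from an image-processing library). None of these choices are confirmed by the paper.

```python
import math
from typing import List

Image = List[List[float]]


def apply_window(image: Image, level: float, width: float) -> Image:
    """Clip intensities to [level - width/2, level + width/2] and rescale to [0, 1]."""
    lo = level - width / 2.0
    hi = level + width / 2.0
    return [[(min(max(p, lo), hi) - lo) / (hi - lo) for p in row] for row in image]


def gradient_magnitude(image: Image) -> Image:
    """Gradient magnitude from central differences with replicated borders."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = image[r][min(c + 1, cols - 1)] - image[r][max(c - 1, 0)]
            gy = image[min(r + 1, rows - 1)][c] - image[max(r - 1, 0)][c]
            out[r][c] = math.hypot(gx, gy)
    return out


def edge_map(grad: Image, threshold: float) -> Image:
    """Binary edge channel. Assumption: a crude stand-in for Canny edge detection."""
    return [[1.0 if g >= threshold else 0.0 for g in row] for row in grad]


def to_multichannel(ct: Image, level: float, width: float, edge_threshold: float):
    """Stack (windowed, edges, gradient magnitude) into a rows x cols x 3 image."""
    windowed = apply_window(ct, level, width)
    grad = gradient_magnitude(windowed)
    edges = edge_map(grad, edge_threshold)
    rows, cols = len(ct), len(ct[0])
    return [[[windowed[r][c], edges[r][c], grad[r][c]] for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    # Hypothetical CT values and window settings, chosen only for illustration.
    ct_slice = [[-100.0, 40.0, 180.0], [-100.0, 40.0, 180.0]]
    print(to_multichannel(ct_slice, level=40.0, width=280.0, edge_threshold=0.4))
```

The three channels play the role of the RGB planes expected by a ResNet pre-trained on colour images, so the single-channel CT slice can be fed to the network without changing its input layer.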

3 Experiments and Results

3.1 Datasets and Experiments

The dataset comes from the First Affiliated Hospital of Zhejiang University, China. It contains 3,076 CT images of the two most common PCNs: 1,763 SCN images and 1,313 MCN images. Of these, 615 images (about 20%) are randomly selected as the test set, comprising 340 SCNs and 275 MCNs. All experiments are run on a computer with an E5-2620 2.0 GHz six-core CPU, 32 GB of RAM, and a single Tesla K20 graphics card.

3.2 Results and Discussions

The results are shown in Table 1. The accuracies obtained with the traditional methods [19,20] are all below 61%, whereas the accuracies obtained with convolutional neural networks are much higher owing to their feature extraction ability. Comparing SC-ResNet with MCMC shows that adding edge features through the multi-channel image effectively improves every evaluation index. Statistics over the 2048-dimensional features extracted by ResNet-50 show that the number of invalid feature values extracted from single-channel images is twice that extracted from multi-channel images. The increase in valid features therefore contributes to the improved classification of multi-channel images.

Table 1. Results of different methods

Methods          Sensitivity  Specificity  Precision  Accuracy  F-score
Gabor-KNN        94.12%       9.09%        56.14%     56.10%    70.33%
Gabor-Bayesian   75.00%       30.91%       57.30%     55.28%    64.97%
GLCM-KNN         72.65%       46.55%       62.69%     60.98%    67.30%
GLCM-Bayesian    85.29%       30.55%       60.29%     60.81%    70.64%
SC-ResNet        92.06%       68.73%       78.45%     81.63%    84.71%
MCMC             92.65%       85.45%       88.73%     89.43%    90.65%
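The indicators in Table 1 follow from confusion-matrix counts, which the following sketch makes explicit. The counts used in the example are an assumption: they are back-calculated from the MCMC percentages in Table 7 and the 340 SCN / 275 MCN test split, with SCN treated as the positive class, and are not reported by the authors.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard binary indicators computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Assumed counts: 315 of 340 SCNs and 235 of 275 MCNs classified correctly.
    mcmc = classification_metrics(tp=315, fn=25, fp=40, tn=235)
    for name, value in mcmc.items():
        print(f"{name}: {100 * value:.2f}%")
```

With these assumed counts the sketch reproduces the MCMC row of Table 1 (92.65%, 85.45%, 88.73%, 89.43%, 90.65%), which suggests that the reported indicators are mutually consistent.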

From the results, the integrated multi-classifier method (MCMC) performs well on most performance indicators. Its sensitivity is not the best among all methods, but it achieves the best specificity, precision, F-score and accuracy. The traditional methods (Gabor-KNN, Gabor-Bayesian, GLCM-KNN and GLCM-Bayesian) have high sensitivity but low specificity. The confusion matrices in Tables 2, 3, 4 and 5 show that, for these methods, a large number of mucinous cystic neoplasms are incorrectly identified as serous cystic neoplasms. There are two possible reasons for this: (1) no key features are extracted; (2) overfitting happens. Moreover, as shown in Tables 6 and 7, the classification results obtained with ResNet are clearly improved. In particular, the results of the proposed method in Table 7 are the best, which is why the specificity of MCMC in Table 1 is the highest among these methods.

Table 2. Results of Gabor-KNN

Ground truth   Prediction (%)
               SCN       MCN
SCN            94.12%    5.88%
MCN            90.91%    9.09%

Table 3. Results of Gabor-Bayesian

Ground truth   Prediction (%)
               SCN       MCN
SCN            75.00%    25.00%
MCN            69.09%    30.91%

Table 4. Results of GLCM-KNN

Ground truth   Prediction (%)
               SCN       MCN
SCN            72.65%    27.35%
MCN            53.45%    46.55%

Table 5. Results of GLCM-Bayesian

Ground truth   Prediction (%)
               SCN       MCN
SCN            85.29%    14.71%
MCN            69.45%    30.55%

Table 6. Results of SC-ResNet

Ground truth   Prediction (%)
               SCN       MCN
SCN            92.06%    7.94%
MCN            31.27%    68.73%

Table 7. Results of MCMC

Ground truth   Prediction (%)
               SCN       MCN
SCN            92.65%    7.35%
MCN            14.55%    85.45%

4 Conclusion and Future Work

In this work, a multi-channel and multi-classifier method is proposed for the PCN classification problem. A multi-channel image is built from an original CT image by adjusting the window width and window level, by applying Canny edge detection, and by calculating the gradient magnitude, respectively. A series of comparison experiments shows that enhancing edge features and integrating multiple classifiers both improve classification. The proposed MCMC method obtains the best results on the comprehensive indices F-score and accuracy, and it also ranks highly among all methods in sensitivity, specificity and precision. The results can help doctors use CT images to achieve reliable non-invasive disease diagnosis. In the future, clinical information and the localization and segmentation of PCNs will be integrated to assist diagnosis.


Acknowledgements. The authors would like to express their appreciation to the referees for their helpful comments and suggestions. This work was supported in part by Natural Science Foundation of Zhejiang Province (Grant No. LY18F030025), and in part by National Natural Science Foundation of China (Grant No. 61374094, U1509207, 31640053).

References

1. Hruban, R.H., et al.: Pancreatic intraepithelial neoplasia: a new nomenclature and classification system for pancreatic duct lesions. Am. J. Surg. Pathol. 25(5), 579–586 (2001)
2. Hruban, R.H., et al.: An illustrated consensus on the classification of pancreatic intraepithelial neoplasia and intraductal papillary mucinous neoplasms. Am. J. Surg. Pathol. 28(8), 977–987 (2004)
3. Brugge, W.R., Lauwers, G.Y., Sahani, D., Fernandez-del Castillo, C., Warshaw, A.L.: Cystic neoplasms of the pancreas. N. Engl. J. Med. 351(12), 1218–1226 (2004)
4. Brugge, W.R., et al.: Diagnosis of pancreatic cystic neoplasms: a report of the cooperative pancreatic cyst study. Gastroenterology 126(5), 1330–1336 (2004)
5. Li, C., Lin, X.Z., Wang, R., Hui, C., Lam, K.M., Zhang, S.: Differentiating pancreatic mucinous cystic neoplasms form serous oligocystic adenomas in spectral CT images using machine learning algorithms: a preliminary study. In: International Conference on Machine Learning and Cybernetics (ICMLC), vol. 1, pp. 271–276, Tianjin (2013)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
7. Dmitriev, K., et al.: Classification of pancreatic cysts in computed tomography images using a random forest and convolutional neural network ensemble. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 150–158, Quebec City (2017)
8. Bayramoglu, N., Heikkilä, J.: Transfer learning for cell nuclei classification in histopathology images. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 532–539. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_46
9. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S.: Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans. Med. Imaging 35(5), 1207–1216 (2016)
10. Hussein, S., Kandel, P., Corral, J.E., Bolan, C.W., Wallace, M.B., Bagci, U.: Deep multi-modal classification of intraductal papillary mucinous neoplasms (IPMN) with canonical correlation analysis. arXiv preprint arXiv:1710.09779 (2017)
11. Hu, H., Guan, Q., Chen, S., Ji, Z., Yao, L.: Detection and recognition for life state of cell cancer using two-stage cascade CNNs. IEEE/ACM Trans. Comput. Biol. Bioinform. (2017)
12. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
14. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
15. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)
16. Chen, J., Huang, H., Tian, S., Qu, Y.: Feature selection for text classification with Naïve Bayes. Expert Syst. Appl. 36(3), 5432–5435 (2009)
17. Ma, L., Crawford, M.M., Tian, J.: Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 48(11), 4099–4109 (2010)
18. Gislason, P.O., Benediktsson, J.A., Sveinsson, J.R.: Random forests for land cover classification. Pattern Recogn. Lett. 27(4), 294–300 (2006)
19. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Process. 11(4), 467–476 (2002)
20. Jain, S.: Brain cancer classification using GLCM based feature extraction in artificial neural network. Int. J. Comput. Sci. Eng. Technol. 4(7), 966–970 (2013)

Breast Cancer Histopathological Image Classification via Deep Active Learning and Confidence Boosting

Baolin Du1,2(✉), Qi Qi1,2(✉), Han Zheng1,2(✉), Yue Huang1,2(✉), and Xinghao Ding1,2(✉)

1 Fujian Key Laboratory of Sensing and Computing for Smart City, Xiamen University, Xiamen 361005, Fujian, China
2 School of Information Science and Engineering, Xiamen University, Xiamen 361005, Fujian, China
{yhuang2010,dxh}@xmu.edu.cn

Abstract. Classifying images as benign or malignant is one of the basic image processing tasks in digital pathology for breast cancer diagnosis. Deep learning methods trained on large-scale labeled data have received increasing attention recently, but collecting and annotating clinical data requires professional skill and is time-consuming. The proposed work develops a deep active learning framework to reduce the annotation burden, in which the method actively selects the most valuable unlabeled samples to be annotated instead of selecting them randomly. In addition, compared with the standard query strategy of previous active learning methods, the proposed query strategy combines manual labeling and auto-labeling to emphasize the confidence boosting effect. We validate the proposed work on a public histopathological image dataset. The experimental results demonstrate that the proposed method reduces the amount of labeled data by up to 52% compared with random selection. It also outperforms the deep active learning method with the standard query strategy on the same tasks.

Keywords: Breast cancer · Histopathological image analysis · Deep active learning · Query strategy

1 Introduction

Breast cancer is the most common cancer in women worldwide, and it is also associated with high morbidity and mortality [1]. Diagnosis from histopathological images under microscopy is one of the gold standards in clinical applications. With the development of imaging sensors, histopathological slides can be scanned and saved as digital images. As digital image sizes increase dramatically with magnification, it would be ideal to develop image processing and analysis tools, e.g. for classification, for computer-aided diagnosis (CAD) of breast cancer.

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 109–116, 2018. https://doi.org/10.1007/978-3-030-01421-6_11


Hand-crafted features, such as the scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG), gray-level co-occurrence matrix and kernel methods, have been reported in recognition and classification tasks in breast cancer histopathological image analysis. Well-known classifiers, e.g. support vector machines (SVM), have been reported as well. Recently, deep learning methods, for example convolutional neural networks (CNN), have received more attention and achieved impressive performance in many histopathological image processing tasks for breast cancer research, including recognition, classification and segmentation [2]. Chen et al. [3] detected cell mitosis in breast histology images using a deep cascaded CNN, which dramatically improved detection accuracy over other methods in the 2014 ICPR MITOS-ATYPIA Challenge. Wang et al. [4] used a 27-layer CNN for breast cancer metastasis detection and won first place in the Metastasis Detection Challenge of ISBI 2016. Spanhol et al. [5] trained AlexNet [6] to classify benign and malignant breast cancer pathological images, with a result 6% higher than that of traditional machine learning classification algorithms. Bayramoglu et al. [7] applied deep learning to magnification-independent breast pathology image classification and reached a recognition rate of 83%. Spanhol et al. [8] evaluated DeCAF features for breast cancer recognition, increasing the accuracy to 89%. Wei et al. [9] proposed a breast cancer histopathological image classification method based on deep convolutional neural networks, named the BiCNN model, which achieved a higher classification accuracy (up to 97%).

The reported state-of-the-art methods rely strongly on large-scale labeled data for training the network. However, from the viewpoint of real-world applications, large-scale labeling of medical images is tedious and extremely expensive, and it usually requires stronger professional skills than annotating natural images. Very few works have addressed reducing the labeling burden in this task. We proposed a deep domain adaptation method with PCANet and a domain alignment operation to reduce the labeling cost by transferring knowledge from the source dataset to the target one [10]. We also introduced self-taught learning into PCANet to reduce the labeling burden [11]. However, the labeled images in the training data are still randomly selected in these previous works.

In the proposed work, we aim to improve the deep learning architecture for classifying breast cancer histopathological images by means of a deep active learning framework. Instead of random selection, active learning methods usually select the samples with the lowest confidence (highest entropy) as valuable samples; these are added to the annotation query, and the network is then fine-tuned incrementally [12]. In the proposed method, inspired by boosting, the query strategy is also improved: samples with both high and low confidence are considered simultaneously to emphasize confidence boosting. We consider that the network should be fine-tuned with additional supervision and with its previous regularization simultaneously. The contributions of the proposed work can be summarized as follows: (1) the labeling cost is reduced compared with random selection; (2) the method outperforms the standard active learning query strategy thanks to the entropy boosting effect.


2 Proposed Method

As a topic in machine learning, active learning seeks the most informative samples in a large unlabeled dataset and sends them to an annotation query, in order to reduce the labeling effort. We introduce the idea of active learning into our method to reduce the labeling cost required by deep learning methods in breast cancer pathological image classification. First, the network is initialized with a very small amount of randomly selected labeled data. Second, the key problem in active learning is how to define the criterion for "valuable" samples. In the standard query strategy, the value of a sample is usually defined by the entropy calculated with the deep architecture:

$$e_i = -\sum_{k=1}^{Y} p_i^{j,k} \log p_i^{j,k} \qquad (1)$$

where $p_i^{j,k}$ is the confidence value of the network for sample $x_i$ on category $k$ (the index $j$ presumably denotes the query round, cf. $M_j$ below), and $Y$ is the number of categories. Entropy captures the uncertainty of the classification system in each prediction: a larger entropy value denotes higher uncertainty. In the standard query strategy, active learning methods add high-entropy samples to the annotation query until the query is full. The network is then fine-tuned incrementally with the labeled samples.

In the proposed work, we believe that the incremental fine-tuning of the network should be driven by two factors: the additional supervision from manual labeling and the regularization from the previous network. Thus, in the proposed query strategy, inspired by the idea of boosting, samples with high entropy values and samples with low entropy values are both considered, for a boosting effect. It should be mentioned that the samples with high confidence (low entropy) are labeled by the previous network instead of by manual annotation, so there is no additional labeling cost compared with the standard active learning query strategy. The algorithm is detailed as follows.

Algorithm 1: The proposed query strategy.
Input: $B_{train}$: the training set for the specified dataset; $M_0$: pre-trained CNN; $T_a$: active learning times; $n$: labelled queue size; $m$: compare queue size
Output: $M_{T_a}$: the final fine-tuned CNN model
for $t = 1$ to $T_a$ do
    for Samples in UntagPool do
        Entropy, EClass = ComputeEntropy($M_{t-1}$, Samples);
    end
    Labelled queue, Compare queue, UntagPool = SelectSamplebyEntropy(Entropy, EClass, n, m);
    Active batch = (Active batch, Labelled queue);
    UntagPool = UpdateUntagPool(UntagPool, Active batch);
    Train batch = concatenate(Active batch, Compare queue);
    $M_t$ = TrainNet($M_{t-1}$, Train batch);
end


As shown in Algorithm 1, let $B$ represent the whole dataset with $n_B$ images, divided into a training set $B_{train}$ and a test set $B_{test}$. The CNN model, denoted $M_0$, is initialized with $n_i$ randomly selected samples in each category, where $n_i$ is set to a very small value, for example two. For convenience, $B_{train}$ is divided into labeled data $B_l$ and the remaining unlabeled data $B_u$, with sizes $n_l$ and $n_u$ respectively, where $n_l + n_u = n_B$. Both $n_l$ and $n_u$ change during incremental network learning, since the main idea of active learning is to select the most valuable samples from $B_u$ into the annotation queue $A_t$ for manual annotation. In each query round, the network is fine-tuned with all the labeled samples in $A_t$; $n_l$ then becomes $n_l + n$ and $n_u$ becomes $n_u - n$, where $n$ is the size of $A_t$. A widely used criterion is to select the $n/2$ samples with the highest entropy values in each category. The number of queries is set to $T_a$, so the fine-tuned network after each query is denoted $M_j$, $j = 0, \ldots, T_a$. In the proposed work, the network in each query is fine-tuned with samples of both high and low entropy: besides the $n$ manually annotated samples, $A_t$ contains an additional $m$ samples with the lowest entropy values in each category. It should be mentioned that these $m$ samples are auto-labeled by the previous network.
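The query step described above can be sketched in plain Python. This is a hedged illustration rather than the authors' code: the function names `entropy` and `select_queries` are hypothetical, "category" is taken to mean the class predicted by the current network (unlabeled samples have no true label), and the per-class counts are passed in directly instead of being derived from the queue sizes $n$ and $m$.

```python
import math
from typing import Dict, List, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a predicted class distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(
    predictions: Dict[str, List[float]],
    n_per_class: int,
    m_per_class: int,
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Split unlabeled samples into a manual queue and an auto-labeled queue.

    For each predicted class, the n_per_class most uncertain samples go to
    manual annotation; among the rest, the m_per_class most confident samples
    are auto-labeled with the class predicted by the current network.
    """
    by_class: Dict[int, List[Tuple[float, str]]] = {}
    for sample_id, probs in predictions.items():
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        by_class.setdefault(predicted, []).append((entropy(probs), sample_id))

    manual: List[str] = []
    auto: List[Tuple[str, int]] = []
    for predicted in sorted(by_class):
        ranked = sorted(by_class[predicted], reverse=True)  # highest entropy first
        manual.extend(sid for _, sid in ranked[:n_per_class])
        remaining = ranked[n_per_class:]
        confident = remaining[-m_per_class:] if m_per_class > 0 else []
        auto.extend((sid, predicted) for _, sid in confident)
    return manual, auto


if __name__ == "__main__":
    # Hypothetical softmax outputs for six unlabeled images (benign, malignant).
    preds = {
        "a": [0.99, 0.01], "b": [0.55, 0.45], "c": [0.80, 0.20],
        "d": [0.05, 0.95], "e": [0.40, 0.60], "f": [0.10, 0.90],
    }
    print(select_queries(preds, n_per_class=1, m_per_class=1))
```

Only the manual queue incurs annotation cost; the auto-labeled samples reuse the previous network's predictions, which is how the strategy adds training data without adding labeling effort.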

3 Experiment

3.1 Dataset Description

The proposed framework is evaluated on a public dataset of breast cancer histopathological images, BreaKHis [13]. This large-scale dataset contains 7909 images from 82 breast cancer patients. The images are divided into benign and malignant tumors and are scanned at four magnification factors: 40X, 100X, 200X, and 400X. The pathological images have a size of 700 × 460 pixels in RGB format. Sample images from the database are shown in Fig. 1.

Fig. 1. Breast cancer histopathological image samples in BreaKHis at magnification factors 40X, 100X, 200X and 400X. (Top: benign. Bottom: malignant.)

3.2 Implementation Details

The proposed algorithm is implemented with the TensorFlow framework. The basic CNN architecture is AlexNet pre-trained on ImageNet [14]. The server has an Intel 2.2-GHz CPU and an NVIDIA GeForce GTX 1080Ti GPU. The dataset is randomly divided into training data (70%) and testing data (30%) with no overlap. In both the training and the testing set, the categories are balanced to the same size. The proposed work is evaluated on image-level binary classification, that is, each image is predicted as benign or malignant. Since the two categories are balanced, classification accuracy is used as the validation metric:

$$\text{Image-level accuracy} = \frac{N_c}{N_{im}} \qquad (2)$$

where $N_{im}$ is the total number of images in the dataset and $N_c$ is the total number of images that are correctly classified. The network is initialized with one benign sample and one malignant sample randomly selected from the training data. In each experiment there are 5 query rounds, where the query size for manual labeling in each round is $N_m$. It should be mentioned that the network is fine-tuned incrementally with 64 labeled images after each query, 48 of which are manually labeled and the other 16 auto-labeled.

3.3 Experiment Result

Experimental results for the four magnification factors are shown in Fig. 2 and Table 1. The figures show that both the standard deep active learning method and the proposed framework consistently perform better than incremental learning with random selection in all experiments. The proposed method can save up to 52% of the labeling cost compared with random selection while achieving a similar accuracy. This demonstrates that, from the viewpoint of real-world applications, the proposed framework is a better option for recognition tasks with deep learning methods. The proposed method also outperforms the deep active learning method that uses only high-entropy samples.


Fig. 2. Performance of entropy active learning, random active learning and our proposed method over 5 rounds of active learning, at magnification factors (a) 40X, (b) 100X, (c) 200X and (d) 400X.

Table 1. Annotation cost of our proposed method, random active learning and entropy active learning at similar accuracies. The cost is the number of labeled samples.

             Magnification factors
Strategy     40X               100X              200X              400X
             Accuracy  Cost    Accuracy  Cost    Accuracy  Cost    Accuracy  Cost
Proposed     90.69%    288     90.46%    240     90.64%    192     90.96%    336
EntropyAL    90.96%    500     91.24%    400     91.98%    300     90.11%    450
RandomAL     90.96%    400     90.46%    400     90.37%    400     89.75%    400
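The reported saving of up to 52% can be checked directly from the costs in Table 1. The short sketch below computes the relative saving of the proposed method against random selection for each magnification factor; the helper name `labeling_saving` is an illustrative assumption, and the comparison ignores the small accuracy differences between the rows.

```python
from typing import Dict, Tuple


def labeling_saving(cost_method: int, cost_baseline: int) -> float:
    """Fraction of labels saved relative to the baseline strategy."""
    return 1.0 - cost_method / cost_baseline


# (proposed cost, random-selection cost) per magnification, copied from Table 1.
COSTS: Dict[str, Tuple[int, int]] = {
    "40X": (288, 400),
    "100X": (240, 400),
    "200X": (192, 400),
    "400X": (336, 400),
}

SAVINGS: Dict[str, int] = {
    mag: round(100 * labeling_saving(proposed, random_cost))
    for mag, (proposed, random_cost) in COSTS.items()
}

if __name__ == "__main__":
    for mag, saving in SAVINGS.items():
        print(f"{mag}: {saving}% fewer labeled samples than random selection")
```

The largest saving, 52%, occurs at 200X (192 versus 400 labeled samples), which matches the figure quoted in the text; the other magnifications give savings of 28%, 40% and 16%.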


4 Conclusions

We proposed a deep active learning framework for histopathological image analysis in breast cancer research. The main purpose of the work is to reduce the tedious labeling burden that arises when deep learning methods are used in medical applications. Instead of randomly selecting samples for annotation as training samples, the framework actively seeks the most valuable unlabeled data to be manually labeled, and then fine-tunes the network incrementally. In addition, we improve the query strategy with a confidence boosting operation, in which samples predicted with both high and low confidence are used for network training in each query round. The samples with high confidence are auto-labeled by the network, so there is no additional manual labeling cost compared with standard active learning methods. The experimental results on a large breast cancer histopathological image dataset demonstrate that our proposed method significantly reduces the labeling cost compared with random selection. It also achieves higher accuracy than the standard query strategy.

References

1. Lakhani, S.R., Ellis, I.O., Schnitt, S.: WHO classification of tumours of the breast. In: International Agency for Research on Cancer, WHO Press, Lyon (2012)
2. Veta, M., Pluim, J.P.W., van Diest, P.J.: Breast cancer histopathology image analysis: a review. IEEE Trans. Biomed. Eng. 2(5), 1400–1411 (2014)
3. Chen, H., Dou, Q., Wang, X.: Mitosis detection in breast cancer histology images via deep cascaded networks. In: Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, pp. 1160–1166. AAAI Press (2016)
4. Wang, D., Khosla, A., Gargeya, R.: Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718 (2016)
5. Spanhol, F.A., Oliveira, L.S.: Breast cancer histopathological image classification using convolutional neural networks. In: International Joint Conference on Neural Networks, Vancouver, BC, Canada, pp. 2561–2567. IEEE (2016)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, pp. 1097–1105. Curran Associates Inc. (2012)
7. Bayramoglu, N., Kannala, J., Heikkila, J.: Deep learning for magnification independent breast cancer histopathology image classification. In: International Conference on Pattern Recognition (ICPR), Cancun, Mexico, pp. 2440–2445. IEEE (2017)
8. Spanhol, F.A., Cavalin, P.R., Oliveira, L.S.: Deep features for breast cancer histopathological image classification. In: IEEE International Conference on Systems, Los Angeles, CA, USA, pp. 1868–1873 (2017)
9. Wei, B., Han, Z., He, X.: Deep learning model based breast cancer histopathological image classification. In: 2nd IEEE International Conference on Cloud Computing and Big Data Analysis, Chengdu, China, pp. 348–353. IEEE (2017)
10. Huang, Y., Zheng, H., Liu, C.: Epithelium-stroma classification via convolutional neural networks and unsupervised domain adaptation in histopathological images. IEEE J. Biomed. Health Inform. 21(6), 1625–1632 (2017)
11. Huang, Y., Zheng, H., Liu, C.: Epithelium-stroma classification in histopathological images via convolutional neural networks and self-taught learning. In: IEEE International Conference on Acoustics, pp. 1073–1077. IEEE, New Orleans, LA, USA (2017)
12. Huang, S.-J., Jin, R., Zhou, Z.-H.: Active learning by querying informative and representative examples. In: International Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, pp. 892–900. Curran Associates Inc. (2010)
13. Spanhol, F., Oliveira, L., Petitjean, C.: A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 61(7), 1455–1462 (2016)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, pp. 1097–1105. Curran Associates Inc. (2012)

Epileptic Seizure Prediction from EEG Signals Using Unsupervised Learning and a Polling-Based Decision Process

Lucas Aparecido Silva Kitano1, Miguel Angelo Abreu Sousa1(✉), Sara Dereste Santos1, Ricardo Pires1, Sigride Thome-Souza2, and Alexandre Brincalepe Campo1

1 Federal Institute of Education, Science and Technology of São Paulo, Rua Pedro Vicente, 625, São Paulo, 01109-010, Brazil
2 Institute of Psychiatry, Faculty of Medicine, University of São Paulo, R. Dr. Ovídio Pires de Campos, 785, São Paulo, SP 01060-970, Brazil

Abstract. Epilepsy is a central nervous system disorder defined by spontaneous seizures, whose unpredictability may put the physical integrity of patients at risk. It affects millions of people worldwide, and about 30% of them do not respond to treatment with anti-epileptic drugs (AEDs). Better seizure control through seizure prediction methods can therefore improve their quality of life. This paper presents a patient-specific method for seizure prediction that combines wavelet-transform preprocessing with the Self-Organizing Maps (SOM) unsupervised learning algorithm and a polling-based decision method. For each of nine patients from the CHB-MIT public database, only 20 min of 23-channel scalp electroencephalogram (EEG) were selected for the training phase. The proposed method achieved up to 98% sensitivity, 88% specificity and 91% accuracy. For each subsequence of EEG data received, the system takes less than one second to estimate the patient state, regarding the possibility of an impending seizure.

Keywords: Seizure prediction · Self-Organizing Maps · Polling-based decision process

1 Introduction

According to the World Health Organization, epilepsy is a central nervous system disorder that affects approximately 50 million people worldwide, making it one of the most common neurological diseases in the world. Epilepsy is characterized by spontaneous seizures that start in the brain. Seizures are brief episodes of involuntary movement that may involve a part of the body or the entire body, and are sometimes associated with a loss of consciousness [19]. For many patients, anti-epileptic drugs (AEDs) can be given at sufficiently high doses to prevent seizures, but this frequently causes side effects. For 20–40% of patients with epilepsy, AEDs are not effective [8]. Patients with epilepsy may experience anxiety due to the possibility of a seizure occurring at any time, in addition to risks to their physical integrity when performing activities such as driving or swimming. In this regard, the development of reliable techniques for seizure prediction could improve the quality of life of epilepsy patients by reducing the risk of injuries and allowing a better management of AEDs on an individual basis, consequently reducing side effects and improving the use of preventive AEDs in the face of an imminent epileptic seizure [8].

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 117–126, 2018. https://doi.org/10.1007/978-3-030-01421-6_12

There is evidence that the process of seizure generation (ictogenesis) is not random: it originates from a brain region and, in most patients, is associated with electroencephalogram (EEG) patterns. The EEG is a measure of the electrical activity generated by the nerve cells of the cerebral cortex. Epileptic EEG signals can be classified into four states: ictal (the epileptic seizure itself), postictal (the period immediately after the seizure), interictal (the period between seizures, considered a normal state of the patient) and preictal (the period immediately before the seizure onset) [4]. Success in epileptic seizure prediction requires differentiating the preictal state from the other three states. Since the preictal state is the transition from the interictal to the ictal state, a binary classification between the interictal and preictal states is of primary interest in seizure prediction.

Patient-specific seizure onset and pre-seizure onset patterns suggest that, from the machine learning perspective, patient-specific algorithms offer some advantage for epileptic seizure prediction [11]. Thus, supervised learning techniques are used with recorded data from each patient to discriminate characteristics between the preictal and interictal states [4]. In previous studies, patient-specific classifiers were used to separate the preictal and interictal states. In [6], binary classification with a linear Bayes classifier was applied to 7 patients, achieving 94% accuracy and 93% sensitivity. In [4], SVM, KNN and LDA classifiers were compared.
SVM achieved the best results (94% accuracy, 96% sensitivity and 90% specificity). SVM was also used in [13], achieving 97.5% sensitivity. In [2], SVM and a Kalman filter were combined, achieving a sensitivity of 100%. These results show that seizure prediction research has been improving over time: a 2003 study reported a sensitivity of 62.5% [3].

Besides good performance, a goal of seizure prediction research is real-time hardware application. Reducing the dimensionality of the processed EEG data may therefore enable neural processing with lower computational cost and/or processing time. Moreover, it can improve the results. In [4], an adaptive algorithm for EEG channel selection was proposed and compared with the use of all channels and with Principal Component Analysis (PCA). The proposed adaptive algorithm for channel selection achieved the best results. In [6], Pearson's correlation matrix was used for feature selection to eliminate redundant information. Another important characteristic for developing real-time hardware applications is the number of EEG hours required for the training phase, which implies large amounts of stored data and discomfort for patients during EEG recording. In [4], only 10 min of training data were used for each class (preictal and interictal), indicating that methods based on short training times can predict seizures.

In an attempt to find possible consistent patterns in the EEG signals, Self-Organizing Maps (SOM) [9] are applied here. This type of neural network, also known as the Kohonen neural network, is an unsupervised learning model that maps the input data into clusters. In the present proposal, SOM is used to identify the clusters corresponding to preictal and interictal states, in order to predict seizures. More specifically, in this work, SOM is integrated into a patient-specific prediction method along with 4-s EEG segmentation, wavelet preprocessing and a polling-based decision process. Only 20 min of data were used in the training phase, with 10 min for each state (preictal and interictal), in contrast to the many hours of EEG used in related works [1, 10, 11, 13, 14]. Moreover, the time taken to process the EEG signals was measured to quantify the computational effort required by the proposed method.

2 Methods

The EEG signals analyzed are described in Subsect. 2.1. The initial processing step consists of segmenting each EEG channel into 4-s windows. Then, feature extraction is performed as detailed in Subsect. 2.2. SOM is responsible for categorizing the input data into preictal and interictal states, as described in Subsect. 2.3. Finally, a sequence of 4-s windows is classified with a polling-based approach, which provides the prediction output of the system.

2.1 EEG Dataset

The dataset used in this work was recorded at the Children's Hospital Boston and is publicly available as the CHB-MIT EEG database [7, 16]. This dataset comprises scalp EEG recordings from pediatric patients with intractable seizures. Because scalp EEG is not invasive, it is advantageous compared to intracranial EEG. Pediatric EEG exhibits large variability in seizure and non-seizure activity [16]. The patients were monitored for several days following withdrawal of AEDs. All EEG signals were sampled at 256 Hz with 16-bit resolution and recorded mostly in 23 channels. The international 10–20 system [7] is used for the positioning of the EEG electrodes and for the nomenclature of the channels. In the present work, 9 of the 24 patients were selected because they had at least 5 recorded seizures and at least 30 min of preictal data available before seizure onset. The interictal state was defined as the period farther than 30 min from a seizure.

2.2 Feature Extraction

Seizure prediction methods commonly use feature vectors built from EEG signals. In the present work, EEG data were segmented into non-overlapping 4-s windows. Then, for each window, the Discrete Wavelet Transform (DWT) is computed and the number of zero-crossings of the level-1 detail coefficients is calculated [12]. In [4], the Haar, Daubechies-4 and Daubechies-8 mother wavelet basis functions were compared.
The Haar wavelet function gave the highest accuracy and, in addition, has lower computational complexity than the other wavelet functions. Based on those results, the mother wavelet used in the present work is Haar. As in [4], the zero-crossings of the first-level detail coefficients computed for each window result in a vector of dimension D = 23, one component per channel. The total number of vectors is n = T/(4 s), where T is the selected EEG period in seconds. To allow an easy visualization of the results, the vectors were represented in grayscale, with darker tones for the windows located temporally closer to the seizure (preictal state), varying proportionally to lighter tones for the windows farther from the seizure (interictal state), as in Fig. 1.

Fig. 1. Relationship between windows and grayscale attributed for n windows in training phase. Half of the data are in preictal state and the other half is in interictal state.
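The feature extraction of Subsect. 2.2 can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the orthonormal Haar convention, in which each level-1 detail coefficient is (x[2k] − x[2k+1])/√2 (the zero-crossing count does not depend on the sign or scale convention), and it assumes that exact zeros are skipped when counting sign changes. The function names are hypothetical.

```python
import math

FS = 256        # sampling rate in Hz, as stated for the CHB-MIT recordings
WINDOW_S = 4    # window length in seconds


def haar_detail_level1(signal):
    """Level-1 Haar detail coefficients: (x[2k] - x[2k+1]) / sqrt(2)."""
    return [(signal[k] - signal[k + 1]) / math.sqrt(2.0)
            for k in range(0, len(signal) - 1, 2)]


def zero_crossings(values):
    """Count sign changes; exact zeros are skipped (an assumed convention)."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def segment(channels, fs=FS, window_s=WINDOW_S):
    """Split per-channel sample lists into non-overlapping windows."""
    size = fs * window_s
    n = min(len(ch) for ch in channels) // size
    return [[ch[i * size:(i + 1) * size] for ch in channels] for i in range(n)]


def window_features(window):
    """One zero-crossing count per channel (D = 23 for a 23-channel window)."""
    return [zero_crossings(haar_detail_level1(ch)) for ch in window]
```

With 23 channels, each 4-s window of 1024 samples per channel is reduced to a single 23-dimensional vector of counts, which is the input presented to the SOM.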

2.3 SOM for Unsupervised Categorization of Epileptic EEG Signals

Originally proposed by Teuvo Kohonen in 1982 [9], SOM has been widely used for multidimensional characterization and clustering tasks [15]. The network structure is composed of a set of prototypes, or neurons, represented by vectors of the same dimensionality as the input data. This neural structure is organized in one-, two- or three-dimensional arrays and the neurons are arranged in a topological order. The clustering occurs as a result of the comparison of the initial weights assigned to the neurons with the input data vectors. The weights are iteratively adjusted based on the distances between them, so that similar vectors in the input space tend to be mapped onto neighboring neurons in the output array [5].

In the first step of the SOM training algorithm, the distance between an input vector x_i and each neuron weight vector w_j is computed, and the neuron whose weight vector is closest to the input is selected as the winning neuron c, or best matching unit (BMU), according to Eq. 1.

$$c = \arg\min_j \, \mathrm{dist}(w_j, x_i) \tag{1}$$

In the following step, the values of the neuron vectors are adjusted. The new weights are computed by Eq. 2.

$$w_j(t+1) = w_j(t) + \alpha(t) \cdot h_{cj}(t) \cdot \left(x_i - w_j(t)\right) \tag{2}$$

The learning rate α(t) is a problem-dependent parameter. Usually, its value decreases exponentially from an initial value α_i towards a final value α_f. The neighborhood function h_cj(t) determines the magnitude of the adjustment of the neuron vectors near the BMU according to the distance between them (dist_cj), as shown in Eq. 3.

$$h_{cj}(t) = e^{-\mathrm{dist}_{cj}^{2} / 2\sigma^{2}(t)} \tag{3}$$

During the training phase, the magnitude of the adjustments in the neighboring neuron vectors is reduced by decreasing the width of the neighborhood function, σ(t). Hence, to stabilize the self-organization process, σ(t) usually decays exponentially from an initial value σ_i to a final value σ_f.

In this work, the input data comprise 23-dimensional vectors resulting from the preprocessing of the 23 EEG channels, as described in Subsect. 2.2, and the network structure was configured both as a two-dimensional array and as a one-dimensional array, both presented in Fig. 2. The aim of the initial experiments using the SOM for the prediction of epileptic seizures was to verify a possible clustering effect in EEG signals that precede a seizure, after the preprocessing described in Subsect. 2.2. In the training phase, equal amounts of preictal and interictal data were selected and presented to the network.
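The training loop defined by Eqs. 1 to 3 can be sketched for the 1D case as follows. This is an illustrative sketch under stated assumptions rather than the code used in the paper: dist_cj is taken to be the grid distance |j − c| between neuron positions, weights are initialized uniformly at random in [0, 1), the samples are shuffled once per epoch, and both α(t) and σ(t) follow a per-epoch exponential schedule. The function names and the default number of neurons and epochs are hypothetical.

```python
import math
import random


def decay(start, end, t, t_max):
    """Exponential schedule going from `start` at t = 0 to `end` at t = t_max."""
    return start * (end / start) ** (t / t_max)


def bmu(weights, x):
    """Eq. 1: index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som_1d(data, n_neurons=10, epochs=100,
                 alpha_i=0.1, alpha_f=0.01, sigma_i=8.0, sigma_f=1.1, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumed initialization: uniform random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    t_max = max(epochs - 1, 1)
    for t in range(epochs):
        alpha = decay(alpha_i, alpha_f, t, t_max)
        sigma = decay(sigma_i, sigma_f, t, t_max)
        for x in rng.sample(data, len(data)):      # shuffled pass over the data
            c = bmu(weights, x)
            for j, w in enumerate(weights):
                # Eq. 3 with the grid distance |j - c| (assumption).
                h = math.exp(-((j - c) ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += alpha * h * (x[k] - w[k])   # Eq. 2
    return weights
```

In practice the feature vectors would likely be scaled to a common range before training; the paper does not state whether this was done.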

Fig. 2. Top left, 2D SOMs from patients CHB01 and CHB05, where darker neurons indicate preictal states and lighter ones indicate interictal states. Non activated neurons are represented using the background color. The corresponding U-matrices are presented at the bottom, where darker colors indicate larger distances and lighter ones, smaller distances. Top right, 1D SOM from patient CHB01 and the corresponding U-matrix. Bottom right, 1D SOM from patient CHB05 and the corresponding U-matrix.
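For a 1D SOM, the U-matrix shown in Fig. 2 can be approximated by the distance between each pair of adjacent neurons. The sketch below assumes this simplified definition and the Euclidean distance; locating the boundary between the two clusters at the largest gap is an illustrative heuristic, not a step described in the paper, and both function names are hypothetical.

```python
import math


def u_matrix_1d(weights):
    """Euclidean distance between each pair of adjacent neurons."""
    return [math.dist(a, b) for a, b in zip(weights, weights[1:])]


def split_index(weights):
    """Index of the first neuron after the largest gap (assumed heuristic)."""
    u = u_matrix_1d(weights)
    return max(range(len(u)), key=u.__getitem__) + 1
```

A large value in the list corresponds to a dark cell in the U-matrix, that is, a boundary between groups of similar neurons.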

Figure 2 shows typical results obtained in the tests with 2D SOMs, mapping EEG signals from patients 01 and 05 of the CHB-MIT dataset. The SOM neurons were clustered into two distinct categories. One category (represented by darker tones) is associated with signals recorded at most 30 min before seizure onset, and the other (represented by lighter tones) with signals recorded more than 30 min before seizure onset. The mapped categories can also be seen in the U-matrix of the SOM, which shows the vector distances among the neurons after the training process [18]. This favors the visual analysis of the grouping process, making it possible to distinguish different patterns in the dataset. The U-matrices associated with the SOMs of patients CHB01 and CHB05 are illustrated at the bottom of Fig. 2. The SOM was able to successfully categorize the interictal and preictal states of EEG signals in an unsupervised manner.

Given the consistent results obtained when repeating the initial experiments, in which the neural network succeeded in mapping EEG signals (splitting them into two clusters), a second sequence of experiments explored the use of a 1D SOM architecture. The objective of using a 1D SOM architecture was to reduce the processing time of EEG categorization, with real-time hardware applications in mind (as described in Sect. 1). The purpose of the second sequence of experiments was therefore to verify whether the new topology could achieve results similar to those obtained in the first sequence, i.e., whether the 1D SOM was also able to successfully categorize the interictal and preictal states of EEG signals in an unsupervised manner. Figure 2 shows typical results obtained with 1D SOMs and their respective U-matrices.

As shown in Fig. 2, the 1D SOM configurations were able to categorize the EEG signals presented to them. In those examples, the 1D SOMs identified two clusters, similarly to what had been observed in the 2D configurations: lighter neurons indicate the interictal states and darker ones the preictal states (according to the grayscale depicted in Fig. 1). In the 1D U-matrices illustrated in Fig. 2, the two classes are located at opposite ends of the network.

In both sequences of experiments (1D and 2D SOMs), the training parameters were: Euclidean distance; 10,000 epochs; α_i = 0.1; α_f = 0.01; σ_i = 8; and σ_f = 1.1.
Training the 2D SOMs took approximately 207 s on an Intel dual-core i5-7200U processor (3.1 GHz, 8 GB DDR RAM, 3 MB cache) running GNU/Linux, Ubuntu 16.04 LTS. On the same system, training the 1D architectures took approximately 74 s. Besides the learning phase, reducing processing time is also important during the inference phase. Section 3 describes the inference times compared to the duration of the EEG analyzed.

2.4 Classification and Evaluation

Subsections 2.2 and 2.3 presented EEG window segmentation, feature extraction and the proposed SOM model to categorize each of these windows. This subsection describes the proposed way of differentiating the preictal state from the interictal state so that seizure prediction can be performed. Moreover, it presents the methods used for evaluating the quality of the prediction results.

Figure 3 depicts the categorization method. As described in Subsect. 2.2, the 23-channel EEG signals are segmented into non-overlapping 4-s windows (Fig. 1). Then, the zero-crossings count of the DWT is computed for each of the channels, building a vector (d1, d2, …, d23), and a previously trained SOM is used to cluster the vectors. The category of each window is thus indicated by the neuron that is activated at the output of the network. Finally, a predicted state (interictal or preictal) is assigned by counting the results in a series of windows. Hence, each sequence of SOM outputs is submitted to a polling-based decision process. In Fig. 3, the SOM neurons are identified by numbers, with neurons 0 to 4 indicating preictal states and neurons 5 to 9 indicating interictal states. A sequence of seven activated neurons is also represented in the figure: 9-8-8-6-2-2-5. For illustration purposes, assuming that a series of 3 windows is grouped in one poll, the sequence 9-8-8 results in the classification of the current state as interictal (I). In another example, the poll groups the sequence 2-2-5, which results in the classification of the state as preictal (P). The next section discusses seizure prediction for nine patients from the CHB-MIT dataset and the influence of the number of windows in each series on the results.

Fig. 3. Sequences of EEG data classiﬁed and evaluated by the proposed polling method.
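The polling step of Fig. 3 amounts to a majority vote over the SOM outputs of one series of windows. The sketch below reproduces the two examples given in the text, using the neuron-to-state assignment of the figure (neurons 0 to 4 preictal, 5 to 9 interictal). Resolving a tie as interictal is an assumption, since the paper does not say how ties are handled when the number of windows is even; the function names are hypothetical.

```python
PREICTAL_NEURONS = set(range(0, 5))   # neurons 0-4, as in the Fig. 3 example


def window_state(neuron):
    """State indicated by one activated neuron."""
    return "P" if neuron in PREICTAL_NEURONS else "I"


def poll(neurons):
    """Majority vote over one series; ties resolve to 'I' (assumption)."""
    votes_p = sum(1 for n in neurons if window_state(n) == "P")
    return "P" if votes_p * 2 > len(neurons) else "I"


print(poll([9, 8, 8]))   # I
print(poll([2, 2, 5]))   # P
```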

3 Results and Discussion

To discuss the obtained results, three performance metrics are employed here: sensitivity, specificity and accuracy. These metrics are often used to evaluate seizure prediction methods [2, 4, 6, 11, 13]. The amount of data used for training was the same for all patients (20 min). For inference, the selected time was longer, according to the adopted criteria and the available data, as explained in Subsect. 2.1. Table 1 shows the test time selected for each patient.

Table 1. Amount of test data (min) for each patient.

Patient          chb01  chb03  chb05  chb06  chb08  chb10  chb11  chb20  chb20  Total
Test data (min)  110    212    100    150    229    279    66     166    342    1654

The experiments involved a total of 59 seizures, and 24810 EEG windows were individually classified. For each of the patients, the sensitivity, specificity and accuracy were calculated with nine different numbers of windows grouped in a poll: 1, 5, 15, 45, 90, 135, 180, 225 and 270. Figure 4 shows the average over all patients for sensitivity (circles), specificity (squares) and accuracy (triangles) of the obtained results. These values and the corresponding standard deviations are presented in Table 2.
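The paper does not spell out the three metrics, so the following sketch assumes the usual definitions, with the preictal state taken as the positive class. It also checks that the 1654 min of test data in Table 1 correspond to the 24810 windows of 4 s mentioned above. The function names and the counts in the usage example are hypothetical.

```python
def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion counts.

    tp/fn: preictal polls classified as preictal/interictal;
    tn/fp: interictal polls classified as interictal/preictal.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def n_windows(minutes, window_s=4):
    """Number of non-overlapping windows in a recording of `minutes`."""
    return minutes * 60 // window_s


print(n_windows(1654))   # 24810
```

Under these definitions, false positives (interictal data flagged as preictal) lower the specificity, while missed preictal data lower the sensitivity.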


Fig. 4. Average values for sensitivity, speciﬁcity and accuracy.

Table 2. Averages (μ) and standard deviations (σ) for the accuracy, sensitivity and specificity for different numbers of windows in each poll.

Windows per poll   1       5       15      45      90      135     180     225     270
Accuracy     μ     85.95%  86.01%  86.85%  87.31%  87.67%  88.56%  88.93%  90.60%  91.10%
             σ     11.92%  12.71%  13.07%  14.20%  14.89%  14.92%  15.44%  15.69%  15.32%
Sensitivity  μ     87.29%  87.86%  88.95%  89.62%  90.12%  92.14%  93.36%  97.41%  98.09%
             σ     14.02%  14.22%  13.67%  13.18%  14.44%  12.44%  11.53%  6.65%   5.72%
Specificity  μ     85.58%  85.98%  86.65%  86.98%  87.38%  87.51%  87.18%  87.41%  87.99%
             σ     17.32%  17.72%  18.15%  19.11%  20.43%  20.97%  21.69%  21.44%  20.57%

According to Fig. 4, sensitivity is the metric that increases most with the number of windows in a polling sequence, varying from 87% to 98%. In [6], [4] and [13], the sensitivity values reported are 93%, 96% and 97.5%, respectively. The best value achieved in the present work is higher than those in these related works, although no dimensionality reduction technique is applied here. This suggests that evaluation by polling is a reasonable tool for seizure prediction. On the other hand, specificity is less sensitive to the increase in the number of windows in a sequence (from 85.58% to 87.99% in Table 2, under 3 percentage points). This value is related to the false positives. In [4] and [11], the specificity reached 90% and both works used dimensionality reduction techniques, which suggests that specificity improvement may be related to dimensionality reduction. Finally, accuracy varied from 86% to 91%, while in [4] and [6] it reached 94%. In [17], SOM was used to distinguish preictal from interictal data, achieving 89.68% accuracy. As the accuracy can be improved by improving the specificity, reducing the data dimensionality may also improve the accuracy of the method. The results might therefore be even better after a preprocessing step for data dimensionality reduction.

In summary, increasing the number of windows improves the performance of the classifier. Moreover, this does not significantly affect the processing time needed to infer a result, because the system waits for the total number of windows only at initialization. From then on, inference happens with each new window received, that is, every 4 s. The new window simply replaces the oldest window in the sequence, and a new poll is performed over the updated sequence. It takes 0.912 s to process and infer a 4-s window, and a SOM inference takes 3 ms. These times were measured on a notebook on which other background tasks were running concurrently. For real-time hardware applications, the time spent can be reduced by using a dedicated chip (ASIC or FPGA). Techniques for dimensionality reduction may also further reduce the processing time and improve the results, as already discussed. Moreover, reducing the number of data acquisition channels helps decrease the computational cost, and a 23-channel device may be uncomfortable for day-to-day wearable use.
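The rolling behaviour described above (wait for a full sequence once, then update the poll with every new window) maps naturally onto a fixed-length queue. The sketch below is one possible realization, not the authors' code: the class name is hypothetical, the preictal neuron set is passed in by the caller, and ties resolve to interictal as an assumption.

```python
from collections import deque


class StreamingPoll:
    """Sliding poll over the most recent `size` SOM outputs."""

    def __init__(self, size, preictal_neurons):
        self.votes = deque(maxlen=size)       # oldest vote is dropped automatically
        self.preictal = set(preictal_neurons)

    def update(self, neuron):
        """Add the newest SOM output; return 'P', 'I', or None while filling."""
        self.votes.append(neuron in self.preictal)
        if len(self.votes) < self.votes.maxlen:
            return None                       # still waiting for the first full series
        return "P" if sum(self.votes) * 2 > len(self.votes) else "I"


sp = StreamingPoll(3, range(5))
print([sp.update(n) for n in [9, 8, 8, 6, 2, 2, 5]])
# [None, None, 'I', 'I', 'I', 'P', 'P']
```

Each update only appends one vote and recounts the queue, which is consistent with the observation that the poll size has little effect on the per-window inference time.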

4 Conclusion

In this paper, a patient-specific seizure prediction method was proposed, involving the first level of the discrete wavelet transform, zero-crossings of the detail coefficients, the SOM unsupervised learning algorithm and a polling-based decision process to estimate the patient state as preictal or interictal. By increasing the number of windows in the polling process, improvements were achieved in terms of accuracy, specificity and sensitivity, up to 91%, 88% and 98%, respectively. These values are close to the best results found in related works. The main contributions of this work are the short EEG time employed in the training phase and the low processing time, even without using any dimensionality reduction technique. Such characteristics are relevant for future real-time hardware applications. Following this work, different techniques for feature selection and data dimensionality reduction will be explored, in order to adapt a real-time monitoring system to continuous learning scenarios in which the EEG data may suffer from drift effects.

Acknowledgments. The authors acknowledge the National Council for Scientific and Technological Development (CNPq) for the undergraduate scholarship granted to Lucas Aparecido Silva Kitano.

References

1. Alawieh, H., Hammoud, H., Haidar, M., Nassralla, M.H., El-Hajj, A.M., Dawy, Z.: Patient-aware adaptive ngram-based algorithm for epileptic seizure prediction using EEG signals. In: 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), pp. 1–6 (2016)
2. Chisci, L., et al.: Real-time epileptic seizure prediction using AR models and support vector machines. IEEE Trans. Biomed. Eng. 57(5), 1124–1132 (2010)
3. D'Alessandro, M., Esteller, R., Vachtsevanos, G., Hinson, A., Echauz, J., Litt, B.: Epileptic seizure prediction using hybrid feature selection over multiple intracranial EEG electrode contacts: a report of four patients. IEEE Trans. Biomed. Eng. 50(5), 603–615 (2003)
4. Elgohary, S., Eldawlatly, S., Khalil, M.I.: Epileptic seizure prediction using zero-crossings analysis of EEG wavelet detail coefficients. In: 2016 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–6 (2016)
5. Haykin, S.O.: Neural Networks and Learning Machines, vol. 3. Pearson, Upper Saddle River (2009)
6. Hoyos-Osorio, K., Castañeda-Gonzalez, J., Daza-Santacoloma, G.: Automatic epileptic seizure prediction based on scalp EEG and ECG signals. In: 2016 XXI Symposium on Signal Processing, Images and Artificial Vision (STSIVA), pp. 1–7 (2016)
7. Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 220 (2000)
8. Kaggle: American epilepsy society seizure prediction challenge. https://www.kaggle.com/c/seizure-prediction. Accessed 25 Apr 2018
9. Kohonen, T.: The self-organizing map. Neurocomputing 21(1–3), 1–6 (1998)
10. Li, S., Zhou, W., Yuan, Q., Liu, Y.: Seizure prediction using spike rate of intracranial EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 21(6), 880–886 (2013)
11. Liang, J., Lu, R., Zhang, C., Wang, F.: Predicting seizures from electroencephalography recordings: a knowledge transfer strategy. In: 2016 IEEE International Conference on Healthcare Informatics (ICHI), pp. 184–191 (2016)
12. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999)
13. Park, Y., Luo, L., Parhi, K.K., Netoff, T.: Seizure prediction with spectral power of EEG using cost-sensitive support vector machines. Epilepsia 52(10), 1761–1770 (2011)
14. Parvez, M.Z., Paul, M.: Epileptic seizure prediction by exploiting spatiotemporal relationship of EEG signals using phase correlation. IEEE Trans. Neural Syst. Rehabil. Eng. 24(1), 158–168 (2016)
15. Pöllä, M., Honkela, T., Kohonen, T.: Bibliography of self-organizing map (SOM) papers: 2002–2005 addendum. Neural Computing Surveys (2009)
16. Shoeb, A.H.: Application of machine learning to epileptic seizure onset detection and treatment. Ph.D. thesis, Massachusetts Institute of Technology (2009)
17. Tafreshi, A.K., Nasrabadi, A.M., Omidvarnia, A.H.: Empirical mode decomposition in epileptic seizure prediction. In: 2008 IEEE International Symposium on Signal Processing and Information Technology, pp. 275–280 (2008)
18. Ultsch, A.: Self-organizing neural networks for visualisation and classification. In: Opitz, O., Lausen, B., Klar, R. (eds.) Information and Classification, pp. 307–313. Springer, Heidelberg (1993). https://doi.org/10.1007/978-3-642-50974-2_31
19. World Health Organization: Epilepsy. http://www.who.int/en/news-room/fact-sheets/detail/epilepsy. Accessed 25 Apr 2018

Classification of Bone Tumor on CT Images Using Deep Convolutional Neural Network

Yang Li1, Wenyu Zhou2, Guiwen Lv3, Guibo Luo1, Yuesheng Zhu1(✉), and Ji Liu1

1 Communication and Information Security Lab, Shenzhen Graduate School, Peking University, Shenzhen, China
2 Department of Spine Surgery, No. 1 Affiliated Hospital of Shenzhen University (Shenzhen No. 2 People's Hospital), Shenzhen, China
3 Department of Radiology, No. 1 Affiliated Hospital of Shenzhen University (Shenzhen No. 2 People's Hospital), Shenzhen, China

Abstract. Classification of bone tumors plays an important role in treatment. As manual diagnosis is inefficient, an automatic classification system can help doctors analyze medical images better. However, most existing methods cannot reach high classification accuracy on clinical images because of the high similarity between images. In this paper, we propose a super label guided convolutional neural network (SG-CNN) to classify CT images of bone tumor. Images with two hierarchical labels are fed into the network and learned by its two sub-networks, one of which learns the whole image while the other focuses on the lesion area to learn more details. To further improve classification accuracy, we also propose a multi-channel enhancement (ME) strategy for image preprocessing. Owing to the lack of a suitable public dataset, we introduce a CT image dataset of bone tumor. Experimental results on this dataset show that our SG-CNN and ME strategy clearly improve the classification accuracy.

Keywords: Bone tumor classification · Super label guided convolutional neural network · Multi-channel enhancement

1 Introduction

Bone tumors are tumors that occur in bones or their affiliated tissues. Bone tumors account for 2%–3% of all tumors, and their incidence has been rising in recent years [1]. In practice, bone tumors are not easy to detect accurately at an early stage and are difficult to cure completely at a later stage, when they are often treated with extreme surgical methods such as resection. To diagnose accurately, doctors often combine multiple methods, such as imaging and observation of the clinical manifestations. CT has been shown to be an effective imaging method [2]. To make diagnosis more efficient, the introduction of an effective computer-aided CT image diagnosis system is very meaningful.

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 127–136, 2018. https://doi.org/10.1007/978-3-030-01421-6_13


However, classification of bone tumors using CT images is a challenging task. We first tried the SVM [3] algorithm, but it did not work well. In recent years, deep learning algorithms have developed fast and have been shown to exceed human performance in some visual tasks. Deep convolutional neural networks (CNNs) show a great advantage in image classification. Many works introduce deep learning methods to the field of medical image analysis. For example, Esteva et al. [4] use GoogLeNet to categorize skin cancer images and reach dermatologist-level classification accuracy. Wang et al. [5] evaluate four classic CNN architectures, AlexNet [6], VGGNet [7], GoogLeNet [8] and ResNet [9], on the classification of thorax diseases. Other works [10–13] also show that CNNs have potential in processing clinical images. However, CNNs may perform worse on medical images than on natural images. For example, when GoogLeNet is used to categorize skin cancer images, the classification accuracy is only 55% on a nine-class disease partition [4], while the top-1 error rate of VGG-16 is 28% on ImageNet [14], which has 1000 image classes [9] (VGG-16 has similar performance to GoogLeNet, but [9] gives only the top-5 error rate of GoogLeNet).

In this paper, we apply CNNs to the classification of bone tumors on CT images. To the best of our knowledge, there is no suitable public dataset, so the first step is to build a CT image dataset of bone tumors. The dataset that we built contains 9 kinds of CT images of bone tumor, and every image in it has a super label and a fine-grained label. We then train CNNs with our dataset; experiments are run on AlexNet and a VGG-13 network to verify the performance of classic networks. However, the results are not good enough. To improve classification accuracy, we propose a super label guided convolutional neural network (SG-CNN) to classify bone tumor images. The network architecture can be seen as a fine-grained image classification network with two branches. We feed the network with images and their two hierarchical labels, without image annotations; the network then automatically crops the image under the guidance of the super label sub-network and generates a new image which is a copy of the lesion area. After this step, the background area of the global image is largely cut away, which makes the network focus more on the lesion area. The experimental results show that SG-CNN greatly improves the classification accuracy compared with generic CNNs. To further improve the classification accuracy, we also introduce a multi-channel enhancement (ME) strategy to preprocess the CT images of bone tumor. We use two morphological methods to preprocess the input image to enhance the contrast of the edges of the lesion, and then merge the original image and the two processed images into a three-channel image. The experimental results show that this strategy also improves the classification accuracy.

2 The Proposed Method

2.1 SG-CNN

There are many classic CNN models to choose from for categorization tasks, such as AlexNet and VGGNet. These networks perform well on natural image classification benchmarks like ImageNet. In natural images, the objects are usually near the center, and the difference between objects is obvious. But for fine-grained visual classification tasks like medical image categorization, classic CNNs cannot reach a high level of classification accuracy. To solve fine-grained classification problems, researchers usually introduce new CNN structures. For example, Wei et al. [15] propose a novel end-to-end Mask-CNN model based on the part annotations of images. Zhang et al. [16] propose a part-based R-CNN model for fine-grained categorization. Huang et al. [17] propose an architecture for fine-grained visual categorization that consists of a fully convolutional network to locate multiple object parts and a two-stream classification network that encodes object-level and part-level cues simultaneously, based on manually labeled strong part annotations. However, all these methods require image annotations, which makes building datasets very time-consuming. To tackle this issue, we design a new CNN that generates regions of interest (ROIs) automatically, without using image annotations.

Fig. 1. The SG-CNN structure. One raw CT image and its two labels are fed into the network, without any annotations. The input image is cropped under the guidance of a heat map created by one of the convolution layers and then input into the other network branch. The guide layer conv x for cropping the image can be any convolution layer. The output of the network includes two predicted labels.

The proposed SG-CNN framework is presented in Fig. 1. It is an end-to-end network: the input is a CT image with two labels in a hierarchical relationship, and the output contains two predicted labels. In our dataset, several different fine-grained labels share the same super label. We use both the super label and the fine-grained label to train SG-CNN and obtain a classification accuracy for each. In practice, we focus on the classification accuracy of the fine-grained label. The base network for building the sub-networks can be any CNN model; in this paper we choose AlexNet. SG-CNN has three components: the super label sub-network, the fine-grained label sub-network, and the connection between them. During training, images are first fed to the super label sub-network, and all feature maps of its guide convolution layer are summed to generate a heat map like those in Fig. 2. Because the background of CT images is less complex than that of natural images, the feature points in the heat map are distributed near the lesion area. In the heat map, red represents hot points whose values are large, and blue represents cold points whose values are small. We choose the hottest part of the heat map. Its center point is determined by Eq. 1, where X is the heat map and k is the radius of the hottest part. For each (2k + 1) × (2k + 1) heat map area, we sum up all its values, and then we choose the area with the largest sum.

$$H(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} X(x - i, y - j) \tag{1}$$

Fig. 2. Raw input image and heat maps. (a) is one input image. (b) is the heat map generated by conv1. (c) is the heat map generated by conv2. (d) is the heat map generated by conv3. The images show that with the network going deeper, heat map contains more abstract and semantic meanings. (Color ﬁgure online)
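The heat-map guided cropping of Sect. 2.1 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the use of an integral image, the restriction to windows lying fully inside the heat map, the proportional mapping from heat-map to image coordinates, and the clamping of the crop at the image border are all hypothetical. Only the channel sum, the (2k + 1) × (2k + 1) window sum of Eq. 1 and the 56 × 56 crop size come from the text.

```python
import numpy as np

def heat_map(feature_maps):
    """Sum all channels of a (C, H, W) activation tensor into one (H, W) map."""
    return np.asarray(feature_maps).sum(axis=0)

def hottest_center(heat, k):
    """Return the (row, col) whose (2k+1) x (2k+1) window sum (Eq. 1) is largest.

    Assumption: only windows that lie fully inside the heat map are considered.
    """
    size = 2 * k + 1
    # Integral image: integral[a, b] holds the sum of heat[:a, :b].
    integral = np.zeros((heat.shape[0] + 1, heat.shape[1] + 1))
    integral[1:, 1:] = heat.cumsum(axis=0).cumsum(axis=1)
    # sums[a, b] is the sum of heat[a:a+size, b:b+size].
    sums = (integral[size:, size:] - integral[:-size, size:]
            - integral[size:, :-size] + integral[:-size, :-size])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return int(r) + k, int(c) + k  # shift from window corner to window center

def crop_around(image, center, heat_shape, crop=56):
    """Map a heat-map point to image coordinates and cut a crop x crop patch."""
    scale_r = image.shape[0] / heat_shape[0]
    scale_c = image.shape[1] / heat_shape[1]
    cr = int((center[0] + 0.5) * scale_r)
    cc = int((center[1] + 0.5) * scale_c)
    # Assumption: clamp the patch so that it stays inside the image.
    top = min(max(cr - crop // 2, 0), image.shape[0] - crop)
    left = min(max(cc - crop // 2, 0), image.shape[1] - crop)
    return image[top:top + crop, left:left + crop]
```

The integral image is only a convenience: it gives every window sum in constant time, whereas a direct double loop over i and j would compute the same H(x, y) more slowly.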

Next we find the corresponding point in the original image and select a 56 × 56 image whose center is the point corresponding to the hottest point. This selection greatly reduces background interference. We then send the selected image to the fine-grained label sub-network. In this sub-network, the fc8 layer is connected not only to its own fc7 layer but also to the fc7 layer of the super label sub-network. SG-CNN also applies deep learning techniques such as dropout [6] and batch normalization [18] to improve generalization.

The inspiration for SG-CNN comes from the way humans think: start with a rough sketch, then pay more attention to details, and finally make a comprehensive judgment. In a CT image of a bone tumor, the lesion takes up only a limited part of the image. After we crop the image, most of the background is removed, so the network pays more attention to the lesion area. The localization accuracy of the cropped area is determined by the super label sub-network. Any convolutional layer can be selected as the guide layer for extracting the cropped image. Finally, the network outputs the predicted fine-grained label, whose classification accuracy is determined by both network branches together.

2.2 Multi-channel Enhancement

To further improve the classification accuracy of CNNs, we propose an image preprocessing method that makes the lesion area more distinct. We apply dilation and erosion to each image and combine the two processed images and the original image into a three-channel image. Because the lesion and its surroundings are in high contrast in CT images of bone tumors, one of the new images expands the border between the lesion and the normal area, so CNNs can locate the lesion more easily. In Fig. 3, for example, the lesion area is black, and dilation expands the border area and makes it more conspicuous. For images with a white lesion area, erosion has the same effect. The complete steps of the multi-channel enhancement strategy are shown in Fig. 3.

Fig. 3. Full steps of multi-channel enhancement. First, we randomly crop a 224 × 224 image from the original 256 × 256 image. We then apply dilation and erosion to the crop separately. Finally, we merge the two processed images and the original image into a three-channel image.
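The steps in Fig. 3 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the paper does not state the structuring element, the border handling or the channel order, so the 3 × 3 square window, the edge padding and the (original, dilated, eroded) order used here are assumptions, and all function names are hypothetical.

```python
import numpy as np

def _neighborhood_stack(image, size):
    """Stack every shifted copy of `image` inside a size x size window.

    Assumes `size` is odd; borders are handled by repeating edge pixels.
    """
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    return np.stack([padded[i:i + h, j:j + w]
                     for i in range(size) for j in range(size)])

def dilate(image, size=3):
    """Grayscale dilation: each pixel becomes the maximum of its neighborhood."""
    return _neighborhood_stack(image, size).max(axis=0)

def erode(image, size=3):
    """Grayscale erosion: each pixel becomes the minimum of its neighborhood."""
    return _neighborhood_stack(image, size).min(axis=0)

def multi_channel_enhance(image, crop=224, rng=None):
    """Randomly crop, then stack (original, dilated, eroded) as three channels."""
    rng = np.random.default_rng() if rng is None else rng
    top = rng.integers(0, image.shape[0] - crop + 1)
    left = rng.integers(0, image.shape[1] - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    return np.stack([patch, dilate(patch), erode(patch)], axis=-1)
```

For a 256 × 256 grayscale slice `ct`, `multi_channel_enhance(ct)` returns an array of shape (224, 224, 3) that can be fed to a network expecting three-channel input.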

3 Experiments

3.1 Dataset

The CT images used in this paper were obtained from Shenzhen No. 2 People's Hospital. The data were collected from patients diagnosed with bone tumors from 2014 to 2017. The original data are stored in DICOM format, which contains image data, patient information and tags. We extract the image data and resize each image to 256 × 256. In this paper, we use CT images in 2D form, which means we classify the bone tumor from a single CT slice. Because CT is a continuous scan, not every image in a sequence shows the lesion features clearly. The different kinds of images are shown in Fig. 4. To address this issue, we pick only images that clearly show the features of the lesion. After all the steps above, we obtain suitable CT images in JPG format.


Fig. 4. (a)–(d) show the lesion features clearly, so we select CT images like these. (d)–(f) are from the same CT sequence, but (e) and (f) show the lesion features less clearly than (d), so we delete these two images. In this way, we get 2D CT images with clear lesion areas.

When training CNNs, a uniform distribution across the image classes is crucial. Although bone tumors can occur throughout the body, their incidence differs significantly between organs. Collecting more bone tumor CT images to balance the distribution is not easy, owing, for instance, to the high cost in money and time of cooperating with other institutions. We therefore choose only CT images of limbs to build the dataset, and we finally obtain 6422 CT slices of bone tumors. The diagnosis of each image is confirmed by two or more doctors (including orthopedic surgeons and imaging doctors) and finalized with clinical manifestations; we use these diagnoses as the labels of the CT images. There are over two hundred kinds of bone tumors, but most of them have a low incidence. Because it is difficult to get enough data for all kinds of bone tumors, we analyze only 9 classes in this work. It is worth noting that the degree of similarity among the nine diseases varies. Diseases with high similarity can share a super label. We use two super label schemes, one being benign tumor and malignant tumor, the other being cartilage tumor, osteogenic tumor, and other tumor. Based on WHO2012, we develop a two-rank classification strategy as shown in Table 1.

Table 1. The bone tumor classification strategy

| | Benign tumor | Malignant tumor |
|---|---|---|
| Cartilage tumor | Endogenous chondroma, Osteochondroma, Synovial chondroma | Chondrosarcoma |
| Osteogenic tumor | Osteoid osteoma, Osteoma | Osteosarcoma |
| Other tumor | Fibrous bone tumor | Giant cell tumor of bone |

When training, we randomly divide the dataset: 75% of the images are used for training, and the rest are used for testing.

3.2 Experiments and Results

In this work, we use TensorFlow 1.0 as our CNN programming framework and run the code on a desktop PC equipped with an Intel i7-6700K CPU, an NVIDIA TITAN X (Pascal) GPU and the Ubuntu 16.04 operating system. We first try traditional machine learning methods to classify the CT images in our dataset. We use the HOG [19] algorithm to extract image features, PCA [20] to reduce the feature dimensionality to 100, and finally an SVM classifier with an RBF kernel to classify the images. Then we run experiments with CNN methods. It is well known that deep CNNs require a great deal of training data, and a small number of images can lead to overfitting. Before we feed images into the network, we therefore use data augmentation to preprocess the input images [6]. As for the training mode, there are currently three major techniques that successfully apply CNNs to medical image classification [21]: (1) training the CNN from scratch; (2) using off-the-shelf CNN features (without retraining the CNN) as complementary information channels to existing handcrafted image features; and (3) performing pre-training on natural or medical images and fine-tuning on medical target images. A key factor in the choice of training strategy is the size of the dataset. In this paper, when training SG-CNN, we first use a pre-trained CNN model to initialize the network.

In practice, we first run our experiment with AlexNet on our dataset. Next, to see the performance of a deeper classification network, we use the VGG-13 network to classify the images. Then we measure the performance of our proposed SG-CNN. Later we add the multi-channel enhancement method. Moreover, we run experiments with two sets of super labels to test how the choice of super label scheme affects classification accuracy. Additionally, we test how the choice of the layer that generates the heat map influences classification accuracy. We use the top-k error rate to evaluate our strategies.

The results are shown in Table 2. The table shows that all deep learning methods perform better than the traditional machine learning method. For the classic CNNs, the error rate declines as the network gets deeper, but the best top-1 error rate is still 44%, which is high. Our proposed SG-CNN significantly outperforms the VGG-13 network and AlexNet.
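For reference, the top-k error rate counts a test image as an error when its true label is not among the k classes with the highest predicted scores. The following NumPy sketch shows one way to compute it; the function name and the array layout (one row of class scores per image) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def top_k_error(scores, labels, k=1):
    """Fraction of samples whose true label is not among the k top-scoring classes.

    scores: (n_samples, n_classes) array of class scores or probabilities.
    labels: (n_samples,) array of integer class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order inside the k is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()
```

With nine fine-grained classes, `top_k_error(scores, labels, k=2)` corresponds to the top-2 error rate reported in the tables, expressed as a fraction rather than a percentage.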

Table 2. Comparison of four methods

| Methods | Top-1 error rate (%) | Top-2 error rate (%) |
|---|---|---|
| HOG+PCA+SVM | 76 | 59 |
| AlexNet | 54 | 43 |
| VGG-13 | 44 | 21 |
| SG-CNN | 28 | 5 |

The comparison of the two super label schemes is shown in Table 3. When the 2-class super label scheme is replaced by the 3-class scheme, the top-1 error rate of super label classification rises by 10 percentage points, while the top-1 error rate of fine-grained label classification falls by 3 percentage points and the top-2 error rate falls by 1 percentage point. We suppose that the classification accuracy of the fine-grained label is determined not only by the classification accuracy of the super label but also by the number of super labels. In practice, optimizing only one factor may not improve the classification accuracy of the fine-grained label.

Table 3. Comparison of two super label schemes for SG-CNN

| Strategies | Top-1 error rate of super label (%) | Top-1 error rate of fine-grained label (%) | Top-2 error rate of fine-grained label (%) |
|---|---|---|---|
| 2-class super label | 7 | 28 | 5 |
| 3-class super label | 17 | 25 | 4 |

The error rates for different heat map generation layers are compared in Table 4. As the network goes deeper, the heat map becomes more abstract and shows less edge information, as in Fig. 2. Table 4 shows that conv1 has a high error rate and that the later layers perform better than conv1. However, conv3 to conv5 do not perform clearly better than conv2; we assume the reason is that the feature map gets smaller as the layers go deeper.

Table 4. Comparison of five cropping image generation layers

| Selected layers | Top-1 error rate (%) | Top-2 error rate (%) |
|---|---|---|
| Conv1 | 35 | 16 |
| Conv2 | 28 | 5 |
| Conv3 | 29 | 6 |
| Conv4 | 30 | 6 |
| Conv5 | 28 | 8 |

Finally, we test our ME strategy. Table 5 shows that multi-channel enhancement further reduces the top-1 error rate on every network and the top-2 error rate on AlexNet and VGG-13, while the top-2 error rate of SG-CNN stays at 5%. The results show that our ME strategy is useful for improving the performance of CNNs.

Table 5. Performance tests for ME strategy

| Methods | Top-1 error rate (%) | Top-2 error rate (%) |
|---|---|---|
| AlexNet | 54 | 43 |
| AlexNet+ME | 49 | 28 |
| VGG-13 | 44 | 21 |
| VGG-13+ME | 37 | 13 |
| SG-CNN | 28 | 5 |
| SG-CNN+ME | 26 | 5 |

4 Conclusion and Future Work

In this paper, we have presented a novel end-to-end fine-grained classification network named SG-CNN and a multi-channel image enhancement strategy. Moreover, we built a bone tumor CT image dataset based on the WHO2012 standard. On our dataset, we compared the experimental performance of SVM, AlexNet, the VGG-13 network, and our SG-CNN. The experimental results show that our proposed SG-CNN significantly outperforms SVM and classic CNNs. Additionally, our multi-channel enhancement strategy achieves higher accuracy. Among all experimental results, the lowest top-1 error rate is 25% and the lowest top-2 error rate is 4%. In future work, we will focus on obtaining more image data, 3D CNN modeling, and MRI image recognition.

Acknowledgements. This work was supported in part by the Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing), and by Shenzhen International cooperative research projects GJHZ20170313150021171.

References

1. Zhang, Y.J., Cui, X.F., Li, C.C., Li, S.J.: Efficacy of DR, CT and MRI in bone tumors. Chinese-German J. Clin. Oncol. 13(4), 181–184 (2014)
2. Keidar, Z., Israel, O., Krausz, Y.: SPECT/CT in tumor imaging: technical aspects and clinical applications. Semin. Nucl. Med. 33(3), 205 (2003)
3. Sánchez A, V.D.: Advanced support vector machines and kernel methods. Neurocomputing 55(1), 5–20 (2003)
4. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017)
5. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. arXiv:1705.02315 (2017)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, pp. 1097–1105. Curran Associates Inc. (2012)
7. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
8. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
10. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S.: Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans. Med. Imaging 35(5), 1207–1216 (2016)
11. Li, Q., Cai, W., Wang, X., Zhou, Y.: Medical image classification with convolutional neural network. In: International Conference on Control Automation Robotics and Vision, pp. 844–848. IEEE (2016)
12. Miki, Y., et al.: Classification of teeth in cone-beam CT using deep convolutional neural network. Comput. Biol. Med. 80(C), 24–29 (2017)
13. Kumar, P., Grewal, M., Srivastava, M.M.: Boosted cascaded convnets for multilabel classification of thoracic diseases in chest radiographs. arXiv:1711.08760 (2017)
14. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: a large-scale hierarchical image database. In: Computer Vision and Pattern Recognition (CVPR), pp. 248–255. IEEE (2009)
15. Wei, X.S., Xie, C.W., Wu, J.: Mask-CNN: localizing parts and selecting descriptors for fine-grained image recognition. In: Conference and Workshop on Neural Information Processing Systems (NIPS) (2016). arXiv:1605.06878
16. Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_54
17. Huang, S., Xu, Z., Tao, D., Zhang, Y.: Part-stacked CNN for fine-grained visual categorization. In: Computer Vision and Pattern Recognition, pp. 1173–1182. IEEE (2016)
18. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015)
19. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893. IEEE Computer Society (2005)
20. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisc. Rev. Comput. Stat. 2(4), 433–459 (2010)
21. Hoochang, S., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285 (2016)

DSL: Automatic Liver Segmentation with Faster R-CNN and DeepLab

Wei Tang, Dongsheng Zou(✉), Su Yang, and Jing Shi

Chongqing University, Chongqing10611, CN, No. 174 Shazheng Street, Shapingba District, Chongqing, China
{weitang,dszou,yangsu,shijing}@cqu.edu.cn

Abstract. Liver segmentation is a crucial step in computer-assisted diagnosis and surgical planning for liver diseases. However, it remains a challenging task for four reasons. First, the grayscale of the liver is similar to that of its adjacent organ tissues. Second, the partial volume effect blurs the liver contour. Third, most clinical images show serious pathology such as liver tumors. Fourth, liver shape differs from person to person. In this paper, we propose the DSL (detection and segmentation laboratory) method based on Faster R-CNN (faster regions with CNN features) and DeepLab. DSL consists of two steps. First, to reduce the scope of the subsequent liver segmentation, Faster R-CNN is employed to detect the liver area. Next, the detection results are input to DeepLab for segmentation. This work is evaluated on two datasets: 3Dircadb and MICCAI-Sliver07. Compared with state-of-the-art automatic methods, our approach achieves better performance in terms of VOE, RVD, ASD and total score.

Keywords: Faster R-CNN · DeepLab · Detection · Segmentation

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 137–147, 2018. https://doi.org/10.1007/978-3-030-01421-6_14

1 Introduction

Liver disease seriously endangers the health of people worldwide. As reported in 2015, the number of people suffering from liver disease worldwide reached 1.3 billion, including about 500 million in Europe and the United States. At present, non-alcoholic liver disease affects one-third of the world's adults, or about one billion people [25]. Liver disease is one of the main causes of premature death, and liver surgery is one of the main treatments for common benign and malignant liver diseases. Liver segmentation is a fundamental and essential step in computer-assisted diagnosis and surgical planning for liver disease. Manual segmentation is very time-consuming, tedious and poorly reproducible, because of the high similarity between liver tissue and its adjacent organs and the differences between livers and lesions. An automatic liver segmentation method therefore promises to reduce the burden of manual segmentation and avoid the subjectivity of experts.

Medical image segmentation has attracted more and more attention for its role in improving the accuracy and efficiency of diagnosis and treatment. Automatic liver segmentation is a key prerequisite for tasks such as living donor liver transplantation, 3D reconstruction of medical images, and 3D positioning in radiotherapy planning. In general, there are four reasons why liver segmentation is highly challenging, as shown in Fig. 1. First, the liver has a similar intensity to its surrounding organs, such as the heart and stomach. Second, most clinical images show serious pathology, such as large tumors and cirrhosis, which should be segmented as part of the liver but whose intensity differs significantly from that of normal liver. Third, each person's liver differs in shape. Fourth, the partial volume effect blurs the liver contour. Many methods have been used for liver segmentation and are reviewed in [2]. However, to the best of our knowledge, existing methods have difficulty segmenting livers that are small or have complex contours.

Fig. 1. Four challenges in liver segmentation.

In this paper, we propose a fully automatic liver segmentation method using Faster R-CNN and DeepLab. This work makes three main contributions, which are shown experimentally to have substantial practical merit. First, Faster R-CNN and DeepLab are combined for the first time and applied to liver segmentation, with good results. Second, we address the high similarity between liver tissue and its adjacent organs by first detecting the liver area. Third, we can segment livers that are small or have complex contours, a capability not found in existing methods. Our DSL method achieves promising performance on liver segmentation with respect to VOE, RVD, ASD and total score.


2 Related Work

In this section, we briefly introduce previous work on liver segmentation. According to whether human interaction is required, we categorize previous work into interactive segmentation methods [6, 21, 22], semi-automatic segmentation methods [6, 14, 27] and automatic segmentation methods [1, 11, 28].

Interactive liver segmentation often gives better results than automatic and semi-automatic segmentation, because it is fully controlled by the researcher. But the interaction is very frequent and the workload is the largest. Dong et al. [7] proposed an interactive liver segmentation method based on random walks and a narrow band threshold, which uses minimal guidance to segment the liver.

Semi-automatic segmentation can better segment target contours as the researcher intends, and it is more stable, but the workload is slightly larger. Yang et al. [29] proposed a classic hybrid semi-automatic segmentation method consisting of a customized fast-marching level-set method and a threshold-based level-set method. Liao [18] presented an efficient liver segmentation method based on graph cut and bottleneck detection using intensity, local context and the spatial correlation of adjacent slices.

Automatic segmentation mainly includes region growing based methods, rule based methods, graph cut based methods, statistical shape model based methods and convolutional neural network based methods. Gambino et al. [9] proposed an automatic texture based volumetric region growing method. Subsequently, Li et al. [17] used the graph cut method to effectively integrate the properties and correlations of the input image and the initialized surface. The graph cut method worked very well for segmenting larger CT images. To date, deep convolutional neural networks (DCNNs) have dominated many computer vision tasks such as classification, detection and segmentation. In recent years, DCNNs [20, 28] have been widely used in liver segmentation. Lu et al. [20] combined a convolutional neural network with the graph cut method. Yang et al. [28] put forward an Adversarial Image-to-Image Network for automatic liver segmentation. Although many automatic methods have been used to segment the liver, their metrics can still be improved. In this paper, we explore a novel DSL method that greatly improves the metrics.

3 Method

3.1 Overview

An overview of the proposed DSL method is given in Fig. 2. The framework is based on Faster R-CNN and DeepLab. We divide the procedure into two parts: training and testing. In the training part, we first manually annotate the CT volume images with bounding boxes that accurately mark the position of the liver. Faster R-CNN is then trained on the annotated training images. Meanwhile, DeepLab is trained on images in which the pixel values outside the annotated bounding box are set to zero. In the testing part, the test images are input into the trained Faster R-CNN to get the detection results. Then the images in which the pixel values outside the detected bounding box are set to zero are input into the trained DeepLab to obtain the liver segmentation results.

Fig. 2. Overview of the proposed framework.
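The step that links detection and segmentation, setting every pixel outside the bounding box to zero, can be sketched as below. The function name and the box convention (x_min, y_min, x_max, y_max) in pixel coordinates with exclusive upper bounds are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def mask_outside_box(image, box):
    """Return a copy of `image` with every pixel outside `box` set to zero.

    box: (x_min, y_min, x_max, y_max) in pixels, upper bounds exclusive
    (a hypothetical convention chosen for this sketch).
    """
    x_min, y_min, x_max, y_max = box
    masked = np.zeros_like(image)
    masked[y_min:y_max, x_min:x_max] = image[y_min:y_max, x_min:x_max]
    return masked
```

During training the box would come from the manual annotation, and during testing from the Faster R-CNN detection, so that the segmentation network always sees the same kind of masked input.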

3.2 Liver Detection Based on Faster R-CNN

Faster R-CNN [23] was proposed to reduce the computational burden of proposal generation. It improves on Fast R-CNN [10], which was itself developed from R-CNN. Faster R-CNN has evolved into a powerful framework for computer vision and achieves state-of-the-art accuracy in object detection. The procedure of liver detection using Faster R-CNN is shown in Fig. 3. First, we input the test CT volume images. Second, the entire image is fed into a CNN to extract features. VGG-16, which represents the image more accurately and saves space, is adopted as the base network. Third, we use a region proposal network (RPN) to generate three hundred region proposals for each liver image. The RPN predicts these proposals relative to several anchors at each sliding-window position. Fourth, the region proposals are mapped onto the last convolutional feature map of the CNN. Fifth, each region of interest (ROI) produces a fixed-size feature map through the ROI pooling layer. Finally, classification probability and bounding box regression are jointly trained with softmax loss and smooth L1 loss. The Faster R-CNN loss function is

Fig. 3. The procedure of detecting liver using Faster R-CNN.

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{1}$$

where $i$ is the index of an anchor and $p_i$ denotes the predicted probability that anchor $i$ is the liver. $p_i^*$ is the ground-truth label, whose value is 0 or 1. $t_i$ is a vector of the four parameterized coordinates of the predicted bounding box, and $t_i^*$ is the coordinate vector of the ground-truth bounding box associated with a positive anchor. The two terms are normalized by $N_{cls}$ and $N_{reg}$ and balanced by the weight $\lambda$. The classification loss $L_{cls}(p_i, p_i^*)$ is the log loss over two classes (liver vs. not liver) and is computed using Eq. 2.

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right] \tag{2}$$

The regression loss $L_{reg}(t_i, t_i^*)$ is

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{3}$$

where $R$ represents the smooth L1 function. $\{p_i\}$ and $\{t_i\}$ are the outputs of the classification layer and the regression layer, respectively. In this paper, we use Faster R-CNN to detect the liver and obtain good performance, as shown in Fig. 4.
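As a worked illustration of Eqs. 1 to 3, the loss can be evaluated with NumPy for a small batch of anchors. This is a sketch under assumptions rather than the training code used in the paper: the function names are hypothetical, the smooth L1 definition (0.5x² for |x| < 1, |x| − 0.5 otherwise) is the one from Fast R-CNN [10] and is not restated in this paper, and the defaults for λ, N_cls and N_reg are placeholders.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 as defined in Fast R-CNN: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def detection_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Evaluate Eq. 1 for arrays of anchors.

    p:      (n,) predicted liver probabilities.
    p_star: (n,) ground-truth labels, 0 or 1.
    t, t_star: (n, 4) predicted and ground-truth box parameters.
    lam, n_cls, n_reg: placeholder defaults (assumptions), not the paper's values.
    """
    p = np.asarray(p, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    t = np.asarray(t, dtype=float)
    t_star = np.asarray(t_star, dtype=float)
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg
    # Eq. 2: log loss over two classes; clipping avoids log(0).
    likelihood = p_star * p + (1.0 - p_star) * (1.0 - p)
    l_cls = -np.log(np.clip(likelihood, 1e-12, None))
    # Eq. 3: smooth L1 summed over the four box coordinates.
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # Eq. 1: the factor p_star keeps the regression term for positive anchors only.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

Because the regression term is multiplied by the ground-truth label, anchors labeled 0 contribute only to the classification loss, whatever their predicted box coordinates are.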

Fig. 4. Detection results by Faster R-CNN.

3.3 Liver Segmentation Based on DeepLab

In this section, we address the key aspects of DeepLab V2, which was developed from DeepLab V1 [3]. More technical detail can be found in the original paper [4]. To deal with reduced feature resolution, DeepLab V2 introduces atrous convolution. Atrous convolution has several advantages. It recovers full-resolution feature maps, whose resolution is otherwise reduced by the repeated combination of max pooling and downsampling. It also effectively enlarges the field of view of filters without increasing the number of parameters, which is exploited in subsequent convolution layers. With rate = k, atrous convolution inserts k − 1 zeros between consecutive filter values, so the filter covers a larger area than a traditional convolution. However, atrous convolution also has shortcomings: its computational cost is relatively high, and processing a large number of high-resolution feature maps consumes a lot of memory. DeepLab V2 therefore takes a compromise approach in which some feature maps use bilinear interpolation and the others use atrous convolution.

To take information at different scales into account, the most direct method is to feed rescaled versions of the same image into the DCNN and then combine the resulting feature maps to generate the final results. This performs well, but it is cumbersome and time-consuming. DeepLab V2 therefore uses atrous spatial pyramid pooling (ASPP), which directly extracts multi-scale information from the original input image. ASPP uses multiple parallel atrous convolutional layers with different sampling rates. The features extracted at each sampling rate are further processed in separate branches and merged to produce the final results.
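To make the effect of the sampling rate concrete, the following one-dimensional NumPy sketch applies a filter whose taps are spaced `rate` samples apart. It is an illustration only, with hypothetical function names, no padding, and correlation-style indexing as is usual in CNNs; DeepLab itself operates on 2D feature maps. A filter with n taps and rate r spans n + (n − 1)(r − 1) input samples while keeping n parameters.

```python
import numpy as np

def effective_kernel_size(kernel_size, rate):
    """Input span covered by a filter with `kernel_size` taps at a given rate."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

def atrous_conv1d(signal, kernel, rate=1):
    """1D atrous (dilated) convolution without padding.

    Each output value combines samples that are `rate` positions apart, which
    is equivalent to inserting rate - 1 zeros between the filter taps.
    """
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = effective_kernel_size(len(kernel), rate)
    out_len = len(signal) - span + 1
    return np.array([
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(out_len)
    ])
```

Running several such filters with different rates on the same input and merging their outputs is the idea behind ASPP: each rate sees the input at a different scale, without rescaling the image itself.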

4 Experiment and Analysis

Faster R-CNN training is divided into four stages that alternate between the two networks: training the region proposal network (RPN), training VGG-16, training the RPN again and training VGG-16 again. The learning rate of each stage is set to 0.001. We run the stochastic gradient descent (SGD) solver for 70000 iterations in the RPN training stages and 50000 iterations in the VGG-16 training stages. We fine-tune the model weights of DeepLab, which is composed of VGG-16, atrous convolution and a fully connected CRF, to adapt them to the segmentation task, following the procedure of [4]. We replace the 1000-way ImageNet classifier in the last layer of VGG-16 with a two-class classifier (background and liver) and run the SGD solver for 100000 iterations with a base learning rate of 0.001. We evaluate the proposed method on the 3Dircadb and MICCAI-Sliver07 data sets, which are well-known challenge datasets. We first report the main results on MICCAI-Sliver07 and then the results on 3Dircadb.

4.1 MICCAI-Sliver07

Five metrics are calculated as in [13]: Volumetric Overlap Error (VOE), Relative Volume Difference (RVD), Average Symmetric Surface Distance (ASD), Root Mean Square Symmetric Surface Distance (RMSD) and Maximum Symmetric Surface Distance (MSD). The scores for VOE, RVD, ASD, RMSD and MSD are 80.2, 94.1, 80.5, 76.4 and 71.9, respectively. Table 1 compares the proposed method with eight other fully automatic liver segmentation methods [1, 12, 15, 17, 19, 20, 24, 26] on the MICCAI-Sliver07 test set. The proposed method is clearly better than DeepLab alone. Figure 5 presents the segmentation results of the proposed method and of DeepLab alone. As can be seen, small and complex livers are segmented successfully. The proposed method achieves a total score of 80.6, surpassing all the compared methods. The reasons why the proposed method achieves much better performance are these:

DSL: Automatic Liver Segmentation with Faster R-CNN and DeepLab


Table 1. Comparison with other state-of-the-art methods on the MICCAI-Sliver07 test set. Each Score column gives the score of the metric to its left.

Method                  VOE (%)  Score  RVD (%)  Score  ASD (mm)  Score  RMSD (mm)  Score  MSD (mm)  Score  Total score
Li et al. [17]          6.24     –      1.18     –      1.03      –      2.11       –      18.82     –      –
Al-Shaikhli et al. [1]  6.44     74.9   1.53     89.7   0.95      76.3   1.58       78.1   15.92     79.1   79.6
Kainmüller et al. [15]  6.09     76.2   −2.86    84.7   0.95      76.3   1.87       74     18.69     75.4   77.3
Wimmer et al. [26]      6.47     74.7   1.04     86.4   1.02      74.5   2          72.3   18.32     75.9   76.8
Linguraru et al. [19]   6.37     75.1   2.26     85     1         74.9   1.92       73.4   20.75     72.7   76.2
Heimann et al. [12]     7.73     69.8   1.66     87.9   1.39      65.2   3.25       54.9   30.07     60.4   67.6
Saddi et al. [24]       8.91     65.2   1.21     80     1.52      61.9   3.47       51.8   29.27     61.5   64.1
Lu et al. [20]          5.9      77     2.7      85.6   0.91      77.3   1.88       73.8   18.94     75.1   77.8
Only DeepLab            6.38     75.1   2.14     87.1   1.05      73.8   2.24       68.9   24.04     68.4   74.7
The proposed            5.06     80.2   −0.09    94.1   0.78      80.5   1.7        76.4   23.42     71.9   80.6

First, Faster R-CNN detects the liver area, which narrows the region for the subsequent liver segmentation and avoids the difficulty that the liver and its adjacent organ tissues have similar gray levels. Second, because DeepLab is a method of semantic

Fig. 5. Example of liver segmentation results with the ground truth in green. The result of the proposed method is in red and the result of DeepLab alone is in blue. (Color figure online)


W. Tang et al.

Table 2. Comparison with other state-of-the-art methods on the 3Dircadb data set.

3Dircadb          VOE [%]        RVD [%]        ASD [mm]     RMSD [mm]    MSD [mm]
Chung et al. [5]  12.99 ± 5.04   −5.66 ± 5.59   2.24 ± 1.08  –            25.74 ± 8.85
Kirschner [16]    –              −3.62 ± 5.50   1.94 ± 1.10  4.47 ± 3.30  34.60 ± 17.70
Li et al. [17]    9.15 ± 1.44    −0.07 ± 3.64   1.55 ± 0.39  3.15 ± 0.98  28.22 ± 8.31
Erdt et al. [8]   10.34 ± 3.11   1.55 ± 6.49    1.74 ± 0.59  3.51 ± 1.16  26.83 ± 8.87
Lu et al. [20]    9.36 ± 3.34    0.97 ± 3.26    1.89 ± 1.08  4.15 ± 3.16  33.14 ± 16.36
The proposed      8.67 ± 0.815   0.57 ± 2.53    1.37 ± 0.41  4.15 ± 3.16  27.01 ± 7.28

segmentation, serious pathologies in the images do not affect the results. Third, the fully connected CRF can delineate the liver contour, which addresses the blurring of the contour caused by the partial volume effect. All in all, the proposed method outperforms the others; in particular, it obtains the highest scores for VOE, RVD and ASD.

4.2 3Dircadb

Table 2 compares our work with other state-of-the-art automatic segmentation methods [5, 8, 16, 17, 20] on the 3Dircadb data set. Our method achieves better performance than all the compared methods in terms of VOE, RVD and ASD. For the RMSD and MSD metrics, the methods of Chung et al. and Erdt et al. perform slightly better than ours. The segmentation result is shown in Fig. 6. Small livers and livers with complex contours are accurately segmented, and our proposed method performs better than

Fig. 6. Example of liver segmentation results with the ground truth in green. The result of the proposed method is in red and the result of DeepLab alone is in blue. (Color figure online)


DeepLab’s, because we ﬁrst use the Faster R-CNN to detect the liver area and then use DeepLab to segment. Overall, our proposed method achieves much better performance than the other compared methods.
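The two volume-based metrics in Tables 1 and 2 can be sketched as follows. This is a minimal illustration, assuming the usual definitions of VOE (one minus the Jaccard overlap) and signed RVD relative to the reference volume; segmentations are represented as sets of voxel coordinates and the function names are hypothetical. The surface-distance metrics (ASD, RMSD, MSD) are not covered.

```python
def voe_percent(segmentation, reference):
    """Volumetric overlap error: 100 * (1 - |A and B| / |A or B|)."""
    seg, ref = set(segmentation), set(reference)
    union = seg | ref
    if not union:
        return 0.0  # two empty volumes overlap perfectly
    return 100.0 * (1.0 - len(seg & ref) / len(union))


def rvd_percent(segmentation, reference):
    """Signed relative volume difference: 100 * (|A| - |B|) / |B|."""
    seg, ref = set(segmentation), set(reference)
    if not ref:
        raise ValueError("reference volume must not be empty")
    return 100.0 * (len(seg) - len(ref)) / len(ref)


# Toy 2D example: the reference is a 2 x 2 square, the segmentation misses one pixel.
reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
segmentation = {(0, 0), (0, 1), (1, 0)}
print(voe_percent(segmentation, reference), rvd_percent(segmentation, reference))
```

A negative RVD indicates under-segmentation and a positive one over-segmentation, so values close to zero are better, whereas VOE is zero only for a perfect overlap.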

5 Conclusion

In this paper, we proposed the DSL method for automatic liver segmentation in abdominal CT images. Specifically, to handle the high similarity between the liver and its adjacent tissues, Faster R-CNN is used to detect the liver region. The detection results are then input to DeepLab to segment the liver. The main advantage of our approach is that small livers and livers with complex contours can be accurately segmented. In addition, Faster R-CNN and DeepLab are combined for the first time and applied to a new setting, in which no manual feature extraction or user interaction is required during training and testing. Experimental results demonstrate the effectiveness of our method. Compared with state-of-the-art automatic liver segmentation methods, our proposed method ranks at the top according to the total score. In particular, its VOE, RVD and ASD scores are much higher than those of the other compared methods. In future work, we plan to study new liver segmentation algorithms to further improve our model.

References

1. Al-Shaikhli, S.D.S., Yang, M.Y., Rosenhahn, B.: Automatic 3D liver segmentation using sparse representation of global and local image information via level set formulation. Computer Science (2015)
2. Campadelli, P., Casiraghi, E.: Liver segmentation from CT scans: a survey. In: WILF, pp. 520–528 (2007)
3. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. Computer Science, pp. 357–361 (2014)
4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. (2016)
5. Chung, F., Delingette, H.: Regional appearance modeling based on the clustering of intensity profiles. Comput. Vis. Image Underst. 117(6), 705–717 (2013)
6. Dawant, B.M., Li, R., Lennon, B., Li, S.: Semi-automatic segmentation of the liver and its evaluation on the MICCAI 2007 grand challenge data set. In: Workshop on 3D Segmentation in the Clinic (2007)
7. Dong, C., et al.: A knowledge-based interactive liver segmentation using random walks. In: International Conference on Fuzzy Systems and Knowledge Discovery, pp. 1731–1736 (2015)
8. Erdt, M., Steger, S., Kirschner, M., Wesarg, S.: Fast automatic liver segmentation combining learned shape priors with observed shape deviation. In: IEEE 23rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 249–254 (2010)


9. Gambino, O., et al.: Automatic volumetric liver segmentation using texture based region growing. In: International Conference on Complex, Intelligent and Software Intensive Systems, pp. 146–152 (2010)
10. Girshick, R.: Fast R-CNN. In: IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
11. He, B., et al.: Fast automatic 3D liver segmentation based on a three-level AdaBoost-guided active shape model. Med. Phys. 43(5), 2421–2434 (2016)
12. Heimann, T., van Ginneken, B., Styner, M.A., et al.: Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Trans. Med. Imaging 28(8), 1251–1265 (2009)
13. Heimann, T., Meinzer, H.P., Wolf, I.: A statistical deformable model for the segmentation of liver CT volumes. In: MICCAI Workshop on 3D Segmentation in the Clinic (2010)
14. Jansen, J., Schreurs, R., Dubois, L., Maal, T.J.J., Gooris, P.J.J., Becking, A.G.: Orbital volume analysis: validation of a semi-automatic software segmentation method. Int. J. Comput. Assist. Radiol. Surg. 11(1), 11–18 (2015)
15. Kainmüller, D., Lange, T., Lamecker, H.: Shape constrained automatic segmentation of the liver based on a heuristic intensity model. In: MICCAI Workshop on 3D Segmentation in the Clinic, pp. 109–116 (2008)
16. Kirschner, M.: The probabilistic active shape model: from model construction to flexible medical image segmentation. Ph.D. thesis, Technischen Universität Darmstadt (2013)
17. Li, G., Chen, X., Shi, F., Zhu, W., Tian, J., Xiang, D.: Automatic liver segmentation based on shape constraints and deformable graph cut in CT images. IEEE Trans. Image Process. 24(12), 5315 (2015)
18. Liao, M., et al.: Efficient liver segmentation in CT images based on graph cuts and bottleneck detection. Physica Med. 32(11), 1383 (2016)
19. Linguraru, M.G., Richbourg, W.J., Watt, J.M., Pamulapati, V., Summers, R.M.: Liver and tumor segmentation and analysis from CT of diseased patients via a generic affine invariant shape parameterization and graph cuts. In: International MICCAI Workshop on Computational and Clinical Challenges in Abdominal Imaging, pp. 198–206 (2011)
20. Lu, F., Wu, F., Hu, P., Peng, Z., Kong, D.: Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int. J. Comput. Assist. Radiol. Surg. 12(2), 171–182 (2017)
21. Lu, J., Shi, L., Deng, M., Yu, S.C.H., Heng, P.A.: An interactive approach to liver segmentation in CT based on deformable model integrated with attractor force. In: International Conference on Machine Learning and Cybernetics, pp. 1660–1665 (2011)
22. Meena, S., Palaniappan, K., Seetharaman, G.: Interactive image segmentation using elastic interpolation. In: IEEE International Symposium on Multimedia, pp. 307–310 (2016)
23. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
24. Saddi, A.K., Rousson, M., Hotel, C.C., Cheriet, F.: Global-to-local shape matching for liver segmentation in CT imaging (2007)
25. Webster, N.J.G.: Alternative RNA splicing in the pathogenesis of liver disease. Front. Endocrinol. 8 (2017)
26. Wimmer, A., Soza, G., Hornegger, J.: A generic probabilistic active shape model for organ segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 26–33 (2009)
27. Yan, J., Schwartz, L.H., Zhao, B.: Semiautomatic segmentation of liver metastases on volumetric CT images. Med. Phys. 42(11), 6283–6293 (2015)


28. Yang, D., et al.: Automatic liver segmentation using an adversarial image-to-image network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 507–515 (2017)
29. Yang, X., et al.: A hybrid semi-automatic method for liver segmentation based on level-set methods using multiple seed points. Comput. Meth. Prog. Biomed. 113(1), 69–79 (2014)

Temporal Convolution Networks for Real-Time Abdominal Fetal Aorta Analysis with Ultrasound

Nicolò Savioli1(✉), Silvia Visentin2(✉), Erich Cosmi2(✉), Enrico Grisan1,3(✉), Pablo Lamata1(✉), and Giovanni Montana1,4(✉)

1 Department of Biomedical Engineering, Kings College London, London SE1 7EH, UK
{nicolo.l.savioli,enrico.grisan,pablo.lamata,giovanni.montana}@kcl.ac.uk
2 Department of Woman and Child Health, University Hospital of Padova, Padua, Italy
{silvia.visentin.1,erich.cosmi}@unipd.it
3 Department of Information Engineering, University of Padova, Padua, Italy
4 WMG, University of Warwick, Coventry CV4 7AL, UK

Abstract. The automatic analysis of ultrasound sequences can substantially improve the efficiency of clinical diagnosis. In this work we present our attempt to automate the challenging task of measuring the vascular diameter of the fetal abdominal aorta from ultrasound images. We propose a neural network architecture consisting of three blocks: a convolutional layer for the extraction of imaging features, a Convolution Gated Recurrent Unit (C-GRU) for enforcing the temporal coherence across video frames and exploiting the temporal redundancy of a signal, and a regularized loss function, called CyclicLoss, to impose our prior knowledge about the periodicity of the observed signal. We present experimental evidence suggesting that the proposed architecture can reach an accuracy substantially superior to previously proposed methods, providing an average reduction of the mean squared error from 0.31 mm² (state-of-art) to 0.09 mm², and a relative error reduction from 8.1% to 5.3%. The mean execution speed of the proposed approach of 289 frames per second makes it suitable for real time clinical use.

Keywords: Cardiac imaging · Diameter · Ultrasound · Convolutional networks · Fetal imaging · GRU · CyclicLoss

1 Introduction

Fetal ultrasound (US) imaging plays a fundamental role in the monitoring of fetal growth during pregnancy and in the assessment of fetal wellbeing. Growth monitoring is becoming increasingly important since there is epidemiological evidence that abnormal birth weight is associated with an increased predisposition to diseases related to cardiovascular risk (such as diabetes, obesity and hypertension) in young people and adults [1]. Among the possible biomarkers of adverse cardiovascular remodelling in fetuses and newborns, the most promising ones are the Intima-Media Thickness (IMT) and the stiffness of the abdominal aorta, both assessed by ultrasound examination. Obtaining reliable measurements depends critically on the accurate estimation of the diameter of the aorta over time. However, the poor signal-to-noise ratio of US data and fetal movement make the acquisition of a clear and stable US video challenging. Moreover, the measurements rely either on visual assessment at the bedside during patient examination, or on tedious, error-prone and operator-dependent review of the data and manual tracing at a later time. Very few attempts towards automated assessment have been presented [2,3], all of which have computational requirements that prevent their use in real time. As such, they have reduced appeal for clinical use.

In this paper we describe a method for automated measurement of the abdominal aortic diameter directly from fetal US videos. We propose a neural network architecture that is able to process US videos in real time and to leverage both the temporal redundancy of US videos and the quasi-periodicity of the aorta diameter. The main contributions of the proposed method are as follows. First, we show that a shallow CNN is able to learn imaging features and outperforms classical methods such as level sets for fetal abdominal aorta diameter prediction. Second, we add to the CNN a Convolution Gated Recurrent Unit (C-GRU) [15] to exploit the temporal redundancy of the features extracted by the CNN from the US video sequence. Finally, we add a new penalty term to the loss function used to train the CNN to exploit periodic variations.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 148–157, 2018. https://doi.org/10.1007/978-3-030-01421-6_15

2 Related Work

The interest in measuring the diameter and intima-media thickness (IMT) of major vessels stems from their importance as biomarkers of hypertension damage and atherosclerosis in adults. Typically, the IMT is assessed on the carotid artery by identifying its lumen and the different layers of its wall on high-resolution US images. Improvements have come from the design of semi-automatic and automatic methods based mainly on the analysis of the image intensity profile, distribution and gradients, and more recently on active contours. For a comprehensive review of these classical methods we refer the reader to [4] and [5]. In the prenatal setting, the lower image quality, caused by the need to image deeper in the mother's womb and by the movement of the fetus, makes the measurement of the IMT biomarker challenging, although it is measured on the abdominal aorta. Methods that proved successful for adult carotid image analysis do not perform well on such data, for which only a handful of methods (semi-automatic or automatic) have been proposed. These make use of classical tracing methods and mixture-of-Gaussians modelling of the blood-lumen and media-adventitia interfaces [2], or of level-set segmentation with additional regularizing terms linked to the specific task [3]. However, their sensitivity to image quality and their lengthy computation have prevented easy use in clinical routine.

Deep learning approaches have outperformed classical methods in many medical tasks [8]. The first attempt to use a CNN for the measurement of carotid IMT was made only recently [9]. In that work, two separate CNNs are used to localize a region of interest and then segment it to obtain the lumen-intima and media-adventitia regions. Further classical post-processing steps are then used to extract the boundaries from the CNN-based segmentation. The method assumes the presence of strong and stable gradients across the vessel walls, and extracts from the US sequence only the frames related to the same cardiac phase, identified from a concomitant ECG signal. However, exploiting the temporal redundancy of US sequences has been shown to improve the overall detection results for the fetal heart [11], where a CNN is coupled with a recurrent neural network (RNN). Other works propose a similar approach to detect standard planes in prenatal US data using a CNN with Long Short-Term Memory (LSTM) [10].

3 Datasets

This study makes use of a dataset consisting of 25 ultrasound video sequences acquired during routine third-trimester pregnancy check-ups at the Department of Woman and Child Health of the University Hospital of Padova (Italy). The local ethical committee approved the study and all patients gave written informed consent. Fetal US data were acquired using a US machine (Voluson E8, GE) equipped with a 5 MHz linear array transducer, according to the guidelines in [6,7], using a 70° FOV, an image dimension of 720 × 960 pixels, a variable resolution between 0.03 and 0.1 mm and a mean frame rate of 47 fps. Gain settings were tuned to enhance the visual quality and contrast during the examination. The length of each video is between 2 s and 15 s, ensuring that at least one full cardiac cycle is imaged. After the examination, the video of each patient was reviewed and a relevant video segment was selected for semi-automatic annotation considering its visual quality and length: all frames of the segment were processed with the algorithm described in [2], and then the diameters of all frames in the segment were manually reviewed and corrected. The length of the selected segments varied between 21 frames (0.5 s) and 126 frames (2.5 s). The 25 annotated segments in the dataset were then randomly divided into training (60% of the segments), validation (20%) and testing (20%) sets. In order to keep the computational and memory requirements low, each frame was cropped to a square aspect ratio and then resized to 128 × 128 pixels. The data supporting this research are [openly available].
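The split and preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the seed, the centred crop, the nearest-neighbour resize and the reading of 720 × 960 as rows × columns are not specified in the text, and all function names are hypothetical.

```python
import random


def split_segments(segment_ids, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Randomly split segment ids into training, validation and test lists."""
    ids = list(segment_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def center_square_box(height, width):
    """Top-left corner and side of the largest centred square (assumed crop)."""
    side = min(height, width)
    return (height - side) // 2, (width - side) // 2, side


def crop_and_resize(frame, out_size=128):
    """Centre-crop a frame (list of rows) to a square, then resize it
    with nearest-neighbour sampling."""
    top, left, side = center_square_box(len(frame), len(frame[0]))
    return [
        [frame[top + (r * side) // out_size][left + (c * side) // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


train, val, test = split_segments(range(25))
print(len(train), len(val), len(test))
print(center_square_box(720, 960))
```

Splitting at the level of whole segments, as the text describes, keeps frames of the same video from appearing in both the training and the test set.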

4 Network Architecture

Our output is the predicted value ŷ[t] of the diameter of the abdominal aorta at each time point. Our proposed deep learning solution consists of three main components (see Fig. 1): a Convolutional Neural Network (CNN) that captures the salient characteristics of the ultrasound input images; a Convolution Gated Recurrent Unit (C-GRU) [15] that exploits the temporal coherence through the sequence; and a regularized loss function, called CyclicLoss, that exploits the redundancy between adjacent cardiac cycles.

Our input consists of a set of sequences, where each sequence S = [s[1], ..., s[K]] is made of K frames and each frame s[t], with t ∈ {1, ..., K}, has dimension N × M pixels. At each time point t, the CNN extracts the feature maps x[t] of dimensions D × N_x × M_x, where D is the number of maps, and N_x and M_x are their in-plane pixel dimensions, which depend on the extent of dimensionality reduction obtained by the CNN through its pooling operators.

The feature maps are then processed by a C-GRU layer [15]. The C-GRU combines the current feature maps x[t] with an encoded representation h[t − 1] of the feature maps {x[1], ..., x[t − 1]} extracted at previous time points of the sequence to obtain an updated encoded representation h[t], the current state, at time t: this makes it possible to exploit the temporal coherence in the data. The state h[t] of the C-GRU layer is obtained through two specific gates designed to control the information inside the unit, a reset gate r[t] and an update gate z[t], defined as follows:

$$ r[t] = \sigma(W_{hr} * h[t-1] + W_{xr} * x[t] + b_r) \tag{1} $$

$$ z[t] = \sigma(W_{hz} * h[t-1] + W_{xz} * x[t] + b_z) \tag{2} $$

where σ(·) is the sigmoid function and the W terms are recurrent weight matrices whose first subscript letter refers to the input of the convolution operator (either the feature maps x[t] or the state h[t − 1]), and whose second subscript letter refers to the gate (reset r or update z). All these matrices have dimension D × 3 × 3, and each b term is a bias vector. In this notation, ∗ denotes the convolution operation. The current state is then obtained as:

$$ h[t] = (1 - z[t]) \odot h[t-1] + z[t] \odot \tanh(W_h * (r[t] \odot h[t-1]) + W_x * x[t] + b) \tag{3} $$

where ⊙ denotes the element-wise product and W_h and W_x are recurrent weight matrices for h[t − 1] and x[t], used to balance the new information represented by the feature maps x[t], derived from the current input data s[t], with the information obtained by observing the previous data s[1], ..., s[t − 1]. On the one hand, h[t] is passed on to update the state h[t + 1] at the next time point; on the other, it is flattened and fed into the last part of the network, built of Fully Connected (FC) layers that progressively reduce the input vector to a scalar output representing the current diameter estimate ŷ[t].
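The update in Eqs. (1)–(3) can be traced with a small, dependency-free sketch. It is a simplification made for illustration and not the authors' implementation: it uses a single feature map (D = 1), scalar biases, zero padding and 3 × 3 kernels stored in a hypothetical parameter dictionary.

```python
import math


def conv2d_same(image, kernel):
    """Single-channel 3x3 convolution (cross-correlation form), zero padded."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += kernel[di + 1][dj + 1] * image[ii][jj]
            out[i][j] = acc
    return out


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def zip_map(fn, *maps):
    """Apply fn element-wise across several maps of identical shape."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*maps)]


def cgru_step(x, h_prev, p):
    """One C-GRU update: returns h[t] given x[t] and h[t-1]."""
    # Eq. (1): reset gate
    r = zip_map(lambda a, b: sigmoid(a + b + p["b_r"]),
                conv2d_same(h_prev, p["W_hr"]), conv2d_same(x, p["W_xr"]))
    # Eq. (2): update gate
    z = zip_map(lambda a, b: sigmoid(a + b + p["b_z"]),
                conv2d_same(h_prev, p["W_hz"]), conv2d_same(x, p["W_xz"]))
    # Eq. (3): candidate state computed from the reset-gated previous state
    gated = zip_map(lambda a, b: a * b, r, h_prev)
    cand = zip_map(lambda a, b: math.tanh(a + b + p["b"]),
                   conv2d_same(gated, p["W_h"]), conv2d_same(x, p["W_x"]))
    return zip_map(lambda zv, hp, c: (1.0 - zv) * hp + zv * c, z, h_prev, cand)


zero_kernel = [[0.0] * 3 for _ in range(3)]
params = {name: zero_kernel for name in ("W_hr", "W_xr", "W_hz", "W_xz", "W_h", "W_x")}
params.update({"b_r": 0.0, "b_z": 0.0, "b": 0.0})
x_t = [[0.0] * 3 for _ in range(2)]
h_prev = [[2.0] * 3 for _ in range(2)]
print(cgru_step(x_t, h_prev, params))
```

With all weights at zero both gates equal 0.5 and the candidate state is zero, so the new state is half of the previous one; this makes the role of the update gate as an interpolation weight easy to see.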


Fig. 1. The deep-learning architecture proposed for abdominal aorta diameter prediction. The blue blocks represent the feature extraction through a CNN (AlexNet), which takes as input a US sequence S and provides for each frame s[t] a feature map x[t] that is passed to the Convolution Gated Recurrent Unit (C-GRU) (yellow circle), which encodes and combines the information from different time points to exploit the temporal coherence. The fully connected block (FC, in green) takes the current encoded state h[t] as features to estimate the aorta diameter ŷ[t]. (Color figure online)

4.1 CyclicLoss

Under the assumption that the pulsatility of the aorta follows a periodic pattern with the cardiac cycle, the diameter of the vessel at corresponding instants of the cardiac cycle should ideally be equal. Assuming a known cardiac period T_period, we propose to add a regularization term to the loss function used to train the network so as to penalize large differences between the diameter values estimated at time points that are one cardiac period apart. We call this regularization term CyclicLoss (CL), computed as the L2 norm between pairs of predictions at the same point of the heart cycle and from adjacent cycles:

$$ CL = \sum_{n=1}^{N_{cycles}} \sum_{t=0}^{T_{period}} \left\| \hat{y}[t + (n-1)T_{period}] - \hat{y}[t + nT_{period}] \right\|^{2} \tag{4} $$

Here T_period is the period of the cardiac cycle, N_cycles is the number of integer cycles present in the sequence and ŷ[t] is the estimated diameter at time t. T_period is determined through a peak detection algorithm applied to y[t], as the average of all peak-to-peak distances, while N_cycles is calculated as the total length of the y[t] signal divided by T_period. The loss to be minimized is therefore a combination of the classical mean squared error (MSE) with the CL, and the balance between the two is controlled by a constant λ:

$$ Loss = MSE + \lambda \cdot CL = \frac{1}{K} \sum_{t=1}^{K} (y[t] - \hat{y}[t])^{2} + \lambda \cdot CL \tag{5} $$

where y[t] is the target diameter at time point t. It is worth noting that knowledge of the period of the cardiac cycle is needed only during the training phase. During the test phase, on an unseen image sequence, the trained network provides its estimate blind to the periodicity of the specific sequence under analysis.
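A minimal sketch of the period estimation and of the regularized loss in Eqs. (4) and (5) follows. Several choices are assumptions made for illustration: the peak detector is a simple local-maximum rule (the text does not name the algorithm), the penalty is read as a sum of squared differences over all pairs of predictions one period apart that fall inside the sequence, and the function names are hypothetical.

```python
def estimate_period(y):
    """Average peak-to-peak distance of the target signal, in frames."""
    peaks = [t for t in range(1, len(y) - 1) if y[t - 1] < y[t] >= y[t + 1]]
    if len(peaks) < 2:
        raise ValueError("at least two peaks are needed to estimate a period")
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return round(sum(gaps) / len(gaps))


def cyclic_loss(y_hat, period):
    """Sum of squared differences between predictions one period apart."""
    return sum((y_hat[t] - y_hat[t + period]) ** 2
               for t in range(len(y_hat) - period))


def total_loss(y, y_hat, period, lam=1e-6):
    """Eq. (5): mean squared error plus lambda times the CyclicLoss."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return mse + lam * cyclic_loss(y_hat, period)


target = [0.0, 1.0, 0.0, -1.0] * 3          # toy periodic diameter signal
period = estimate_period(target)
prediction = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -1.0]
print(period, cyclic_loss(prediction, period))
```

In this sketch the period is estimated from the target signal only, which matches the remark that the network needs no periodicity information at test time.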

Fig. 2. Each panel (a–d) shows the estimation of the aortic diameter at each frame of fetal ultrasound videos in the test set, using the level set method (dashed purple line), the naive architecture using AlexNet (dashed orange line), AlexNet+C-GRU (dashed red line), and AlexNet+C-GRU trained with the CyclicLoss (dashed blue line). The ground truth (solid black line) is reported for comparison. Panels (a, c) show the results on long sequences where more than 3 cardiac cycles are imaged, whereas panels (b, d) show the results on short sequences where only 1 or 2 cycles are available. (Color figure online)

4.2 Implementation Details

For our experiments, we chose AlexNet [12] as a feature extractor for its simplicity. It has five hidden layers, with a kernel size of 11 × 11 in the first layer, 5 × 5 in the second and 3 × 3 in the last three layers; it is well suited to the low image contrast and diffuse edges characteristic of US sequences. Each network input for training is a sequence of K = 125 ultrasound frames with N = M = 128 pixels, AlexNet provides feature maps of dimension D × N_x × M_x = 256 × 13 × 13, and the final output ŷ[t] is the estimated abdominal aorta diameter at each frame. The loss function is optimised with the Adam algorithm [16], a first-order gradient-based technique. The learning rate is 1e−4, with 2125 iterations (calculated as number of patients × number of ultrasound sequences) for 100 epochs. To improve generalization, the input is augmented with random vertical and horizontal flips at each iteration. The constant λ used during training with CyclicLoss is set to 1e−6.
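The flip augmentation mentioned above can be sketched as follows. The flip probability of 0.5 and the choice to apply the same flips to every frame of a sequence are assumptions made for illustration, and `random_flip_sequence` is a hypothetical helper name.

```python
import random


def random_flip_sequence(frames, rng):
    """Randomly flip a sequence of frames (each a list of rows).

    The same vertical and horizontal flips are applied to every frame,
    so the motion between consecutive frames stays consistent.
    """
    flip_v = rng.random() < 0.5  # reverse the order of the rows
    flip_h = rng.random() < 0.5  # reverse the pixels within each row
    flipped = []
    for frame in frames:
        rows = frame[::-1] if flip_v else frame
        if flip_h:
            rows = [row[::-1] for row in rows]
        else:
            rows = [list(row) for row in rows]
        flipped.append(rows)
    return flipped, (flip_v, flip_h)


sequence = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
augmented, flags = random_flip_sequence(sequence, random.Random(0))
print(flags, augmented)
```

Since the target is a diameter rather than a position, flipping the frames leaves the labels unchanged, so no label transformation is needed in this sketch.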

5 Experiments

The architecture proposed in Sect. 4 is compared with the currently adopted approach. This method provides fully automated lumen identification on prenatal US images of the abdominal aorta [3], based on edge-based level sets. To understand the behaviour of different feature extraction methods, we also explored the performance of deeper network architectures, in which AlexNet was replaced by InceptionV4 [13] and DenseNet121 [14].

Table 1. Mean (standard deviation) of the MSE and RE for all the compared models. The combination of C-GRU and CyclicLoss with AlexNet yields the best performance. Adding recurrent units to any CNN architecture improves its performance; however, deeper networks such as InceptionV4 and DenseNet do not show any particular benefit with respect to the simpler AlexNet. We also report the p-value for the comparison of each model with the proposed network AlexNet+C-GRU+CL; in this case the significance level should be 0.05/7 using the Bonferroni correction [17].

Methods             MSE [mm²]      RE [%]       p-value
AlexNet             0.29 (0.09)    8.67 (10)    1.01e−12
AlexNet+C-GRU       0.093 (0.191)  6.11 (5.22)  1.21e−05
AlexNet+C-GRU+CL    0.085 (0.17)   5.23 (4.91)  –
DenseNet121         0.31 (0.56)    9.55 (8.52)  6.00e−13
DenseNet121+C-GRU   0.13 (0.21)    7.72 (5.46)  7.78e−12
InceptionV4         6.81 (14)      50.4 (39.5)  6.81e−12
InceptionV4+C-GRU   0.76 (1.08)    16.3 (9.83)  2.89e−48
Level-set           0.31 (0.80)    8.13 (9.39)  1.9e−04
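The two error measures reported in Table 1 can be written out as a short sketch. It assumes that RE is the mean of |y − ŷ| / |y| expressed in percent, which matches the description "mean absolute relative error" but is not spelled out as a formula in the text; the function names are hypothetical.

```python
def mean_squared_error(y, y_hat):
    """Mean of the squared differences (mm^2 when diameters are in mm)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mean_relative_error_percent(y, y_hat):
    """Mean absolute relative error, in percent of the target diameter."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(y, y_hat)) / len(y)


# Toy example with diameters in millimetres (values are illustrative only).
target = [4.0, 5.0]
estimate = [4.5, 4.5]
print(mean_squared_error(target, estimate))
print(mean_relative_error_percent(target, estimate))
```

The MSE weights large deviations more heavily, while the RE relates each error to the size of the vessel, which is why the two columns of the table do not always rank the models by the same margin.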


The performance of each method was evaluated both with respect to the mean squared error (MSE) and to the mean absolute relative error (RE); all values are reported in Table 1 in terms of average and standard deviation across the test set. In order to provide a visual assessment of the performance, representative estimations on four sequences of the test set are shown in Fig. 2. The naive architecture relying on a standard loss and its C-GRU version are unable to capture the periodicity of the diameter. The problem is mitigated by adding the CyclicLoss regularization to the MSE. This is shown quantitatively in Table 1, where the use of this loss further decreases the MSE from 0.093 mm² to 0.085 mm², and the relative error from 6.11% to 5.23%. Strikingly, we observed that deeper networks are not able to outperform AlexNet on this dataset. Their limitation may be due to over-fitting. Nevertheless, the use of C-GRU greatly improves the performance of both networks in terms of both MSE and RE. Further, we performed a non-parametric test (Kolmogorov-Smirnov test) to check whether the best model was statistically different from the others. The results obtained with the complete model AlexNet+C-GRU+CL are indeed significantly different from all others (p < 0.05), also when the significance level is adjusted for multiple comparisons by applying the Bonferroni correction [17,18].
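The Bonferroni adjustment mentioned above amounts to a simple check, sketched here with the p-values listed in Table 1. The dictionary and function names are hypothetical, and the p-values are copied from the table rather than recomputed.

```python
# p-values of each model against AlexNet+C-GRU+CL, as reported in Table 1.
P_VALUES = {
    "AlexNet": 1.01e-12,
    "AlexNet+C-GRU": 1.21e-05,
    "DenseNet121": 6.00e-13,
    "DenseNet121+C-GRU": 7.78e-12,
    "InceptionV4": 6.81e-12,
    "InceptionV4+C-GRU": 2.89e-48,
    "Level-set": 1.9e-04,
}


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level after the Bonferroni correction."""
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """Map each model to True if its p-value is below the corrected level."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p < threshold for name, p in p_values.items()}


print(bonferroni_threshold(0.05, len(P_VALUES)))
print(significant_after_bonferroni(P_VALUES))
```

With seven comparisons the corrected level is 0.05/7, roughly 0.007, and every p-value in the table lies below it, which is consistent with the statement in the text.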

6 Discussion and Conclusion

The deep learning (DL) architecture proposed shows excellent performance compared to traditional image analysis methods, both in accuracy and efficiency. This improvement is achieved through a combination of a shallow CNN and the exploitation of temporal and cyclic coherence. Our results seem to indicate that a shallow CNN performs better than deeper CNNs such as DenseNet121 and InceptionV4; this might be due to the small size of the data set, a common issue in medical settings when manual annotation of the data is required.

6.1 The CyclicLoss Benefits

The exploitation of temporal coherence is what pushes the performance of the DL solution beyond current image analysis methods, reducing the MSE from 0.29 mm² (naive architecture) to 0.09 mm² with the addition of the C-GRU. The CyclicLoss is an efficient way to guide the training of the DL solution for data showing some periodicity, as in cardiovascular imaging. Please note that knowledge of the signal period is only required by the network during training, and as such it does not bring additional requirements on the input data for real clinical application. We argue that the CyclicLoss makes the network learn to expect a periodic input and to provide some periodicity in the output sequence.

6.2 Limitations and Future Works

A drawback of this work is that it assumes the presence of the vessel in the current field of view. Further research is thus required to evaluate how well the solution adapts to a lack of cyclic consistency, when the vessel of interest can move in and out of the field of view during the acquisition, and to investigate the possibility of a concurrent estimation of the cardiac cycle and vessel diameter. Finally, the C-GRU used in our architecture has two particular advantages compared to previous approaches [10,11]: first, it is not subject to the vanishing gradient problem as the RNN is, which allows training on long sequences of data. Second, it has a lower computational cost than the LSTM, which makes it suitable for real-time video applications.

Acknowledgement. This work was supported by the Wellcome/EPSRC Centre for Medical Engineering at Kings College London (WT 203148/Z/16/Z). Dr. Lamata holds a Wellcome Trust Senior Research Fellowship (grant n.209450/Z/17/Z).

References

1. Visentin, S., Grumolato, F., Nardelli, G.B., Di Camillo, B., Grisan, E., Cosmi, E.: Early origins of adult disease: low birth weight and vascular remodeling. Atherosclerosis 237(2), 391–399 (2014)
2. Veronese, E., Tarroni, G., Visentin, S., Cosmi, E., Linguraru, M.G., Grisan, E.: Estimation of prenatal aorta intima-media thickness from ultrasound examination. Phys. Med. Biol. 59(21), 6355–6371 (2014)
3. Tarroni, G., Visentin, S., Cosmi, E., Grisan, E.: Fully-automated identification and segmentation of aortic lumen from fetal ultrasound images. In: IEEE EMBC, pp. 153–156 (2015)
4. Molinari, F., Zeng, G., Suri, J.S.: A state of the art review on intima-media thickness (IMT) measurement and wall segmentation techniques for carotid ultrasound. Comp. Meth. Prog. Biomed. 100(3), 201–221 (2010)
5. Loizou, C.P.: A review of ultrasound common carotid artery image and video segmentation techniques. Med. Biol. Eng. Comp. 52(12), 1073–1093 (2014)
6. Cosmi, E., Visentin, S., Fanelli, T., Mautone, A.J., Zanardo, V.: Aortic intima media thickness in fetuses and children with intrauterine growth restriction. Obs. Gyn. 114, 1109–1114 (2009)
7. Skilton, M.R., Evans, N., Griffiths, K.A., Harmer, J.A., Celermajer, D.S.: Aortic wall thickness in newborns with intrauterine growth restriction. Lancet 365, 1484–1486 (2005)
8. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
9. Shin, J.Y., Tajbakhsh, N., Hurst, R.T., Kendall, C.B., Liang, J.: Automating carotid intima-media thickness video interpretation with convolutional neural networks. In: IEEE CVPR Conference, pp. 2526–2535 (2016)
10. Chen, H., et al.: Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 507–514. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24553-9_62


11. Huang, W., Bridge, C.P., Noble, J.A., Zisserman, A.: Temporal HeartNet: towards human-level automatic analysis of fetal cardiac screening video. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10434, pp. 341–349. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66185-8_39
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012, pp. 1097–1105 (2012)
13. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-ResNet and the impact of residual connections on learning. In: AAAI 2017, pp. 4278–4284 (2017)
14. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE CVPR Conference, pp. 2261–2269 (2017)
15. Siam, M., Valipour, A., Jägersand, M., Ray, N.: Convolutional gated recurrent networks for video segmentation. In: IEEE ICIP Conference, pp. 3090–3094 (2017)
16. Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: 3rd International Conference for Learning Representations (2015)
17. Bonferroni, C.E.: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Regio Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936)
18. Dunn, O.J.: Multiple comparisons among means. J. Am. Stat. Assoc. 56(293), 52–64 (1961)

An Original Neural Network for Pulmonary Tuberculosis Diagnosis in Radiographs

Junyu Liu, Yang Liu, Cheng Wang, Anwei Li, Bowen Meng, Xiangfei Chai, and Panli Zuo (✉)

Huiying Medical Technology (Beijing) Co., Ltd., Beijing, China

Abstract. Tuberculosis (TB) is a widespread and highly contagious disease that may cause serious harm to patient health. With the development of neural networks, increasing attention is being paid to applying deep learning to TB diagnosis. Earlier works validated the feasibility of neural networks for this task but still suffer from low accuracy, owing to the lack of samples and the complexity of radiograph information. In this work, we propose an end-to-end neural network system for TB diagnosis that combines preprocessing, lung segmentation, feature extraction and classification. We achieved an accuracy of 0.961 on our labeled dataset, and 0.923 and 0.890 on the Shenzhen and Montgomery public datasets respectively, demonstrating that our work outperforms the state-of-the-art methods in this area.

Keywords: Tuberculosis · Classification · DNN

1 Introduction

Tuberculosis is a highly contagious disease that may cause serious harm to patient health. According to the World Health Organization (WHO) [1], by the end of 2015 nearly 10 million people in the world suffered from tuberculosis and more than 1.5 million had died. The WHO pointed out that early diagnosis and appropriate treatment can avoid the majority of tuberculosis deaths, and millions of people are saved each year. Nonetheless, a huge number of people still suffer because of high costs and a lack of professional doctors. A reliable tuberculosis diagnosis system is therefore urgently needed.

At present, a large amount of medical image data has not yet been digitized, and the level of data sharing and interoperability among hospitals is still low. This creates a dilemma: advanced methods usually require big data, which is rarely available for medical datasets. It is also difficult to obtain reliable labels in the medical imaging field because of the interdisciplinary gap. In addition, medical images contain more difficult samples and pixel-scale features, making AI image analysis in the medical field more challenging than natural image recognition. This work proposes a neural network specialized for pulmonary tuberculosis diagnosis in radiographs to address the above difficulties.

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 158–166, 2018. https://doi.org/10.1007/978-3-030-01421-6_16

2 Related Works

In 2012, Hinton's team [2] first applied a convolutional neural network to the ImageNet classification challenge and achieved astonishing results, drastically reducing the Top-5 error rate from 26% to 15%. This started a boom in deep learning. At present, deep learning has achieved remarkable results in fields such as image recognition, detection and segmentation [3, 5]. Deep learning technology was first officially applied to medical image analysis in 2015. Convolutional neural networks (CNNs) soon gained popularity due to their ability to learn mid- and high-level image representations.

Bar et al. explored the ability of a CNN to identify different types of pathologies in chest X-ray images [6]. They used a CNN pre-trained on the ImageNet dataset as the first descriptor. The second descriptor is PiCoDes, a compact high-level representation of popular low-level features (SIFT [7], GIST, PHOG and SSIM) that is optimized over a subset of the ImageNet dataset containing approximately 70,000 images. They found that the best performance was achieved using a combination of features extracted from the CNN and a set of low-level features. However, the capacity of such a system is limited by the lack of task-specific training. Lopes et al. used pre-trained CNNs as feature extractors, combined with traditional machine learning methods, for tuberculosis detection [8]. They first used separate networks to extract features, then integrated the CNN features, and finally created an ensemble classifier by combining the SVMs trained on the features extracted from GoogLeNet [9], ResNet [10] and VggNet [11].

The authors of [12] proposed a method to detect pulmonary tuberculosis in two steps. The first step uses pre-trained networks to perform binary classification of chest X-rays. For classification, the chest X-rays are resized to the input size of each network, and the predictions of all classification networks are averaged to give the final classification result. The second step uses the sensitivity of the softmax score to the occlusion of a region of the chest X-ray to find which region in the image is responsible for the classification decision. However, such aggressive resizing sharply reduces the accuracy of the system.

Ronneberger et al. proposed a network called U-Net [13] for small-sample segmentation. The network consists of two parts: a contracting path that captures contextual information and a symmetric expanding path for precise localization. To make more efficient use of the annotated data, they also use a variety of data augmentation methods. In 2016, Milletari et al. proposed an extension to the U-Net layout that incorporates ResNet-like residual blocks and a Dice loss layer rather than the conventional cross-entropy [14].

Inspired by the works mentioned above, we propose a deep neural network that combines segmentation and classification to detect tuberculosis in chest X-rays. All chest X-rays are preprocessed to emphasize lung features. The main body of the network has two branches: one is a purpose-designed lung segmentation network that produces chest masks, and the other is a classification network. We achieve an accuracy of 0.961 on our dataset, and 0.923 and 0.890 on the Shenzhen and Montgomery public datasets respectively, which outperforms the state of the art in this area.

3 Proposed Methods

3.1 Method Overview

We propose an end-to-end network for tuberculosis diagnosis. The whole system consists of a lung segmentation network, a classification backbone and an output head. Heat maps are generated for further analysis and algorithm verification. To our knowledge, this is the first work to combine all the steps of tuberculosis detection in a single network, making a compromise between computational speed and preservation of image information. The whole system is shown in Fig. 1.

Fig. 1. The block diagram of the proposed network. (Block labels: Lung Segmentation Network, Network Backbone, ConvNet, Heat Map, GAP, FC layer, Output.)

3.2 Lung Segmentation Network

According to [14], lung segmentation is necessary for automatic tuberculosis diagnosis. In this paper, we designed a simple and effective CNN with atrous convolutional layers [18] to segment the chest from X-rays, following U-Net. The basic feature extraction part has 3 conv-pooling blocks with different numbers of channels. Each conv-pooling block contains a pooling layer after a few convolutional blocks, and each convolutional block consists of a convolutional layer followed by a Batch-Norm layer and a ReLU activation layer. In total, the input is subsampled by a factor of 8. The network structure is shown in Fig. 2.

Fig. 2. ConvNet configuration for feature extraction. (Panel labels: Image, Conv1 3*3*64, Conv2 3*3*128, Conv3 3*3*256, Pooling, ReLU, Atrous conv1 rate=1, Atrous conv2 rate=4, Atrous conv3 rate=8, Output.)

Lungs in radiographs differ in size because of individual differences and other factors. Multi-scale segmentation was therefore also taken into consideration. We used 3 atrous convolutional layers, each with a different sampling rate. All the feature maps obtained by dilated convolution are added together and connected to the decoder of the network. Segmentation results are generated by successive up-sampling. To overcome the low resolution after down-sampling in the FCN [17] method, we fused each down-sampled feature map with that of the corresponding up-sampling stage. Chest segmentation results are shown in Fig. 3.
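The multi-scale step above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it handles a single channel, uses hand-supplied 3 × 3 kernels, and assumes the rates 1, 4 and 8 from the labels of Fig. 2. The function names `dilated_conv2d` and `multi_scale_features` are hypothetical.

```python
import numpy as np

def dilated_conv2d(image, kernel, rate):
    """Zero-padded, same-size 2-D atrous (dilated) cross-correlation.

    Assumes a single-channel image and an odd-sized kernel.
    """
    kh, kw = kernel.shape
    # A dilated kernel spans (k - 1) * rate + 1 pixels along each axis.
    pad_h = (kh - 1) * rate // 2
    pad_w = (kw - 1) * rate // 2
    padded = np.pad(image.astype(float), ((pad_h, pad_h), (pad_w, pad_w)))
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap reads the input shifted by a multiple of the rate.
            out += kernel[i, j] * padded[i * rate:i * rate + h,
                                         j * rate:j * rate + w]
    return out

def multi_scale_features(image, kernels, rates=(1, 4, 8)):
    """Sum the responses of atrous convolutions with different rates."""
    return sum(dilated_conv2d(image, k, r) for k, r in zip(kernels, rates))
```

Because padding grows with the rate, every branch returns a map of the input size, so the three responses can be added element-wise before being passed to a decoder.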

Fig. 3. Chest segmentation results. Left: original picture; Middle: segmentation result; Right: evaluation result.

3.3 Specialized Innovations

Preprocessing. Radiographs need preprocessing before analysis. The gray levels of chest X-ray pixels usually range from tens to thousands, and human eyes cannot distinguish such a wide range. Such large input scales also tend to cause the diagnosis network to diverge. Therefore, the original pixel values are adjusted according to WW (window width) and WP (window position). Because not all radiographs come with guidance values of WW and WP, a standard set of WW and WP was generated with a clustering algorithm from the samples that do include guidance values. We also found that histogram equalization emphasizes the features in the lungs without significantly changing the gray levels of other organs and the background. Original radiographs are often as large as two thousand pixels on a side, which is a huge computational burden. However, considering that some granular infections can be very small, input images are resized to 1024 × 1024 by bilinear interpolation.

Two Branches. The main body of the proposed network has two branches, one for lung segmentation and the other for feature extraction with the network backbone. We chose 6 popular and practical backbones in total for this work. To limit memory use and computation time, the output of the segmentation branch is used to mask the feature map subsampled by a factor of 32 rather than the original image, which allows the two branches to run simultaneously.

Network Head. The last part of the network has two heads. The classification head is specialized for this task. As the input of our system is much larger than that used in typical classification competitions, we need more subsampling than the original networks. In practice, we adopted a total downsampling factor of 128 in our network. The high similarity among radiographs is a hazard in this task, as it tends to cause overfitting. We therefore added a heat map head to check whether the correct features of the radiographs have been learned. For heat map generation, the second-to-last fully connected (FC) layer is replaced by a global average pooling (GAP) [19] layer, which also reduces the number of parameters in the network. Considering the imbalance between positive and negative samples, and the fact that a false negative (FN) is much more harmful in the medical domain, focal loss [4] is introduced in this work, giving positive samples a higher loss during training.
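Two of the steps in this section lend themselves to a short NumPy sketch: intensity windowing with WW and WP, and the binary focal loss of [4]. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the window is taken to be centered on WP, and `alpha=0.75` and `gamma=2.0` are illustrative values chosen only so that positive samples are weighted more heavily; the paper does not report its settings.

```python
import numpy as np

def apply_window(pixels, width, position):
    """Clip raw gray levels to [WP - WW/2, WP + WW/2] and rescale to [0, 1].

    Assumption: the window position (WP) is the center of the window.
    """
    low = position - width / 2.0
    high = position + width / 2.0
    clipped = np.clip(pixels.astype(float), low, high)
    return (clipped - low) / (high - low)

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Per-sample binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y the 0/1 label.
    alpha > 0.5 (an assumed value) weights positive samples more heavily.
    """
    p = np.clip(p, 1e-7, 1.0 - 1e-7)            # avoid log(0)
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With `gamma=0` and `alpha=0.5` the loss reduces to half the ordinary cross-entropy; increasing `gamma` shrinks the contribution of samples that are already classified confidently.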

4 Experiments

4.1 Database

The data used in this paper come from two sources. The first dataset was provided by Huiying Medical Technology (Beijing) Co., Ltd. and contains 2443 frontal chest X-ray images (DICOM format), with labels marked by a reliable expert network. From this dataset, 2000 images were randomly chosen as the training set and the rest were divided into validation and test sets. In addition, two public datasets [20] are available on the Internet. The Shenzhen Hospital dataset, which includes 662 frontal chest X-rays, was acquired from Shenzhen No. 3 People's Hospital in Shenzhen, China. The Montgomery County chest X-ray set (MC) was collected in collaboration with the Department of Health and Human Services, Montgomery County, Maryland, USA, and consists of 138 frontal X-rays.

4.2 Experimental Results

To test the performance of the network with different backbones, parallel comparisons were made on our test dataset. Accuracy, sensitivity, specificity, AP and AUC results are shown in Table 1. The Inception v4 backbone without the mask branch was also tested. For a more intuitive view, the P-R curves and ROCs are shown in Fig. 4.

Table 1. Parallel comparisons of each method for our dataset

Backbone                 AUC    Accuracy  AP     Sensitivity  Specificity
VGG-19                   0.974  0.893     0.981  0.988        0.765
ResNet-50                0.983  0.875     0.992  0.979        0.892
ResNet-101               0.989  0.879     0.992  0.972        0.932
ResNet-152               0.991  0.923     0.994  0.960        0.945
Inception v4             0.995  0.961     0.994  0.966        0.955
Inception-ResNet v2      0.982  0.934     0.984  0.948        0.915
Inception v4 (no mask)   0.953  0.908     0.947  0.821        0.954
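For readers who want to reproduce this kind of comparison on their own predictions, the reported quantities can be computed as in the sketch below. The function names are hypothetical, the counts in the usage note are made up, and the AUC is computed with the rank (Mann-Whitney) formulation, which may differ from the tooling the authors used.

```python
import numpy as np

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
    }

def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive/negative score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))
```

For example, with illustrative counts `classification_metrics(tp=90, fp=20, tn=80, fn=10)` gives a sensitivity of 0.9 and a specificity of 0.8, which shows how a model can trade one for the other at a similar accuracy.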

Fig. 4. P-R curves (left) and ROCs (right). (Curves shown for Inception v4, ResNet-101, ResNet-152, ResNet-50, Inception-ResNet and VGG19; axes: Precision vs. Recall and TPR vs. FPR.)

The results show that our method reached its highest accuracy, 96.1%, on our test dataset with the Inception v4 backbone. The mask branch contributed about 5.3 percentage points of accuracy (0.961 with the mask versus 0.908 without it). We also reselected the training set and retrained our networks from scratch to exclude the possibility of coincidence. In addition, we checked the heat maps generated by our network and found them reasonable, although slight bias and blur occur because of the 128-fold subsampling. The visualized results are shown in Fig. 5.

Fig. 5. The heat map acquired in our network. Although slight positioning bias occurs because of the total 128-fold subsampling, the red area roughly reflects the position of infection. (Color figure online)
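A heat map of this kind can be sketched in the style of class activation mapping [19]: the feature maps that feed the GAP layer are combined with the weights of the positive class and then enlarged to the input resolution. The sketch below is a simplified NumPy illustration, not the authors' code; the function name, the nearest-neighbor upsampling and the min-max normalization are assumptions.

```python
import numpy as np

def class_activation_map(features, class_weights, upsample=128):
    """Weighted sum of feature maps, normalized to [0, 1] and upsampled.

    features:      array of shape (K, h, w), the maps fed to the GAP layer
    class_weights: array of shape (K,), weights of the class of interest
    upsample:      integer factor; 128 maps an 8 x 8 grid back to 1024 x 1024
    """
    cam = np.tensordot(class_weights, features, axes=1)   # shape (h, w)
    cam = cam - cam.min()
    if cam.max() > 0:
        cam = cam / cam.max()
    # Nearest-neighbor upsampling: each cell becomes an upsample x upsample block.
    return np.kron(cam, np.ones((upsample, upsample)))
```

Each cell of the coarse map covers a 128 × 128 block of the input in this sketch, which is one way to see why the highlighted region can only roughly indicate the position of a lesion.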

Longitudinal comparisons with former works [8, 12, 15, 16] were also carried out. To be fair and objective, we compared the results of the proposed method with those of the other works on the two public datasets. All results of former works cited in this paper are the best results claimed by their authors. The models we used were still the ones trained on our own dataset. Figure 6 shows the visualized results of our networks on the Shenzhen Dataset. The comparison with former works is shown in Table 2.

Fig. 6. P-R curves (left) and ROCs (right) of our networks on Shenzhen Dataset. (Curves shown for Inception v4, ResNet-152 and Inception-ResNet.)

Table 2. Performance for Shenzhen Dataset. Last three are proposed methods.

Method                 AUC    Accuracy
U.K. Lopes et al.      0.894  0.837
Mohammad et al.        0.940  0.900
Sangheum et al.        0.926  0.837
ResNet-152             0.967  0.923
Inception v4           0.979  0.897
Inception-ResNet v2    0.983  0.917

AP 0.940 0.971 0.965 0.985

Sensitivity 0.960 0.978 0.923 0.857

Speciﬁcity 0.960 0.986 0.937 0.981

Results on the Montgomery Dataset are shown in Fig. 7 and Table 3. We found that many radiographs in the MC Dataset contain large black blocks that seriously disturb histogram equalization, making the background of the preprocessed images lighter than usual. We cut off the black blocks and resized the images, and saw a substantial improvement in the results.

Fig. 7. P-R curves (left) and ROCs (right) of our networks on MC Dataset. (Curves shown for Inception v4, ResNet-152 and Inception-ResNet.)
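The cropping of black blocks mentioned for the MC Dataset can be approximated with a bounding-box crop, sketched below in NumPy. This is an assumption about how such a step might look, not the authors' procedure: the function name and the `threshold` parameter are hypothetical, and the sketch only removes borders, not black blocks inside the image.

```python
import numpy as np

def crop_black_borders(image, threshold=0):
    """Crop a 2-D image to the bounding box of pixels brighter than threshold."""
    mask = image > threshold
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing any content
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing any content
    if rows.size == 0 or cols.size == 0:
        return image                          # nothing but background: leave as is
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Removing such borders before histogram equalization keeps a large mass of zero-valued pixels from dominating the histogram, which is consistent with the lighter background the authors observed when the borders were left in place.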

Table 3. Performance for MC Dataset. Last three are proposed methods.

Method                 AUC    Accuracy
U.K. Lopes et al.      0.926  0.810
Stefan Jaeger et al.   0.831  0.75
Sangheum et al.        0.884  0.674
ResNet-152             0.951  0.890
Inception v4           0.914  0.822
Inception-ResNet v2    0.957  0.844

AP 0.890 0.935 0.884 0.965

Sensitivity ~0.5 0.711 0.654 0.618

Speciﬁcity ~0.9 0.955 0.938 0.913

The longitudinal and parallel experimental results show the advantages of our proposed network. The models achieved relatively good results on our own test set. It is hard to explain why ResNet-152 seems to do better than the other network backbones on the public datasets. Nevertheless, our models adapted well to the public datasets and outperformed the previously reported state-of-the-art results.

5 Conclusion and Future Work

We proposed an end-to-end network for pulmonary tuberculosis classification, including preprocessing, lung segmentation and classification. The system optimizes inference time while maintaining accuracy. Future work will include (1) specialized optimization of the network backbones, (2) optimization of the preprocessing to increase the adaptability of the network, and (3) extension of this system to the detection of the focus of infection.

Acknowledgement. We would like to thank Huiying Medical Technology (Beijing) Co., Ltd. for providing essential resources and support.

References
1. World Health Organization (WHO): Global tuberculosis report (2017). http://www.who.int/tb/publications/global_report/en/. Accessed 26 May 2018
2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems, pp. 1097–1105. Curran Associates Inc. (2012)
3. Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
4. Lin, T.Y., Goyal, P., Girshick, R., et al.: Focal loss for dense object detection. In: IEEE Computer Society, pp. 2999–3007 (2017)
5. He, K., Gkioxari, G., Dollár, P., et al.: Mask R-CNN. In: Computer Vision and Pattern Recognition (CVPR) (2017)
6. Bar, Y., Diamant, I., Wolf, L., et al.: Deep learning with non-medical training used for chest pathology identification. In: Medical Imaging 2015: Computer-Aided Diagnosis, p. 94140V (2015)
7. Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT flow: dense correspondence across different scenes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5304, pp. 28–42. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88690-7_3
8. Lopes, U.K., Valiati, J.F.: Pre-trained convolutional neural networks as feature extractors for tuberculosis detection. Comput. Biol. Med. 89, 135–143 (2017)
9. Szegedy, C., Liu, W., Jia, Y., et al.: Going deeper with convolutions. In: IEEE Computer Society, pp. 1–9 (2014)
10. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition, pp. 770–778. IEEE Computer Society (2015)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Computer Science (2014)
12. Islam, M.T., Aowal, M.A., Minhaz, A.T., et al.: Abnormality detection and localization in chest x-rays using deep convolutional neural networks. arXiv (2017)
13. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
14. Drozdzal, M., et al.: The importance of skip connections in biomedical image segmentation. In: Carneiro, G. et al. (eds.) Deep Learning and Data Labeling for Medical Applications
15. Jaeger, S., Karargyris, A., Antani, S., et al.: Detecting tuberculosis in radiographs using combined lung masks. Conf. Proc. IEEE. Eng. Med. Biol. Soc. 2012(4), 4978–4981 (2012)
16. Hwang, S., et al.: A novel approach for tuberculosis screening based on deep convolutional neural networks. In: Medical Imaging 2016: Computer-Aided Diagnosis, p. 97852W. International Society for Optics and Photonics (2016)
17. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2014)
18. Chen, L.C., Papandreou, G., Kokkinos, I., et al.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)
19. Zhou, B., et al.: Learning deep features for discriminative localization. In: Computer Vision and Pattern Recognition, pp. 2921–2929. IEEE (2016)
20. Stefan, J., et al.: Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med. Surg. 4(6), 475–477 (2014)

Computerized Counting-Based System for Acute Lymphoblastic Leukemia Detection in Microscopic Blood Images

Karima Ben-Suliman and Adam Krzyżak (✉)

Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec H3G 1M8, Canada
{k_bensul,krzyzak}@cse.concordia.ca

Abstract. Counting white blood cells (WBCs) and detecting morphological abnormalities of these cells allow the diagnosis of blood diseases such as leukemia. This can be accomplished by automatic quantitative analysis of microscope images of blood smears. This paper presents a novel framework consisting of two sub-systems that serve as indicators for the detection of Acute Lymphoblastic Leukemia (ALL). The first sub-system counts WBCs by adapting a deep learning based approach to separate agglomerates of WBCs. After the separation of WBCs, the second sub-system detects and counts the abnormal WBCs (lymphoblasts) required to diagnose ALL. The performance of the proposed framework is evaluated on the ALL-IDB dataset. The first sub-system is able to count WBCs with an accuracy of up to 97.38%. Furthermore, an approach using ensemble classifiers based on handcrafted features is able to detect and count the lymphoblasts with an average accuracy of 98.67%.

1 Introduction

Counting of white blood cells (WBCs) is a diagnostic procedure used to detect blood malignancies. Leukemia is a blood cancer that develops from the stem cells of the bone marrow and affects the function of WBCs and their number. Leukemia can be preliminarily classified based on the progression of the disease, i.e. acute or chronic. In addition, classification can be based on the cell lineage of the stem cells, i.e. lymphoid or myeloid. In this paper, we only consider Acute Lymphoblastic Leukemia (ALL), which affects a specific type of WBCs called lymphocytes. Manual morphological observation of blood cells under the microscope and automated haematology counting are two procedures used to diagnose ALL [1]. Observation of blood cells under the microscope requires a few drops of a patient's blood sample on a slide. Different stains are then added to the slide to help specialists identify the different blood cells. Afterward, the blood slide is examined under the microscope at different magnifications to count WBCs and detect lymphoblasts. Detecting at least 20% of lymphoblasts in the bone marrow or peripheral blood can be an indicator for ALL diagnosis. Although this process is very basic, it becomes exhausting when a medical expert needs to observe blood samples collected from numerous patients under the microscope to count the normal and abnormal blood cells. This approach can be difficult even for a specialist, because it requires experience and extensive knowledge to distinguish the morphological abnormalities of the blood cells. On the other hand, an automated haematology counter, which is another way of counting WBCs, produces its output in a timely manner and differentiates between blood cells by measuring cell volume and blood cell morphology with mechanical and electronic approaches. However, such an automatic system can only count cells and cannot identify their abnormalities. For this reason, WBCs have to be analyzed manually under the microscope [2].

In this paper, we propose a computer-aided system that comprises two sub-systems. The first one separates and counts WBCs, including normal and abnormal cells, by adapting a deep-learning-based approach to overcome agglomerates of WBCs, and compares the results with related works. The goal of the second sub-system is to detect the lymphoblasts that lead to a diagnosis of ALL. To the best of our knowledge, the presented system is the first automated system for counting lymphoblasts in microscopic images. This paper is structured as follows: Sect. 2 presents background and related works. Section 3 describes the dataset used. Section 4 presents the detailed process of both sub-systems for counting WBCs and lymphoblasts. Section 5 reports the experimental results and discussion. Finally, Sect. 6 presents conclusions.

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 167–178, 2018. https://doi.org/10.1007/978-3-030-01421-6_17

2 Related Work

In this section, we only consider automated systems for detecting and counting WBCs and lymphoblasts. Tan Le et al. [3] proposed a framework for counting WBCs. To extract WBCs from the background, a threshold is applied in the Haematoxylin-Eosin-DAB (HED) color space. The edges of the segmented WBCs are then detected using the Canny edge detector, and touching cells are separated using the watershed segmentation algorithm. Although this approach achieved an accuracy of 90%, no specific method is given for determining the threshold value. A different approach was proposed by Putzu et al. [4] to count WBCs. The identification of WBCs is based on a threshold determined by the Zack algorithm on the Y component of the CMYK color space. Watershed segmentation is then performed to separate adjacent cells. The proposed approach achieved an accuracy of 92%. However, the authors note that no good results were obtained when the overlap between WBCs is significant.

In [5], Bhavnani et al. used Otsu's method and morphological operations on the green component of the RGB color space to isolate WBCs. Connected component labeling is then used to count the WBCs. Although the system counts WBCs with an accuracy of 94.25%, its tolerance of complex overlapping and irregular cells is limited. Moreover, Otsu's method may not be suitable when the background and foreground of an image are not clearly separated. This framework is also only partially developed, because it uses morphological operations to isolate touching WBCs. This in turn changes the morphological characteristics of the blood cells, so the method cannot be used in a full detection system. Basima and Panicker [6] applied the K-means algorithm to the Y component of the CMYK color space to segment WBCs, followed by watershed segmentation to separate them. However, segmenting WBCs with K-means loses part of the cytoplasm region, which is essential for distinguishing lymphocytes from lymphoblasts. In addition, the accuracy of WBC counting obtained by the proposed approach is not reported.

Alomari et al. [7] proposed another method for counting WBCs. The detection of WBCs is based on thresholding, and the cells are then counted by an iterative structured circle detection approach. This framework exhibits an average accuracy of 98.4% for counting WBCs. However, the algorithm tolerates overlapping cells only to a certain degree and produces a noticeable number of false positives. Moreover, selecting the optimal threshold value is very challenging. Loddo et al. [8] introduced an approach to detect and count WBCs. Pixel-based classification with a support vector machine is used to segment the WBCs. All single WBCs are then counted using connected component labeling, and the remaining agglomerates of cells are counted by the Circular Hough Transform (CHT). Although this approach exhibits an average accuracy of 99.2% for WBCs, the work is only partially developed and neglects adjacent cells, which limits the counting and analysis of lymphoblasts. Hence, further human visual inspection is required to detect the abnormal cells.

In the available literature, the only work on counting lymphoblasts to detect abnormality is that of Halim et al. [9], who proposed an automatic framework to count blasts (lymphoblasts and myeloblasts) for acute leukemia in blood samples. To segment the blasts from the background, histogram-based thresholding is performed on the S component of the HSV color space. Morphological erosion is then used to separate touching cells. While this approach provides an accuracy of 97.8%, determining the optimal threshold is not an easy task, and a threshold may work for some images but fail for others because of lighting conditions. Moreover, the blood sample images included in that study consist only of blasts, and no other WBCs are involved.

There have been several attempts at effective counting of WBCs, but few authors have proposed methods that count cells while handling adjacent cells and detecting abnormalities among them. To tackle the issues mentioned above, we propose an automated system for detecting the presence of ALL. This system consists of two sub-systems that can serve as indicators to diagnose patients who may suffer from ALL. The first sub-system counts WBCs, and the second sub-system detects and counts lymphoblasts.
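To make the limitation of threshold-based counting concrete, the sketch below combines Otsu's threshold with connected component counting, the two building blocks named for [5]. It is a generic NumPy illustration and not the implementation of any cited work; the function names, the use of 4-connectivity and the assumption that cells are darker than the background are all choices made here for the example.

```python
from collections import deque

import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # weight of the class <= t
    mu = np.cumsum(prob * np.arange(256))        # cumulative mean up to t
    numerator = (mu[-1] * omega - mu) ** 2
    denominator = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denominator > 0
    sigma_b[valid] = numerator[valid] / denominator[valid]
    return int(np.argmax(sigma_b))

def count_components(mask):
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                count += 1
                seen[y, x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return count
```

Calling `count_components(gray <= otsu_threshold(gray))` counts dark blobs correctly as long as they are isolated, but two touching cells form a single connected region and are counted once, which is the failure mode that motivates an explicit separation step.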

3 Dataset

To test the proposed approach, the Acute Lymphoblastic Leukemia Image Database (ALL-IDB) was used [10]. It is a public dataset proposed by Donida Labati. ALL-IDB includes microscopic images of peripheral blood samples of healthy individuals and of patients suffering from leukemia, as shown in Fig. 1. The microscopic images were collected by the M. Tettamanti Research Center in Monza, Italy, which specializes in childhood leukemia and hematological diseases. The ALL-IDB dataset is subdivided into two versions. The first version, ALL-IDB1, contains 59 healthy and 49 unhealthy images at a full size of 1712 × 1368. The second version, ALL-IDB2, contains cropped sub-images of 130 normal cells and 130 lymphoblasts of size 257 × 257. The images in both versions were manually labeled by expert oncologists to serve as ground truth. In ALL-IDB1, each image has an associated text file with the coordinates of the centroid of each lymphoblast. In this study, the images of ALL-IDB1, which consists of 108 microscopy images of blood samples, are used. To evaluate the proposed counting system, 50% of the images are used for training, 15% as a validation set to tune the model's hyperparameters, and the remaining images are used to test our model.
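A split with these proportions can be sketched as follows. The paper does not state how the images were assigned or how fractional counts were rounded, so the shuffling seed, the rounding rule and the function name `split_dataset` are assumptions made for illustration.

```python
import random

def split_dataset(items, train_frac=0.50, val_frac=0.15, seed=0):
    """Shuffle items and split them into training, validation and test lists.

    The test set receives whatever remains after the first two subsets.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    items = list(items)
    random.Random(seed).shuffle(items)       # reproducible shuffle
    n_train = int(round(train_frac * len(items)))
    n_val = int(round(val_frac * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Under this rounding assumption, the 108 images of ALL-IDB1 would be divided into 54 training, 16 validation and 38 test images.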

Fig. 1. Samples from ALL-IDB1 for unhealthy patients with high (a) and low magniﬁcations (b).

4 Proposed Method

The method proposed in this work aims to count WBCs and lymphoblasts for acute lymphoblastic leukemia using blood smear images as illustrated in Fig. 2.

Counting-Based System for ALL Detection

[Fig. 2 flow diagram. Boxes: Microscopic blood smear images; Apply SVM for WBCs segmentation; Separation of grouped WBCs using DRBM; Image cleaning; WBCs count. For each WBC: Nucleus and cytoplasm separation; Hand-crafted features extraction; Feature selection; WBC classification using an ensemble approach; Lymphoblasts count.]

Fig. 2. Proposed approach diagram.

4.1 Counting of WBCs

WBCs Segmentation. The segmentation of WBCs, including nuclei and their cytoplasm, builds on the approach of Di Ruberto et al. [8], which uses support vector machine (SVM) based segmentation. This approach is robust to different staining procedures and to illumination problems. Regions representing three categories, WBCs (positive class) and RBCs and background (negative classes), are selected to train a binary SVM with a Gaussian radial basis function (RBF) kernel. These regions are taken from a few images of the ALL-IDB1 training set: 255 regions in total, 85 for the positive class and 170 for the negative classes. As in the work of Di Ruberto et al. [8], color and statistical features are extracted for each pixel of the selected regions: the color features are the R, G, and B intensity values of the pixel, and the statistical features are the average, entropy, uniformity, and standard deviation over the 3 × 3 neighborhood of that pixel. The average segmentation accuracy obtained by 10-fold cross-validation is 95.21%. Figure 3 shows the results of WBCs segmentation.

Fig. 3. WBCs segmentation for microscopic images of unhealthy patients with high (a) and low magnifications (b).

Separation of Grouped WBCs. To separate touching WBCs, we adapted the deep learning approach of Duggal et al. [11], which uses stacked Restricted Boltzmann Machines (RBMs) followed by a discriminative fine-tuning layer, and applied it to the output of the SVM segmentation rather than to that of the K-means algorithm. The discriminative fine-tuning layer is applied on top of the features learned by the RBMs to identify three kinds of pixels: ridge pixels of grouped WBCs, pixels inside WBCs, and pixels located on the boundary of WBCs but not on ridges. The ridge pixels are then discarded, which separates the grouped WBCs. From 12 images of the training set, 12 single clusters of grouped WBCs are extracted; 80% of these clusters are used for training and 20% for validation. The system is trained with three layers of RBMs, whose hidden layers have 100, 300, and 1000 neurons, respectively. Figure 4 shows the separation of WBCs when a patch of size 31 × 31 is used as the feature vector for training.

Fig. 4. Separation of grouped healthy WBCs and lymphoblasts for unhealthy patients with high (a) and low magniﬁcations (b).

Image Cleaning and WBCs Counting. To avoid miscounting WBCs and misdetecting lymphoblasts in the subsequent steps, WBCs that appear only partially at the edge of a microscopic image should be discarded. This is accomplished by suppressing the light structures that are connected to the image border, using 8-connectivity. After segmentation and separation of the WBC agglomerates, the WBCs are counted by connected-component labeling with 8-connectivity. Details of the WBC counting performance are reported in the experimental results section.
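The cleaning and counting step can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation: the function names and the toy mask are hypothetical, and the binary mask stands in for the output of the segmentation and separation stages. Components are labeled with 8-connectivity, and any component that touches the image border is dropped before counting.

```python
from collections import deque

# 8-connectivity: horizontal, vertical and diagonal neighbours of a pixel.
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]


def label_components(mask):
    """Return the 8-connected components of a binary mask as pixel lists."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def count_cells(mask):
    """Count components that do not touch the image border."""
    rows, cols = len(mask), len(mask[0])

    def touches_border(pixels):
        return any(y in (0, rows - 1) or x in (0, cols - 1) for y, x in pixels)

    return sum(1 for comp in label_components(mask) if not touches_border(comp))


# Hypothetical toy mask: one blob on the border, one diagonal blob, one single pixel.
toy_mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(count_cells(toy_mask))
```

With 8-connectivity the two diagonally adjacent pixels form a single component; with 4-connectivity they would be counted as two, which is why the connectivity value matters for the final count.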

4.2 Detection and Counting of Lymphoblasts

Separation of Nucleus and Cytoplasm. Once the WBCs have been separated, a sub-image containing each WBC is obtained using its bounding box. WBC cytoplasm has been observed to have high contrast in the green channel of the RGB colour space [12]. To extract the cytoplasm, the green component of the sub-image of an individual WBC is therefore obtained, and a binary image is computed from it using Otsu's algorithm [13]. To separate the WBC nucleus, the a* component of the Lab colour space is obtained. A binary image is then computed from it using Otsu's algorithm, and this binary image is subtracted from the binary image containing only the cytoplasm.

Feature Extraction. To differentiate lymphoblasts from healthy WBCs (neutrophils, eosinophils, basophils, lymphocytes, and monocytes), three categories of handcrafted features are computed: morphological, textural, and color features. These features describe the nuclear, cytoplasmic, and cellular (a nucleus together with its cytoplasm) changes in each sub-image containing an individual cell. The first group reflects the deformations that result from the malignant transformation of blood cells. Accordingly, 17 morphological features reflect the maturity of a cell: the aspect ratio of the nucleus and cytoplasm; the size of the cell, nucleus, and cytoplasm; nucleus shape descriptors; and the marginal coarseness or irregularity. We compute the marginal features using fractal geometry and the variance of the signature of the nucleus and cytoplasm, as defined in [14]. To capture the granularity present in some WBCs, such as eosinophils and basophils, we use the median robust extended local binary pattern (MRELBP) [15]. To measure the textural changes caused by modifications of the nuclear chromatin distribution, which indicate malignant lymphocytes, 6 statistical features based on wavelet coefficients and 21 Gray-Level Co-occurrence Matrix (GLCM) features are extracted as well [16,17]. Moreover, 6 color features are calculated for the nucleus and for the cytoplasm to reflect the hyperchromatism of malignant lymphocytes. These features are computed from each of the RGB and HSV color spaces. Finally, we add a specific measure reflecting the fact that lymphoblasts contain variably prominent nucleoli [18]. To estimate the number of nucleoli, the K-means algorithm is applied, and the complement of the binary image obtained from Otsu's algorithm reveals the elements that represent nucleoli. Grouping all of the features mentioned above yields a set of 52 features.

Feature Selection. To identify a highly predictive subset of discriminative features among the large feature set, the Maximum Relevance Minimum Redundancy (MRMR) criterion, which is based on mutual information, is applied [19]. MRMR tends to select the features that are most correlated with the class label and least correlated with one another.

Classification. To build a model for lymphoblast detection in microscopic blood images, we use different types of multiple-classifier approach (MCA). The first type consists of a single classifier with different parameter settings; here we use an SVM classifier with different kernels: linear, polynomial, and RBF. The second MCA consists of 3 different independent classifiers: SVM, Decision Tree (DT), and K-Nearest Neighbors (KNNs). The last MCA consists of 5 different independent classifiers: SVM, DT, Naive Bayes (NB), KNNs, and Random Forest (RF) [20,21]. In all MCA architectures, the class labels predicted by the independent classifiers are combined by majority voting to classify the WBCs of an image and to count the lymphoblasts in that image.
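The majority-voting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the per-classifier predictions are hypothetical hard labels for the cells of one image, and the label names `lymphoblast` and `normal` are placeholders. With an odd number of voters and two classes, ties cannot occur.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per cell.

    predictions: list of lists, one inner list per classifier,
    all of the same length (one label per cell).
    """
    voted = []
    for labels in zip(*predictions):
        # Most frequent label among the classifiers for this cell.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def count_lymphoblasts(predictions, positive="lymphoblast"):
    """Number of cells whose majority label is the positive class."""
    return sum(1 for label in majority_vote(predictions) if label == positive)


# Hypothetical outputs of three classifiers for four cells of one image.
svm_pred = ["lymphoblast", "normal", "lymphoblast", "normal"]
dt_pred = ["lymphoblast", "lymphoblast", "normal", "normal"]
knn_pred = ["normal", "lymphoblast", "lymphoblast", "normal"]

print(majority_vote([svm_pred, dt_pred, knn_pred]))
print(count_lymphoblasts([svm_pred, dt_pred, knn_pred]))
```

Each of the first three cells is labeled a lymphoblast by two of the three voters, so the vote keeps it even though no single classifier flags all three.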

5 Experimental Results

5.1 WBCs Counting Performance

To present the WBC counting performance with an appropriate and fair comparison, we follow the same testing strategy as in [4]. Thirty-three images are selected from the test set and subdivided into 11 sets. These images contain 267 WBCs. For each image, the ground truth from manual counting is compared with the output of the proposed WBC counting sub-system. As can be observed from Table 1, 260 of the 267 WBCs are identified properly by the proposed approach, an accuracy of 97.38%, which outperforms the results of [4]. Moreover, on sets 6 to 11 the proposed approach gives the same counts as [4].

Table 1. Performance of the proposed automated WBCs counting system. The last column gives the improvement in WBC counting, in percent, of the proposed approach over [4].

Set   Manual counting   Auto count [4]   Proposed method   Improvement over [4]
1     31                26               29                10%
2     49                41               47                12%
3     31                30               31                3%
4     39                36               38                5%
5     32                27               30                9%
6     27                27               27                0%
7     17                17               17                0%
8     15                15               15                0%
9     11                11               11                0%
10    9                 9                9                 0%
11    6                 6                6                 0%
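The figures in Table 1 can be checked with a short calculation. The sketch below assumes that the accuracy is the ratio of the automatic count to the manual count, and that the improvement column is the difference between the two automatic counts divided by the manual count; the second formula is inferred from the numbers and is not stated in the text.

```python
# Counts per set, copied from Table 1.
manual = [31, 49, 31, 39, 32, 27, 17, 15, 11, 9, 6]
reference_4 = [26, 41, 30, 36, 27, 27, 17, 15, 11, 9, 6]
proposed = [29, 47, 31, 38, 30, 27, 17, 15, 11, 9, 6]

# Overall counting accuracy of the proposed method, in percent.
accuracy = 100.0 * sum(proposed) / sum(manual)

# Assumed definition of the improvement column (inferred, not from the paper):
# gain in counted cells relative to the manual count, rounded to whole percent.
improvement = [round(100.0 * (p - a) / m)
               for m, a, p in zip(manual, reference_4, proposed)]

print(sum(manual), sum(proposed), round(accuracy, 2))
print(improvement)
```

Under these assumptions the computed values agree with the 267 cells, the 260 detections, the 97.38% accuracy, and the per-set improvements listed in the table.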

5.2 Lymphoblasts Counting Performance

For detecting lymphoblasts, the significance of the extracted features is evaluated using mutual information. The top 43 features of the ranked list, out of the 52 extracted features, are found to be informative and are used as the discriminative feature set. In all MCA architectures, the hyperparameters of each independent classifier are chosen experimentally on the validation set. To evaluate the effectiveness of our model, we divide the test images into 5 sets. For each set, we determine the average test accuracy by averaging the accuracies obtained on the images belonging to that set. We also calculate the average True Positive Rate (TPR), True Negative Rate (TNR), and Positive Predictive Value (PPV) for that set.

Table 2 shows the results of majority voting over SVM classifiers with different kernels: linear, polynomial, and RBF. The overall average test set accuracy is 89.34%, and the overall performance for counting lymphoblasts (TPR) using the proposed method is 98.33%. However, the overall rate of lymphocytes misclassified as lymphoblasts (the false positive rate), which affects the correct counting of lymphoblasts, reaches 29%.

Table 2. The experimental results using MCA of SVM classifier with different kernels.

Set    Manual counting   Proposed method   TP    FN   FP   Average test    TPR      TNR      PPV
       of lymphoblasts   counting                          set accuracy
1      47                52                47    0    5    91.94%          100%     66.67%   90.38%
2      8                 11                8     0    3    85.71%          100%     76.92%   72.73%
3      45                47                42    3    5    85.71%          93.33%   54.55%   89.36%
4      60                63                59    1    4    93.33%          98.33%   73.33%   93.65%
5      9                 11                9     0    2    90%             100%     81.82%   81.82%
Total  169               184               165   4    19   89.34%          98.33%   70.66%   85.59%
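The TPR and PPV columns of Table 2 follow directly from the TP, FN, and FP counts, with TPR = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below recomputes them per set and then averages over the five sets; that the Total row reports such per-set averages rather than pooled counts is an inference from the numbers, not something the text states. TNR is not recomputed because the true negative counts are not listed in the table.

```python
# (TP, FN, FP) for sets 1 to 5, copied from Table 2.
table2 = [(47, 0, 5), (8, 0, 3), (42, 3, 5), (59, 1, 4), (9, 0, 2)]


def tpr(tp, fn):
    """True positive rate (sensitivity) in percent."""
    return 100.0 * tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value (precision) in percent."""
    return 100.0 * tp / (tp + fp)


per_set_tpr = [tpr(tp, fn) for tp, fn, fp in table2]
per_set_ppv = [ppv(tp, fp) for tp, fn, fp in table2]

# Average of the per-set rates (assumed to be what the Total row reports).
mean_tpr = sum(per_set_tpr) / len(per_set_tpr)
mean_ppv = sum(per_set_ppv) / len(per_set_ppv)

print([round(v, 2) for v in per_set_tpr], round(mean_tpr, 2))
print([round(v, 2) for v in per_set_ppv], round(mean_ppv, 2))
```

Pooling the counts instead (165 true positives out of 169 lymphoblasts) would give a slightly different TPR, so the averaging convention should be kept in mind when comparing these tables with other work.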

The performance of our system using majority voting over 3 different classifiers, SVM (RBF kernel), DT, and KNNs (k = 5), is presented in Table 3. The overall average test set accuracy is 96.75%, and the overall TPR is 97.6%. Moreover, in some sets, such as 3 and 5, the proposed system counts the lymphoblasts correctly, which has a favorable effect on the overall false positive rate, reducing it to 11%.

Table 3. The experimental results using MCA of 3 different classifiers: SVM, DT, and KNNs.

Set    Manual counting   Proposed method   TP    FN   FP   Average test    TPR     TNR     PPV
       of lymphoblasts   counting                          set accuracy
1      47                50                47    0    3    95.16%          100%    80%     94%
2      8                 7                 7     1    0    95.24%          88%     100%    100%
3      45                45                45    0    0    100%            100%    100%    100%
4      60                65                60    0    5    93.33%          100%    67%     92%
5      9                 9                 9     0    0    100%            100%    100%    100%
Total  169               176               168   1    8    96.75%          97.6%   89.4%   97.2%

The proposed computer-aided system for counting lymphoblasts using majority voting over 5 different classifiers, SVM (RBF kernel), DT, NB, KNNs (k = 5), and RF, shows a clear increase in the overall average test set accuracy, which reaches 98.67%. As can be seen from Table 4, the lymphoblast count produced by the proposed approach matches the manual count by the hematologists in most sets. The overall false positive rate is only 7%, and the overall TPR is 100%.

Table 4. The experimental results using MCA of 5 different classifiers: SVM, DT, NB, KNNs, and RF.

Set    Manual counting   Proposed method   TP    FN   FP   Average test    TPR     TNR     PPV
       of lymphoblasts   counting                          set accuracy
1      47                47                47    0    0    100%            100%    100%    100%
2      8                 8                 8     0    0    100%            100%    100%    100%
3      45                45                45    0    0    100%            100%    100%    100%
4      60                65                60    0    5    93.33%          100%    67%     92%
5      9                 9                 9     0    0    100%            100%    100%    100%
Total  169               174               169   0    5    98.67%          100%    93.4%   98.4%

The experiments show that the architecture consisting of 5 different classifiers achieves the best performance for counting lymphoblasts. It achieves the lowest average error rate, 1.33%, while the overall average error rates for the MCA of 3 different classifiers and for the SVM classifier with different kernels are 3.25% and 10.66%, respectively. Moreover, comparing the classification models by the area under the Receiver Operating Characteristic (ROC) curve (AUC), Fig. 5 shows that the area for the MCA of 5 different independent classifiers is larger than for any other MCA architecture considered in this study.

Fig. 5. AUC for all different MCA.

6 Conclusions

We have introduced an innovative counting-based framework consisting of two sub-systems whose outputs can be used as indicators for detecting patients who may suffer from ALL. Given a microscopic blood image as input, the framework outputs the number of WBCs and the number of lymphoblasts. The first sub-system counts WBCs, so its results can support medical systems such as haematology counters. The second sub-system addresses the detection of abnormalities in WBCs; an advantage of this sub-system is that it overcomes major limitations of automated haematology counters. To the best of our knowledge, this is the first study of its kind for counting lymphoblasts. The proposed counting-based framework seems quite promising, as it can be used in medical laboratories to aid hematologists in their diagnosis of ALL and to make their decisions more precise and objective. In future work we plan to develop an automated prognostic system for subclassification of ALL based on the French-American-British (FAB) and/or World Health Organization (WHO) classification systems.

Acknowledgments. This research was supported by the Natural Sciences and Engineering Research Council of Canada.

References

1. Inaba, H., Greaves, M., Mullighan, C.: Acute lymphoblastic leukaemia. Lancet 381(9881), 1943–1955 (2013)
2. Briggs, C., Longair, I., Slavik, M., Thwaite, K., Mills, R., Thavaraja, V., Foster, A., Romannin, D., Machin, S.: Can automated blood film analysis replace the manual differential? An evaluation of the CellaVision DM96 automated image analysis system. Lab. Hematol. 31(1), 48–60 (2009)
3. Le, D., Bui, A., Yu, Z., Bui, F.: An automated framework for counting lymphocytes from microscopic images. In: Computing and Communication (IEMCON), pp. 1–6. Vancouver (2015)
4. Putzu, L., Caocci, G., Di Ruberto, C.: Leucocyte classification for leukaemia detection using image processing techniques. Artif. Intell. Med. 62(3), 179–191 (2014)
5. Bhavnani, L., Jaliya, U., Joshi, M.: Segmentation and counting of WBCs and RBCs from microscopic blood sample images. Image, Graph. Signal Process. 8(11), 32 (2016)
6. Basima, C.T., Panicker, J.: Enhanced leucocyte classification for leukaemia detection. In: Information Science (ICIS), Kochi, pp. 65–71 (2016)
7. Alomari, Y., Abdullah, S., Azma, R., Omar, K.: Automatic detection and quantification of WBCs and RBCs using iterative structured circle detection algorithm. In: Computational and Mathematical Methods in Medicine 2014 (2014)
8. Di Ruberto, C., Loddo, A., Putzu, L.: A leukocytes count system from blood smear images. Mach. Vis. Appl. 27(8), 1151–1160 (2016)
9. Abd Halim, N., Mashor, M., Hassan, R.: Automatic blasts counting for acute leukemia based on blood samples. Res. Rev. Comput. Sci. (IJRRCS) 2(4), 971 (2011)
10. ALL-IDB Homepage. https://homes.di.unimi.it/scotti/all/. Accessed 10 May 2017
11. Duggal, R., Gupta, A., Gupta, R., Wadhwa, M., Ahuja, C.: Overlapping cell nuclei segmentation in microscopic images using deep belief networks. In: Computer Vision Graphics and Image Processing, Guwahati, p. 82 (2016)
12. Cseke, I.: A fast segmentation scheme for white blood cell images. In: Pattern Recognition, The Hague, pp. 530–533 (1992)
13. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
14. Mohapatra, S., Patra, D.: Automated cell nucleus segmentation and acute leukemia detection in blood microscopic images. In: Systems in Medicine and Biology (ICSMB), Kharagpur, vol. 62, no. 3, pp. 49–54 (2010)
15. Liu, L., Lao, S., Fieguth, P., Guo, Y., Wang, X., Pietikäinen, M.: Median robust extended local binary pattern for texture classification. IEEE Trans. Image Process. 25(3), 1368–1381 (2016)
16. Haralick, R., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. 6, 610–621 (1973)
17. Busch, A., Boles, W.: Texture classification using multiple wavelet analysis. In: Digital Image Computing Techniques and Applications, pp. 341–345 (2002)
18. Smetana, K., Jirásková, I., Starý, J.: The number of nucleoli and main nucleolar types in lymphoblasts of children suffering from acute lymphoid leukemia. Hematol. J. 4(3), 231–236 (1999)
19. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
20. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
21. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

Right Ventricle Segmentation in Cardiac MR Images Using U-Net with Partly Dilated Convolution

Gregory Borodin and Olga Senyukova

Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, 2nd Education Building, GSP-1, Leninskie Gory, 119991 Moscow, Russian Federation

Abstract. Segmentation of anatomical structures in cardiac MR images is an important problem because it is necessary for evaluating the morphology of these structures for diagnostic purposes. An automatic segmentation algorithm with near-human accuracy would be extremely helpful for a medical specialist. In this paper we consider the endocardium and epicardium of the right ventricle. We compare the performance of leading existing neural networks, U-Net and GridNet, and propose our own modification of U-Net, in which every second convolution layer is replaced with a dilated (atrous) convolution layer. Evaluation on the benchmark dataset RVSC demonstrated that the proposed algorithm improves the segmentation accuracy by up to 6% for both endocardium and epicardium compared to the original U-Net. The algorithm also outperforms GridNet on both segmentation problems.

Keywords: Right ventricle segmentation · U-Net · Dilated convolution · Atrous convolution

1 Introduction

Morphological analysis of the right ventricle (RV) in cardiac magnetic resonance images (MRI) is necessary for diagnosing serious diseases such as coronary heart disease, congenital heart disease, and others. The greatest attention is paid to the myocardium, the layer located between the endocardium and the epicardium. It is therefore important to obtain an accurate delineation of the endocardial and epicardial contours. An automatic segmentation algorithm would significantly reduce the amount of routine work of a radiologist, allowing them to process more cases.

Several existing works are devoted to RV segmentation. Algorithms that do not use deep learning, such as [1,2], provide rather good results. However, deep learning algorithms are generally considered to generalize better and to be less prone to overfitting on a particular dataset, since they learn the best features on their own and do not need expert knowledge. The work [3] describes a combination of a deep convolutional neural network (CNN) and regression forests for RV volume prediction. The authors of [4] propose a two-stage solution: one deep CNN for localization of a region containing the RV, and another CNN for RV segmentation. The only automatic one-stage algorithm for segmentation of the right and left ventricle endocardium and epicardium that uses a single deep CNN is [5].

Among CNN architectures used for other medical image analysis problems, the best known, according to [7], is U-Net [8]. This network is a modification of the fully convolutional network (FCN) [6]. It is also fully convolutional, and it is built from a convolution (downsampling) path and a deconvolution (upsampling) path. High-resolution output feature maps from the convolution path are combined with the upsampled feature maps from the opposite block in order to localize objects better. The large number of features in the upsampling part makes it almost symmetric to the downsampling part and yields the U shape. This allows the network to propagate context information to higher-resolution layers. U-Net and its various modifications have already been applied to many medical image analysis problems, including left ventricle (LV) segmentation.

The GridNet architecture [9] is inspired by U-Net. Additional convolution blocks are added between each pair of opposing convolution and deconvolution blocks. There is also a convolution block for automatic estimation of the center of mass of the object of interest. The algorithm is evaluated on the Automated Cardiac Diagnostics Challenge (ACDC) dataset [13], with results presented for the RV, LV, and myocardium.

In this work we propose a U-Net modification that includes dilated convolution [12] layers. We neither introduce additional layers nor replace all the convolution layers with dilated convolutions; we only replace every second convolution layer in each block of the contracting path.

The rest of the paper is organized as follows. In Sect. 2 we describe the proposed CNN architecture in detail. In Sect. 3 we describe the experiments, the results of evaluating the proposed method, and its comparison with existing state-of-the-art methods. Conclusions are drawn in Sect. 4.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 179–185, 2018. https://doi.org/10.1007/978-3-030-01421-6_18

2 Method

2.1 Original U-Net

U-Net [8] consists of a contracting (convolution) path and an expansive (deconvolution) path. Each block of the contracting path consists of two convolution layers with kernel size 3 × 3, each followed by a rectified linear unit (ReLU). Each block is followed by a 2 × 2 max pooling operation, after which the number of feature channels is doubled. Each block of the expansive path consists of a 2 × 2 up-convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 × 3 convolution layers with ReLU. At the final layer, a 1 × 1 convolution is used to map each 64-component feature vector to the classes of the segmentation map.

2.2 Dilated Convolution

Dilated (atrous) convolution [12] is a new type of convolution that allows aggregation of multi-scale context. It has been successfully applied to different tasks [11]. Dilating a convolution kernel k of size M by a factor l means that the input image is sampled with stride l, as in (1):

\[ y[i, j] = \sum_{n=1}^{M} \sum_{m=1}^{M} x[i + l \cdot m,\; j + l \cdot n]\, k[m, n]. \tag{1} \]

This operation enlarges the field of view of the filter without losing image resolution (Fig. 1). When dilated layers with exponentially increasing dilation factors are stacked, the receptive field grows exponentially while the number of parameters grows only linearly [12].

Fig. 1. Illustration of convolution kernel dilated by factor 2.
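Equation (1) can be made concrete with a minimal pure-Python sketch. It is an illustration, not the implementation used in the paper: the function name is hypothetical, indices are zero-based rather than running from 1 to M, and no padding is applied, so the output only covers positions where the dilated kernel fits entirely inside the input.

```python
def dilated_conv2d(x, k, l=1):
    """Apply an M x M kernel k to a 2-D input x with dilation factor l.

    Follows y[i][j] = sum over m, n of x[i + l*m][j + l*n] * k[m][n],
    with zero-based m, n and no padding ("valid" output region).
    """
    M = len(k)
    span = l * (M - 1) + 1          # side of the region covered by the dilated kernel
    height, width = len(x), len(x[0])
    out_h, out_w = height - span + 1, width - span + 1
    y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            y[i][j] = sum(x[i + l * m][j + l * n] * k[m][n]
                          for m in range(M) for n in range(M))
    return y


# Hypothetical 5 x 5 input with entries 0..24 and a 3 x 3 kernel of ones.
image = [[5 * r + c for c in range(5)] for r in range(5)]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

print(dilated_conv2d(image, ones, l=1))  # ordinary 3 x 3 convolution, 3 x 3 output
print(dilated_conv2d(image, ones, l=2))  # dilated kernel spans the whole 5 x 5 input
```

With l = 2 the nine kernel taps land on rows and columns 0, 2, and 4, so a 3 × 3 kernel covers a 5 × 5 region while still using only nine weights.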

Setting the dilation factor l to 1 yields the traditional convolution.

2.3 The Proposed Architecture

The main idea of the proposed method is to replace every second convolution layer in the contracting path of U-Net with a dilated convolution layer with kernel size M = 3 and dilation factor l = 2 (Fig. 2). The receptive field of this layer is therefore 5 × 5. Keeping the first 3 × 3 convolution layer in each block of the contracting path makes it possible to take into account all the elements of the corresponding feature map, while the dilated convolution layer that follows it captures a larger context, which promotes correct inference. For this reason only one of the two convolution layers is replaced with a dilated convolution layer. The kernel size of 3 × 3 is the same as in the original U-Net; it was not increased, in order to keep the number of network parameters from growing quickly. The minimum dilation factor, 2, was chosen in order to keep as much information as possible. The image size does not change after convolution or dilated convolution, because the image is padded before the operation.
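The sizes quoted above can be verified with a few lines of arithmetic. The sketch below uses the standard relations for stride-1 convolutions; the helper names are hypothetical, and the 7 × 7 figure for a whole block is derived here from those relations rather than taken from the paper (pooling between blocks is ignored).

```python
def effective_kernel_size(kernel_size, dilation):
    """Side of the input region covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def same_padding(kernel_size, dilation):
    """Padding per side that keeps the feature map size unchanged (stride 1)."""
    return (effective_kernel_size(kernel_size, dilation) - 1) // 2


def stacked_receptive_field(layers):
    """Receptive field of consecutive stride-1 layers given as (kernel, dilation)."""
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


# Dilated layer of the proposed block: M = 3, l = 2.
print(effective_kernel_size(3, 2), same_padding(3, 2))

# Original U-Net block (two plain 3 x 3 layers) versus the modified block.
print(stacked_receptive_field([(3, 1), (3, 1)]))
print(stacked_receptive_field([(3, 1), (3, 2)]))
```

Under these assumptions the modified block sees a 7 × 7 neighborhood instead of 5 × 5 with the same number of weights, at the cost of 2 pixels of padding per side for the dilated layer.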


Fig. 2. The proposed CNN architecture. Convolution layers in U-Net replaced by dilated convolution layers are shown in yellow. (Color figure online)

3 Experimental Results and Discussion

3.1 Dataset

The proposed algorithm and the existing algorithms were evaluated on the Right Ventricle Segmentation Challenge (RVSC) dataset [10], provided as part of the MICCAI 2012 challenge on automated RV endocardium and epicardium segmentation from short-axis cine MRI. The dataset consists of images of 48 patients with various cardiac pathologies. The images are in DICOM format. The dataset is divided into three equal disjoint parts, one for training and the other two for testing. Manual expert contours for the endocardium and epicardium are provided only for the training images (16 cases). The images were preprocessed by mean-variance normalization (MVN). To artificially enlarge the training set, we used a data augmentation procedure involving image rescaling (4 scales), vertical and horizontal flipping, and rotations (10 angles).

3.2 Training and Evaluation

All the networks in our comparison were implemented in Python 3 using the Keras library [14]. They were trained with the same protocol, described below, and tested on the same data. Since expert labeling in RVSC is provided for only 16 patients, we used 12 of them for training and the other 4 for testing. We used 4-fold cross-validation and took the average of the four results as the segmentation result. The training protocol is the same as described in [5]. The learning algorithm is stochastic gradient descent with a momentum of 0.9. The dropout ratio is 0.5 and the L2 weight decay regularization is 0.0005. All the networks were trained for 10 epochs. The initial learning rate is base_lr = 0.01, and it is annealed according to the polynomial decay

\[ \mathrm{base\_lr} \times \left(1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}}\right)^{\mathrm{power}}, \tag{2} \]

where iter is the current iteration, max_iter is the maximum number of iterations, equal to 10 epochs, and power = 0.5 controls the rate of decay.

We reduced the problem of finding the endocardial or epicardial contour to the problem of finding the area enclosed by this contour. This makes it possible to use the Dice index [15] to evaluate the similarity (overlap) between the segmentation result X and the manual expert labeling Y (ground truth):

\[ D(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}. \tag{3} \]

3.3 Results

The results for segmentation of the RV endocardium and epicardium on the RVSC dataset for the original U-Net and the proposed algorithm (U-Net with dilated convolution layers) are provided in Table 1. We also compared the proposed algorithm with another U-Net modification, GridNet [9], which was also used for RV segmentation, but on a different dataset.

Table 1. Segmentation results (Dice index) on RVSC dataset.

Method                            Endocardium   Epicardium
GridNet (Zotti et al. 2017)       0.82          0.81
U-Net (Ronneberger et al. 2015)   0.79          0.77
Our method                        0.83          0.85
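The values in Table 1 are Dice overlap scores between a predicted region and the expert-labeled region. A minimal sketch of the computation on binary masks is given below, using the standard definition of the Dice index (twice the overlap divided by the sum of the two region sizes); the function name and the toy masks are hypothetical, and returning 1.0 for two empty masks is a convention chosen for this sketch.

```python
def dice_index(mask_a, mask_b):
    """Dice overlap of two binary masks given as equally sized nested lists."""
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}
    if not a and not b:
        return 1.0  # convention for this sketch: two empty regions agree fully
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical 3 x 3 masks: the prediction misses one pixel of the ground truth.
ground_truth = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
prediction = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]

print(dice_index(prediction, ground_truth))
```

Here the prediction covers 3 pixels, the ground truth 4, and they share 3, so the score is 6/7; identical masks score 1 and disjoint masks score 0.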

It can be seen that introducing dilated convolution into U-Net increases its accuracy by 6% for both endocardium and epicardium. The proposed algorithm is also more accurate than GridNet for both anatomical structures, by up to 3%. We also tried introducing dilated convolution into GridNet, but it did not improve the quality of segmentation. The author of [5], who proposed applying a fully convolutional network [6], also evaluated the algorithm on the RVSC dataset, but used all 16 cases for training and sent the predicted endocardial and epicardial contours for the unlabeled test sets to the challenge organizers for independent evaluation. The accuracy reported in that case is 80% for endocardium and 84% for epicardium. Our method appears to perform better for endocardium, since it demonstrated 85% accuracy after training on only 12 cases; however, an objective comparison would require a labeled test set. An example of the results is shown in Fig. 3.

Fig. 3. Results example. Epicardial (external) contour is shown in yellow. Endocardial (internal) contour is shown in red. (Color figure online)

Further comparison with more existing methods is warranted. In general, RV segmentation is a more challenging problem than LV segmentation because of the more complex shape of the RV across slices and phases. Therefore, the state-of-the-art accuracy of deep CNNs for this problem is still 80–85%, while LV segmentation accuracy is over 90%. Apical slices also make the segmentation process more difficult. Exploring dilated convolution for 3D networks, such as 3D U-Net [16], is part of future work.

4 Conclusion

In this work we proposed a modification of U-Net, one of the most widely used deep CNNs for medical image segmentation, and demonstrated that it significantly outperforms the original U-Net on the right ventricle endocardium and epicardium segmentation problem. Moreover, it outperforms another U-Net modification, GridNet, which contains more convolution blocks. The results are provided for real MR images from a benchmark dataset, which makes objective comparison with different algorithms possible. Although we managed to improve the segmentation accuracy for the RV, this is still an open problem and further research is warranted. The proposed CNN architecture can be used for other medical image analysis tasks.

Acknowledgments. The work was supported by the Grant of President of Russian Federation for young scientists No. MK-1896.2017.9 (contract No. 14.W01.17.1896MK).


References

1. Ringenberg, J., Deo, M., Devabhaktuni, V., et al.: Fast, accurate, and fully automatic segmentation of the right ventricle in short-axis cardiac MRI. Comput. Med. Imag. Grap. 38(3), 190–201 (2014)
2. Punithakumar, K., Noga, M., Ben Ayed, I., Boulanger, P.: Right ventricular segmentation in cardiac MRI with moving mesh correspondences. Comput. Med. Imag. Grap. 43, 15–25 (2015)
3. Zhen, X., Wang, Z., Islam, A., et al.: Multiscale deep networks and regression forests for direct biventricular volume estimation. Med. Image Anal. 30, 120–129 (2016)
4. Luo, G., An, R., Wang, K., et al.: A deep learning network for right ventricle segmentation in short-axis MRI. In: 2016 Computing in Cardiology Conference (CinC), pp. 485–488. IEEE Computer Society (2016)
5. Tran, P.V.: A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv preprint arXiv:1604.00494 (2016)
6. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 3431–3440. IEEE Computer Society (2015)
7. Litjens, G., Kooi, T., Ehteshami, B., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
8. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
9. Zotti, C., Luo, Z., Lalande, A., et al.: Novel deep convolution neural network applied to MRI cardiac segmentation. arXiv preprint arXiv:1705.08943 (2017)
10. Petitjean, C., Zuluaga, M.A., Bai, W., et al.: Right ventricle segmentation from cardiac MRI: a collation study. Med. Image Anal. 19(1), 187–202 (2015)
11. Chen, L.-C., Papandreou, G., Kokkinos, I., et al.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)
12. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016)
13. ACDC-MICCAI challenge. http://acdc.creatis.insa-lyon.fr. Accessed 10 July 2018
14. Keras: The Python Deep Learning library. https://keras.io. Accessed 10 July 2018
15. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
16. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49

Model Based on Support Vector Machine for the Estimation of the Heart Rate Variability

Catalina Maria Hernández-Ruiz¹, Sergio Andrés Villagrán Martínez¹, Johan Enrique Ortiz Guzmán², and Paulo Alonso Gaona Garcia¹

¹ Facultad de Ingeniería, Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
² Facultad de Medicina, Universidad del Rosario, Bogotá, Colombia

Abstract. This paper presents the design, implementation and analysis of a Machine Learning (ML) model for the estimation of Heart Rate Variability (HRV). Through the integration of Internet of Things devices and technologies, a support tool is proposed for people in health and sports areas who need to know an individual's HRV. The cardiac signals of the subjects were captured with chest straps and then classified by a Support Vector Machine algorithm that determined whether the HRV was depressed or increased. The proposed solution has an efficiency of 90.3% and is the initial component for the development of an application oriented to physical training that suggests exercise routines based on the HRV of the individual.

Keywords: Heart Rate Variability (HRV) · Internet of Things (IOT) · Support Vector Machine (SVM) · Heart Rate Monitor (HRM)

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 186–194, 2018. https://doi.org/10.1007/978-3-030-01421-6_19

1 Introduction

Heart rate variability (HRV) is the variation over time of the interval between consecutive heartbeats [1]. It is a useful tool to evaluate the control of the autonomic nervous system over the heart rate (HR), as shown by the changes in the balance between the sympathetic and parasympathetic systems. Obtaining the HRV does not require invasive processes, since it is carried out through the analysis of the electrical signals of the heart, reflecting the regularity of heartbeats [2]. Through the Internet of Things (IOT), it is possible to monitor and control a great diversity of systems using sets of sensors that facilitate the capture of data for further analysis and processing [3]. To obtain HR specifically, there are HR monitors (HRM), commonly used in medicine and sports sciences by doctors, athletes, coaches and researchers as a reliable and robust means of recording the activity of the heart [4]. Among these HRM, there are wristbands and wireless chest straps with


electrodes that connect to services and web/mobile applications in order to send the captured information. These applications also offer complementary information associated with the individual's statistics and profile, which is beneficial for physical training purposes [5].

HRV has gained relevance in recent decades due to its association with heart diagnosis. For this reason, several authors have developed tools for its analysis and use [1]. The most commonly used traditional methods for calculating HRV are time domain measurements, frequency domain measurements and non-linear methods [6]. Song et al. [7] claim that, for the analysis of HRV, these conventional practices have some limitations for prediction and diagnosis. For this reason, new techniques and mechanisms based on the usual mathematical models have emerged. When combined with computational systems they are more accurate in their calculation, as shown by Matta et al. [8], who applied neural networks to obtain HRV through the recognition and categorization of patterns.

From this perspective, the work presented below is a model based on Support Vector Machine (SVM) for the classification of HRV using low-cost equipment such as chest straps with HR sensors that monitor and record the activity of the heart. The aim is to provide a tool that gives any person, expert or not, the value of HRV in a practical and simple way, so that it can then be used to make health-related decisions.

The article is organized as follows: Sect. 2 provides context for the topic as well as related work and background information. Section 3 presents the methodology used. Section 4 describes the proposed model. Section 5 presents the results obtained. Section 6 presents the analysis and discussion of the results. Finally, Sect. 7 covers the conclusions and future work.

2 Related Works

The most widely used resource for the capture of HRV is the electrocardiogram (ECG), which registers the origin and propagation of electric potential through the cardiac muscle [9], and is the means by which the most information about the activity of the heart is obtained [1]. The ECG consists of waves, segments and intervals. The waves are deflections of the electrical activity, either positive (upward) or negative (downward) in relation to the baseline. The segments are the spaces lying between two consecutive waves, whereas the intervals are the periods resulting from the sum of a wave and a segment. Another determining element of the ECG is the QRS complex, which indicates the depolarization of the ventricular muscle. The time between heartbeats is determined by the interval between consecutive QRS complexes, more commonly known as the R-R interval [10].

HRV is a valuable tool to examine the sympathetic and parasympathetic functions of the autonomic nervous system and is inversely related to the regularity of the HR; that is to say, the higher the regularity, the lower the HRV, and vice versa. Additionally, it serves as a measure of the balance between sympathetic and parasympathetic mediators. The former reflect the effect of epinephrine and


norepinephrine, which sympathetic nerve fibers release on the sinoatrial and atrioventricular nodes and which increase the rate of cardiac contraction. The latter reflect the release of acetylcholine by parasympathetic nerve fibers, which decreases HR [11]. Sao et al. [12] state that the combination of the electrical signals of the heart and the HRV provides a good basis for the analysis of its state. According to Giles et al. [4], several clinical studies have found that a decrease in HRV is related to the diagnosis of cardiovascular diseases, diabetic neuropathy and hypertension, among others. These authors also claim that HRV serves as a measure in the sports environment under diverse conditions such as overtraining, recovery, endurance training and exercise.

Karim et al. [11] describe the calculation of heart rate variability using different methods. The time domain methods are among the best known and simplest to apply: the R-R intervals are identified on the ECG and used to generate the statistical metrics and indexes for calculating HRV. SDNN corresponds to the standard deviation of all the R-R intervals. RMSSD is the square root of the mean squared difference between successive heartbeats, and PNN50 is the number of successive intervals that differ by more than 50 ms, expressed as a percentage of the total number of heartbeats.

Other classic measurements to determine HRV are those of the frequency domain. McCraty et al. [6] state that heart rate oscillations are divided into 4 primary frequency bands: high frequency (HF), low frequency (LF), very low frequency (VLF) and ultra-low frequency (ULF). The first two are vital for the present study since they are directly related to the HRV. The HF band goes from 0.15 Hz to 0.4 Hz, which is equivalent to rhythms with periods between 2.5 and 7 s, whereas the LF band lies between 0.04 Hz and 0.15 Hz, which corresponds to periods between 7 and 25 s. The HF band reflects parasympathetic or vagal activity and is also called the respiratory band because it responds to the variations of the HR that occur during the respiratory cycle. The LF band, on the other hand, shows the sympathetic activity of the system. The HR is regulated by the balance between the actions of the sympathetic and the parasympathetic nervous systems, so it is vital to know the HF and LF bands to determine the HRV.

Among the non-linear methods there is the Poincaré plot, a non-linear visual technique for examining the behavior of the R-R intervals through the classification of the shapes of the plot. Its analysis and recognition make it possible to identify degrees of heart failure. This differentiation can be done through the calculation of the standard deviations SD1 and SD2, which are related to HRV [12].

To classify HRV, multiple authors have resorted to fields and techniques derived from artificial intelligence, such as fuzzy logic, neural networks and ML, among others. Patel et al. [13] designed a neural network for the early detection of fatigue in people who drive for long periods of time. They not only warned about the lethargy that seriously affects the performance of drivers but also claimed that it could be a very common cause of accidents. Through the classification of time and frequency domain measurements of HRV, they were able to quantify somnolence with an accuracy of 90%, for which they distinguished the levels of sympathetic (LF) and parasympathetic (HF) activity of the organism. This HRV-based technique of fatigue detection was recommended as a countermeasure for fatigue.


Asl et al. [14] applied SVM for the identification of 6 different types of arrhythmias: normal sinus rhythm, premature ventricular contraction, atrial fibrillation, sick sinus syndrome, ventricular fibrillation and heart block. They did this by classifying 15 characteristics of the HRV calculated through linear and non-linear methods. The accuracy of this algorithm for each case was greater than 98%. Liu et al. [15], on the other hand, classified the combination of cardiac variability and complexity to determine which patients required lifesaving interventions. They captured information from 104 patients using wireless vital signs monitoring systems, from which they obtained heart rate data. They applied classification techniques such as neural networks and multivariable logistic regression, which were evaluated and compared by statistical analysis. The conclusions indicated that the multilayer perceptron (MLP) neural network was more efficient and effective than the logistic regression algorithm in classifying the patients who needed a rescue measure.

Considering the aforementioned reference points, this study intends to classify HRV with an algorithm based on SVM, as Song et al. [7] did. Those authors applied the same technique to analyze and identify patients who had suffered acute myocardial infarction, based on the fact that a decrease in HRV was associated with a potential risk of ventricular arrhythmias for patients who had had such episodes. The aim of this work is to develop a tool which can support decision-making strategies in the areas of health and physical training. In view of the above, it is important to consider that classification is a problem which may be solved through ML, in which the samples of a data set may belong to one, two or more classes.

The study included the design and implementation of the proposed algorithm, following the work methodology described in the next section.
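The time-domain indices reviewed in this section (SDNN, RMSSD and PNN50) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `time_domain_hrv`, the use of the sample standard deviation for SDNN, the choice of the number of successive differences as the denominator of the percentage, and the R-R values themselves are assumptions made for illustration.

```python
import math
import statistics


def time_domain_hrv(rr_ms, threshold_ms=50.0):
    """Return SDNN, RMSSD, NNxx and pNNxx for a list of R-R intervals in ms.

    Hypothetical helper: the threshold defaults to 50 ms (NN50 / pNN50).
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three R-R intervals")
    # Differences between successive R-R intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    # SDNN: standard deviation of all intervals (sample version, an assumption).
    sdnn = statistics.stdev(rr_ms)
    # RMSSD: square root of the mean squared successive difference.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # NNxx: count of successive differences larger than the threshold.
    nnxx = sum(1 for d in diffs if abs(d) > threshold_ms)
    # pNNxx: the same count as a percentage of the successive differences.
    pnnxx = 100.0 * nnxx / len(diffs)
    return {"sdnn": sdnn, "rmssd": rmssd, "nnxx": nnxx, "pnnxx": pnnxx}


# Made-up R-R intervals in milliseconds, not data from the study.
rr = [800.0, 810.0, 790.0, 850.0, 780.0, 800.0]
metrics = time_domain_hrv(rr)
print(metrics)
```

With a different `threshold_ms` the same function yields the generic nnxx and pnnxx variables for other thresholds.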

3 Work Methodology

The working method used to carry out the study was quasi-experimental and applied. Figure 1 shows the series of phases that define it and that served as a road map for the study.

Fig. 1. Work phases used for study.


The first phase involved the search for and analysis of literature on conventional techniques for the calculation of HRV; from these, specific methods were identified and explored in Phase 2. In Phase 3, the process for capturing cardiac signals through IoT devices was defined and a strategy for transferring the collected data was established. During Phase 4, a method based on SVM was implemented to classify HRV, which was applied through a case study in Phase 5. The results were obtained and analyzed in Phase 6, where the efficiency of the algorithm was determined.

3.1 Case Study

The case study included the capture of cardiac signals from a group of individuals through chest straps that obtained the HR value. Table 1 presents the characteristics of the strap used [16].

Table 1. Characteristics of the Polar H10 chest strap.

Polar H10 heart rate sensor
Battery type: CR 2025
Battery sealing ring: O-ring 20.0 x 0.90, material silicone
Battery lifetime: 40 h
Operating temperature: −10 °C to +50 °C / 14 °F to 122 °F
Connector material: ABS, ABS + GF, PC, stainless steel
Strap material: 38% polyamide, 29% polyurethane, 20% elastane, 13% polyester, silicone impressions

These non-invasive records were made on 33 people whose HR was recorded for 12 min. In total, 56 records constituted the training set that served as the input for the learning of the ML algorithm. The age of the individuals ranged between 25 and 35 years; most were healthy, with a few exceptions such as thyroid dysfunction and hypertension. The numbers of women and men were similar. No data was taken from children because their nervous system has not yet fully matured as it has in adults. During each session, the person was required to remain at rest for approximately 12 min, sitting without speaking and minimizing movements. In addition to HR, other information was recorded such as age, weight, height, gender, pre-existing diseases and the use of regular medications or treatments. These cardiac records provided the information needed to feed the ML algorithm, whose model is described in the following section.

4 Proposed Working Model

The model has two main components, shown in Fig. 2. The first is the IoT system, which defines the capture and provision of the information; its output is the input for the second component, the HRV classification system, which takes the data and processes it by classifying the HRV as depressed or increased. The IoT system used chest straps to record the HR, which was transmitted through a mobile application that communicated with the sensor via Bluetooth.

Fig. 2. Proposed working model.

Pitale et al. [1] describe two steps for the implementation of classification algorithms: the definition of the model, and the selection and application of a method to classify it. For our study, the first step included processing the information given by the IoT system to obtain the inputs of the classification algorithm, which were diverse variables from the time domain, the frequency domain and non-linear methods. The time domain variables were nnxx, the number of successive R-R intervals that differ by more than xx milliseconds, and pnnxx, the corresponding percentage [17]. In the frequency domain, HF and LF were taken because of their direct relationship with the activity of the sympathetic and parasympathetic systems of the organism [6]. Finally, variables from non-linear methods were analyzed: SD1 and SD2, the standard deviations of the Poincaré plot perpendicular to and along the identity line respectively [18], and alpha1 and alpha2, the short- and long-term fluctuations of the detrended fluctuation analysis [19]. The expected results were a reduced or increased HRV as explained by the Task Force [20].

The classification technique chosen was SVM, due to its efficiency and reliability as described in Sect. 2. Song et al. [7] state that SVMs are supervised learning models used in regression and classification problems because they are based on data analysis and pattern recognition: they generate hyperplanes in an n-dimensional space to distinguish and separate various sets of characteristics and select the optimal one. The algorithm was trained with the variables generated from the 56 records obtained with the chest strap; the results of its application are described in the following section.
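As an illustration of the non-linear inputs described in this section, the following Python sketch computes SD1 and SD2 from a series of R-R intervals by rotating the Poincaré plot by 45 degrees. It is a sketch under stated assumptions, not the authors' processing chain: the function name `poincare_sd`, the use of the sample standard deviation and the R-R values are hypothetical.

```python
import math
import statistics


def poincare_sd(rr_ms):
    """Return (SD1, SD2) of the Poincaré plot of successive R-R intervals.

    Each point of the plot is (RR[n], RR[n+1]). SD1 is the spread
    perpendicular to the identity line, SD2 the spread along it.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three R-R intervals")
    pairs = list(zip(rr_ms[:-1], rr_ms[1:]))
    # Coordinate of each point perpendicular to the identity line.
    perpendicular = [(b - a) / math.sqrt(2) for a, b in pairs]
    # Coordinate of each point along the identity line.
    along = [(b + a) / math.sqrt(2) for a, b in pairs]
    sd1 = statistics.stdev(perpendicular)
    sd2 = statistics.stdev(along)
    return sd1, sd2


# Made-up R-R intervals in milliseconds, not data from the study.
rr = [800.0, 810.0, 790.0, 850.0, 780.0, 800.0]
sd1, sd2 = poincare_sd(rr)
print(sd1, sd2)
```

In a full pipeline these two values would sit next to nnxx, pnnxx, HF, LF, alpha1 and alpha2 in the feature vector given to the classifier; how the authors computed them is not stated in the text.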

5 Results Obtained

Multiple combinations of inputs were applied for the algorithm training with the purpose of obtaining the best model for the HRV classification. Zhao et al. [21] describe the Matlab multiclass classification function fitcecoc, which was used in the present study with a linear kernel; its parameters were tuned using automatic hyperparameter optimization. The evaluation was based on two types of error: the classification error on the training sample, and the error obtained from cross-validation. He [22] states that the cross-validation technique divides the training data into several non-overlapping parts of similar length. Each part is selected in turn as test data, while the rest are used for training. The prediction model is then applied to these data, and the process is repeated for each of the divisions. All predictions are averaged to give an estimate of the performance of the algorithm.

As a first result, the most efficient set of inputs was HF, alpha1, alpha2 and nnxx, with a sample classification error of 8.9% and a cross-validation error of 9.7%. The behavior of the algorithm with this configuration is presented in Fig. 3: the optimization function compares the expected behavior with the real one and decreases the cross-validation error, reaching an effectiveness of 90.3%.
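The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration of the splitting and averaging logic: the study used Matlab's fitcecoc with a linear SVM, whereas the sketch uses a nearest-centroid classifier as a simple stand-in, and all function names and data values are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint folds of similar length."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid_fit(X, y):
    """Stand-in classifier (not an SVM): mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def nearest_centroid_predict(model, row):
    """Return the class whose centroid is closest to the given row."""
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(row, model[c]))
    return min(model, key=squared_distance)


def cross_val_error(X, y, k=5, seed=0):
    """Average misclassification rate over the k held-out folds."""
    errors = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        wrong = sum(nearest_centroid_predict(model, X[i]) != y[i]
                    for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / len(errors)


# Made-up, well separated two-feature data; not the records of the study.
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 1.0] for i in range(10)]
y = ["depressed"] * 10 + ["increased"] * 10
folds = k_fold_indices(len(X), 5)
cv_error = cross_val_error(X, y, k=5)
print(cv_error)
```

Comparing `cv_error` with the error measured on the full training set is what exposes overfitting: a model that fits the training data perfectly can still have a large held-out error.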

Fig. 3. Optimization of the proposed classiﬁcation model.

6 Analysis of Results and Discussions

During the algorithm tests, multiple cases with poor behavior were observed. Using frequency domain variables only (HF and LF) did not give a satisfactory classification rate: it produced an error of 19.6%. Likewise, including all 8 inputs in the model caused overfitting. The same was observed when the kernel of the algorithm was changed to Gaussian: the model fitted the training set perfectly, with a sample classification error of 0%, but the cross-validation error was greater than 30%. This situation arose because the training set was very small relative to the number of features or inputs, a very common difficulty for classification algorithms with few data. The most efficient set mixes the three families of methods that generate variables for the HRV calculation (time domain, frequency domain and non-linear methods), which points to a complementary behavior of these variables in obtaining HRV.


7 Conclusions

One of the main advantages presented in this study is the low cost of acquiring the cardiac record. The use of chest straps is a non-invasive method that does not generate any secondary effects on the individual and has no environmental requirements, so it can be applied to any person doing any activity. Their use is recommended in conjunction with applications that make the recorded data available, because they have shown high reliability when evaluated. The integration and combination of variables from the time and frequency domains and non-linear methods is a viable and effective alternative for the classification of HRV. The proposed solution is suggested as a useful and practical tool for people who need to know the HRV, since it is a health indicator and is related to various deficiencies and diseases, as described in Sect. 2.

As future work and continuation of this study we propose to improve the proposed model, increasing its efficiency by enriching the training set and thus providing the algorithm with more experience for its learning. We also want to use this solution as a component of an application for physical training that supports athletes and personal trainers by suggesting exercise routines according to the athlete's physical condition, tracking their HRV, analyzing their progress and history, and using GPS to record changes of altitude and length of routes.

References

1. Pitale, R., Tajane, K., Umale, J.: Heart rate variability classification and feature extraction using support vector machine and PCA: an overview. J. Eng. Res. Appl. 4, 381–384 (2014)
2. Borchini, R., Veronesi, G., Bonzini, M., Gianfagna, F., Dashi, O., Ferrario, M.: Heart rate variability frequency domain alterations among healthy nurses exposed to prolonged work stress. Int. J. Environ. Res. Public Health 15, 113 (2018)
3. Hernández, C., Villagrán, S., Gaona, P.: Predictive model for detecting MQ2 gases using fuzzy logic on IoT devices. In: Jayne, C., Iliadis, L. (eds.) EANN 2016. CCIS, vol. 629, pp. 176–185. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44188-7_13
4. Giles, D., Draper, N., Neil, W.: Validity of the Polar V800 heart rate monitor to measure RR intervals at rest. Eur. J. Appl. Physiol. 116, 563–571 (2015)
5. Erkkila, M., Rae, R., Thurlin, T., Korva, T., Manninen, T.: Managing physiological exercise data. US Patent 9855463B2, 16 January 2014
6. McCraty, R., Shaffer, F.: Heart rate variability: new perspectives on physiological mechanisms, assessment of self-regulatory capacity, and health risk. Glob. Adv. Health Med. Improv. Healthc. Outcomes Worldw. 4, 46–61 (2015)
7. Song, M., Lee, J., Cho, S., Lee, K., Yoo, S.: Support vector machine based arrhythmia classification using reduced features. Int. J. Control Autom. Syst. 3, 571–579 (2005)
8. Matta, S., Sankari, Z., Rihana, S.: Heart rate variability analysis using neural network models for automatic detection of lifestyle activities. Biomed. Signal Process. Control 42, 145–157 (2018)
9. Lewis, M.C., Maiya, M., Sampathila, N.: A novel method for the conversion of scanned electrocardiogram (ECG) image to digital signal. In: Dash, S.S., Das, S., Panigrahi, B.K. (eds.) International Conference on Intelligent Computing and Applications. AISC, vol. 632, pp. 363–373. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-5520-1_34
10. Barrett, K., Brooks, H., Boitano, S., Barman, S.: Ganong's Review of Medical Physiology, 23rd edn. McGraw Hill Education, New York (2016)
11. Karim, N., Hasan, J., Ali, S.: Heart rate variability - a review. J. Basic Appl. Sci. 7, 71–77 (2011)
12. Sao, P., Hegadi, R., Karmakar, S.: ECG signal analysis using artificial neural network. Int. J. Sci. Res. (IJSR), 82–86 (2015)
13. Patel, M., Lal, S.K.L., Kavanagh, D., Rossiter, P.: Applying neural network analysis on heart rate variability data to assess driver fatigue. Expert. Syst. Appl. Int. J. 38, 7235–7242 (2011)
14. Asl, B., Setarehdan, S., Mohebbi, M.: Support vector machine-based arrhythmia classification using reduced features of heart rate variability signal. Artif. Intell. Med. 44, 51–64 (2008)
15. Liu, N., Holcomb, J., Wade, C., Darrah, M., Salinas, J.: Utility of vital signs, heart rate variability and complexity, and machine learning for identifying the need for lifesaving interventions in trauma patients. Shock (Augusta, GA) 42, 108–114 (2014)
16. Polar: Technical specifications. Polar H10 Heart Rate Sensor. https://support.polar.com/e_manuals/H10_HR_sensor/Polar_H10_user_manual_English/Content/TechnicalSpecifications.htm. Accessed 26 May 2018
17. Gimeno-Blanes, F.J., Rojo-Álvarez, J.L., Caamaño, A.J., Flores-Yepes, J.A., García-Alberola, A.: On the feasibility of tilt test outcome early prediction using ECG and pressure parameters. EURASIP J. Adv. Signal Process. 33 (2011)
18. Mirescu, S., Harden, S.: Nonlinear dynamics methods for assessing heart rate variability in patients with recent myocardial infarction. Rom. J. Biophys. 22, 117–124 (2016)
19. Mazzuco, A., et al.: Relationship between linear and nonlinear dynamics of heart rate and impairment of lung function in COPD patients. Int. J. Chronic Obstr. Pulm. Dis. 10, 1651–1661 (2015)
20. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology: Heart rate variability: standards of measurement, physiological interpretation and clinical use. Eur. Heart J. 17, 354–381 (1996)
21. Zhao, J., Mucaki, E., Rogan, P.: Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning. F1000Research 7, 233 (2018)
22. He, Z.: 4 - Phosphorylation site prediction. In: Data Mining for Bioinformatics Applications, pp. 29–37 (2015)

High-Resolution Generative Adversarial Neural Networks Applied to Histological Images Generation

Antoni Mauricio¹, Jorge López¹, Roger Huauya², and Jose Diaz²

¹ Research and Innovation Center in Computer Science, Universidad Católica San Pablo, Arequipa, Peru
{manasses.mauricio,jorge.lopez.caceres}@ucsp.edu.pe
² Artificial Intelligence, Image Processing and Robotic Lab, Department of Mechanical Engineering, Universidad Nacional de Ingeniería, Bldg. A - Off. A1-221, 210 Túpac Amaru Ave., Lima, Peru

Abstract. For many years, synthesizing photo-realistic images has been a highly relevant task due to its multiple applications, from aesthetic or artistic [19] to medical purposes [1, 6, 21]. In the medical area the impact is greater because most classification or diagnostic algorithms require a significant amount of highly specialized images for their training, yet obtaining them is not easy at all. To solve this problem, many works analyze and interpret images of a specific topic in order to obtain a statistical correlation between the variables that define it. In this way, any set of variables close to the map generated in the previous analysis represents a similar image. Deep learning based methods have allowed the automatic extraction of feature maps, which has helped in the design of more robust models for photo-realistic image synthesis. This work focuses on obtaining the best feature maps for automatic generation of synthetic histological images. To do so, we propose a Generative Adversarial Network (GAN) [8] that generates the new sample distribution using the feature maps obtained by an autoencoder [14, 20] as latent space instead of a completely random one. To corroborate our results, we present the generated images against the real ones, together with the results obtained using different types of autoencoder to obtain the feature maps.

Keywords: Generative Adversarial Nets · Histological images · High-resolution generated images

The present work was supported by grant 234-2015-FONDECYT (Master Program) from Cienciactiva of the National Council for Science, Technology and Technological Innovation (CONCYTEC-PERU), the Office of Research of Universidad Nacional de Ingeniería (VRI - UNI) and the research management office (OGI - UNI).

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 195–202, 2018. https://doi.org/10.1007/978-3-030-01421-6_20

1 Introduction

Since its conception, the focus of deep learning has been to design deep hierarchical architectures which extract the best feature maps to represent probability distributions over many kinds of data (images, audio, texts, etc.) [2]. This approach has been successful for applications related to discriminative models because feature maps are obtained to maximize the separation between labeled or segregated groups in high-dimensional space. Hence, feature map extraction is associated with the discrimination process instead of prioritizing a precise representation of the data [5,15]. On the other hand, deep generative models have had a high impact in the last few years, and several works [9,14,17,19,21] have overcome the most significant problems that affected them. Goodfellow et al. [8] proposed a generative model based on adversarial training, known as GAN, which sidesteps the approximation of intractable probabilistic computations arising in maximum likelihood strategies, and the difficulty of leveraging piecewise linear units in a generative context. GANs are currently among the hottest topics in deep learning, but synthesizing photo-realistic images is not an easy task. Images have spatial rather than sequential correspondence, so generation and continuity errors at edges are common. A GAN includes a discriminator D that competes against the generator G; ideally they tie or G wins, but in practice D usually wins, which implies that the feature maps obtained during generation are more linked to D than to G. To overcome this problem, several works have proposed improvements over the original pipeline, including regularization [16], re-defining the cost function [17] and setting a convenient latent space [11,13].

This work builds on the improvements proposed by several authors to address the common problems of GANs in the synthesis of photo-realistic images. Our proposal is to improve the quality of the generated images using a Teacher-Network based on autoencoders to obtain a suitable latent space. Finally, the results using pre-trained latent spaces are visualized in order to evaluate their relevance. We use histological images as the dataset because they are used as a reference for detection and diagnostic applications [1,10,12,18].

2 Proposed Approach

In the following, we describe the background techniques and methods, and provide further details on the proposed approach.

2.1 Generative Adversarial Networks

Generative adversarial networks [8] model a complex dataset as a resampling function: a generative network G is pitted against an adversary, a discriminative network D. The discriminator model, D(x), learns to determine whether a sample x came from G(z) or from the original training data, while the generator model, G(z), maps samples z from the prior P(z) to the data space and is trained to maximally confuse D(x) by using the gradient of D to modify its parameters. This interaction establishes a min-max adversarial game between G(z) and D(x). With V(D, G) as the value function, the solution to this game is expressed as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))] \qquad (1)$$

G and D alternate the SGD training in two stages: (1) train D to distinguish the true samples from the fake samples generated by G; (2) train G so as to fool D with its generated samples. In practice, Eq. 1 may not provide enough gradient for G to learn: at the beginning of the learning process G generates poor results and D rejects the samples G(z) with high confidence, since they are clearly fake.

2.2 Autoencoders

An autoencoder (AE) is an unsupervised neural network that learns the probability distribution of a dataset by setting the target values equal to the inputs. In other words, it tries to learn a function $F_{W,b}(x) \approx x$ that approximates the identity function. An autoencoder has two parts: an encoder network h = f(x) and a decoder network r = g(h). According to Goodfellow et al. [7], autoencoders learn to generate compact representations and reconstruct their inputs well, but they are fairly limited for most important applications. The latent space of an autoencoder may not be continuous and does not allow easy interpolation, which is a serious problem considering that knowledge representation spaces normally have discontinuities. As with GANs, many variations of the original autoencoder architecture have been proposed. Doersch [5] presented variational autoencoders (VAEs) as an unsupervised learning solution for complicated distributions. VAEs work well for both feature extraction and generative modeling; their latent spaces are continuous, allowing easy random sampling and interpolation. Likewise, Makhzani et al. [15] proposed adversarial autoencoders (AAEs), probabilistic autoencoders that use GAN-style adversarial training to perform variational inference by matching the aggregated posterior of the encoded features with an arbitrary prior distribution. As Hitawala [9] mentions, AAEs are trained with a dual cost function: a reconstruction error criterion and an adversarial training criterion that matches the aggregated posterior distribution of the latent space to an arbitrary prior distribution.
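To make the encoder/decoder notation concrete, the following is a minimal sketch of an autoencoder with a linear encoder h = f(x) and a linear decoder r = g(h), trained by plain gradient descent on the squared reconstruction error. The function name, layer sizes, learning rate and toy data are illustrative assumptions; this is not the architecture or the PyTorch code used in this work.

```python
import numpy as np

def train_linear_autoencoder(X, n_hidden=2, lr=0.01, n_steps=500, seed=0):
    """Hypothetical helper: fit h = x @ W_enc and r = h @ W_dec to reconstruct X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, d))
    losses = []
    for _ in range(n_steps):
        H = X @ W_enc                      # encoder: h = f(x)
        R = H @ W_dec                      # decoder: r = g(h)
        err = R - X                        # target values equal the inputs
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        grad_R = 2.0 * err / n             # gradient of the mean squared error
        grad_W_dec = H.T @ grad_R
        grad_W_enc = X.T @ (grad_R @ W_dec.T)
        W_enc -= lr * grad_W_enc
        W_dec -= lr * grad_W_dec
    return W_enc, W_dec, losses

# Toy data (assumption): 5-dimensional samples lying on a 2-dimensional subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ (0.5 * rng.normal(size=(2, 5)))
W_enc, W_dec, losses = train_linear_autoencoder(X)
latent = X @ W_enc                         # compact 2-dimensional representation
```

The rows of `latent` play the role of the learned latent space; a VAE or AAE would additionally constrain how these codes are distributed.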

3 Related Studies

Synthesizing photo-realistic images has made it possible to explore new solutions based on computer-aided diagnosis (CAD) [1,3,6,21]. Calimeri et al. [3] apply a GAN to synthesize MRI images of brain slices, improving the visual resolution with a Laplacian pyramid in order to avoid contrast loss. Zhang et al. [21] combine a GAN with wide-field light microscopy to achieve deep learning super-resolution, and in this way synthesize many high-quality images. Tom and Sheet [18] proposed a stacked GAN for the fast simulation of patho-realistic ultrasound images, refining synthesized images from an initial simulation performed with a pseudo B-mode ultrasound image generator. On the other hand, Coates et al. [4] note that several simple factors, such as the number of hidden nodes in the model, may be more important for achieving high performance than the learning algorithm or the depth of the model. Feature learning is a specialized set of algorithms that prioritizes the descriptors or feature maps over the hierarchy or complexity of the learning model. Hitawala [9] compares different GAN-based models and improvements; adversarial autoencoders, in particular, show the impact of an adequate choice of latent space relative to other improvements made to the architecture. Considering feature maps as the latent space, Kumar et al. [13] mention that semi-supervised learning methods using GANs have shown promising empirical success. To do so, [13] uses the inverse mapping (the encoder), which improves the semantic similarity between the reconstructed sample and the input sample, and analyzes the relationship between the number of fake samples and the efficiency of semi-supervised learning with GANs.

4 Experimental Analysis

4.1 Dataset Description

The dataset consists of 670 segmented RGB nuclei images and their respective masks. The images were acquired for the Kaggle competition “Data Science Bowl 2018 - Find the nuclei in divergent images to advance medical discovery”¹ under a variety of conditions, varying in cell type, magnification, and imaging modality (brightfield vs. fluorescence) (Fig. 1).

Fig. 1. Original images from Data Science Bowl 2018 - “Find the nuclei in divergent images to advance medical discovery” hosted by kaggle

¹ https://www.kaggle.com/c/data-science-bowl-2018/data.

To enlarge the training set, we apply two classical data augmentation methods: each image is divided into 9 sub-images, and random rotations are applied.

4.2 Experiments

To run the experiments, we used a PC with a 3.6 GHz Intel Core i7 processor, 16 GB of 3000 MHz DDR4 memory and an NVIDIA GTX 1070 GPU; the implementation used PyTorch 0.4.0.

Framework. Our model transfers the feature maps obtained from an autoencoder to a GAN, where they serve as the latent space, in order to improve the resolution of the generated images. This requires parallel training: the autoencoder is trained to produce a feature map that represents the dataset as closely as possible, while the GAN specializes in generation. For a fast implementation, we used the PyTorch tutorials for autoencoders and GANs on the MNIST dataset as reference². To test our model and evaluate the impact of pre-trained feature maps, the synthetic images are processed by a separate pre-trained discriminator specialized in nuclei detection³. Table 1 shows the results (acceptance ratio ra of synthetic images) achieved by the pre-trained discriminator when a simple autoencoder (AE), a variational autoencoder (VAE) and the classic GAN model are used as the feature map generator. To decide whether a sample meets standards similar to those of the original images, we take into account how many nuclei it contains relative to the statistics of the original images, and how good it looks compared with the originals.

Table 1. Statistics of the generated groups of images with respect to the originals

Dataset    µ      ra
Original   7.20   -
GAN-AE     5.32   0.737
GAN-VAE    5.91   0.843
GAN        3.44   0.522

As Table 1 shows, the best statistics and acceptance ratio are obtained using a VAE as the feature map generator. Figs. 2 and 3 show the results visually for the classic GAN and the GAN-VAE model, respectively.

² https://github.com/MorvanZhou/PyTorch-Tutorial.
³ https://github.com/aksharkkumar/nuclei-detection.

Fig. 2. Synthetic results using a simple GAN architecture. Detected nuclei in generated images are inside white circles

Fig. 3. Synthetic results using pre-trained feature maps from a VAE as latent space. Detected nuclei in generated images are inside red circles (Color ﬁgure online)

5 Conclusions and Future Works

From the tests we carried out, we conclude that feature maps are essential to adequately describe a dataset, and that the level of detail of the description depends in turn on the cost functions that define the main task. For image synthesis, a considerable improvement in acceptance ratio (greater than 0.2) is observed when the feature map used as the latent space of the GAN model is defined properly. Beyond that point, the improvements become less noticeable for the VAE, but they leave open two direct lines of future work. First, improving the resolution of the synthetic images using the RS-GAN or LAP-GAN cost function. Second, exploring more deeply the usefulness of feature maps and evaluating their quality inside more complex learning structures.

References

1. Asperti, A., Mastronardo, C.: The effectiveness of data augmentation for detection of gastrointestinal diseases from endoscopical images. arXiv preprint arXiv:1712.03689 (2017)
2. Bengio, Y.: Learning deep architectures for AI. Found. Trends® Mach. Learn. 2(1), 1–127 (2009)
3. Calimeri, F., Marzullo, A., Stamile, C., Terracina, G.: Biomedical data augmentation using generative adversarial neural networks. In: Lintas, A., Rovetta, S., Verschure, P.F.M.J., Villa, A.E.P. (eds.) ICANN 2017. LNCS, vol. 10614, pp. 626–634. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68612-7_71
4. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)
5. Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
6. Eaton-Rosen, Z., Bragman, F., Ourselin, S., Cardoso, M.J.: Improving data augmentation for medical image segmentation (2018)
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)
8. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
9. Hitawala, S.: Comparative study on generative adversarial networks. arXiv preprint arXiv:1801.04271 (2018)
10. Hou, L., et al.: Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. arXiv preprint arXiv:1704.00406 (2017)
11. Kastaniotis, D., Ntinou, I., Tsourounis, D., Economou, G., Fotopoulos, S.: Attention-aware generative adversarial networks (ATA-GANs). arXiv preprint arXiv:1802.09070 (2018)
12. Komura, D., Ishikawa, S.: Machine learning methods for histopathological image analysis. Comput. Struct. Biotechnol. J. 16, 34–42 (2018)
13. Kumar, A., Sattigeri, P., Fletcher, T.: Semi-supervised learning with GANs: manifold invariance with improved inference. In: Advances in Neural Information Processing Systems, pp. 5534–5544 (2017)
14. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep Laplacian pyramid networks for fast and accurate superresolution. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, p. 5 (2017)
15. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
16. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
17. Song, J., Zhao, S., Ermon, S.: A-NICE-MC: adversarial training for MCMC. In: Advances in Neural Information Processing Systems, pp. 5140–5150 (2017)
18. Tom, F., Sheet, D.: Simulating patho-realistic ultrasound images using deep generative networks with adversarial learning. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1174–1177. IEEE (2018)
19. Van Den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with PixelCNN decoders. In: Advances in Neural Information Processing Systems, pp. 4790–4798 (2016)
20. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
21. Zhang, H., Xie, X., Fang, C., Yang, Y., Jin, D., Fei, P.: High-throughput, high-resolution generated adversarial network microscopy. arXiv preprint arXiv:1801.07330 (2018)

Kernel

Tensor Learning in Multi-view Kernel PCA

Lynn Houthuys(✉) and Johan A. K. Suykens

Department of Electrical Engineering ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
{lynn.houthuys,johan.suykens}@esat.kuleuven.be

Abstract. In many real-life applications data can be described through multiple representations, or views. Multi-view learning aims at combining the information from all views, in order to obtain a better performance. Most well-known multi-view methods optimize some form of correlation between two views, while in many applications there are three or more views available. This is usually tackled by optimizing the correlations pairwise. However, this ignores the higher-order correlations that could only be discovered when exploring all views simultaneously. This paper proposes novel multi-view Kernel PCA models. By introducing a model tensor, the proposed models aim to include the higher-order correlations between all views. The paper further explores the use of these models as multi-view dimensionality reduction techniques and shows experimental results on several real-life datasets. These experiments demonstrate the merit of the proposed methods.

Keywords: Kernel PCA · Multi-view learning · Tensor learning

1 Introduction

Principal component analysis (PCA) [12] is an unsupervised learning technique that transforms the initial space to a lower dimensional subspace while maintaining as much information as possible. The technique is widely used in applications like dimensionality reduction, denoising and pattern recognition. PCA consists of taking the eigenvectors corresponding to the $n_p$ largest eigenvalues of the covariance matrix of a dataset, also known as the principal components, which span a subspace that retains the maximum variance of the dataset. For dimensionality reduction these principal components make up the lower dimensional dataset, and thus the new dimension equals $n_p$. Several nonlinear extensions to PCA have been proposed. One well-known extension is kernel PCA (KPCA) [21]. Instead of working on the data directly, it first applies a, possibly nonlinear, transformation that maps the input data to a high-dimensional feature space.

In multi-view learning the input data is described through multiple representations or views. A dataset could for example consist of images and the associated captions [14], video clips could be classified based on image as well as audio features [13], news stories could be covered by multiple sources [7], and so on. Multi-view learning has been applied in numerous applications both as supervised [3,28] and unsupervised [2,4] learning schemes. Multi-view dimensionality reduction reduces the multi-view dataset to a lower dimensional subspace to compactly represent the heterogeneous data, where each datapoint in the newly formed subspace is associated with multiple views. Dimensionality reduction is often beneficial for the learning process, especially when the data contains some sort of noise [6,8].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 205–215, 2018. https://doi.org/10.1007/978-3-030-01421-6_21

Most multi-view methods optimize a certain correlation between variables of two views. For example, in CCA [10] the correlation between the score variables is maximized, and in Multi-view LS-SVM [11] the product of the error variables is minimized. In real-world applications, however, data is often described through three views or more. This is usually accounted for by optimizing the sum of the pairwise correlations between different views. With this approach, higher-order correlations that could only be discovered by simultaneously considering all views are ignored. This issue was pointed out by Luo et al. [16], where the authors propose an extension to CCA, called Tensor CCA, that analyzes a covariance tensor over the data from all views. The model is formed by performing a tensor decomposition, which has a computational cost that is significantly higher than the cost of regular CCA. This idea of including tensor learning is presented in Fig. 1.

Fig. 1. An example with three views to motivate tensor learning in multi-view learning. (left) The standard coupling: only the pairwise correlations between the views are taken into account. (right) The tensor approach: the higher-order correlations between all views are modeled in a third order tensor.

Tensor learning in machine learning methods has been studied before. For example, Signoretto et al. [22] propose a tensor-based framework to perform learning when the data is multi-linear, and Wimalawarne et al. [27] collect the weight vectors corresponding to separate tasks in one weight tensor to achieve multi-task learning. This paper investigates the use of tensor learning in multi-view KPCA, in order to include the higher-order correlations. The paper proposes three multi-view KPCA methods, where the first two are special cases of the last method. Experiments, where the multi-view KPCA methods are used to reduce the dimensionality for clustering purposes, show the merit of our proposed methods.


We denote matrices by bold uppercase letters, vectors by bold lowercase letters and higher-order tensors by calligraphic letters. The superscript [v] denotes the vth view for the multi-view method, whereas the superscript (j) corresponds to the jth principal component.

2 Kernel PCA

Suykens et al. [26] formulated the kernel PCA problem in the primal-dual framework typical of Least Squares Support Vector Machines (LS-SVM) [25], where the dual problem is equivalent to the original kernel PCA formulation of Schölkopf et al. [21]. An advantage of the primal-dual framework is that it allows estimation in the primal space, which can be used for large-scale applications when solving the dual problem becomes infeasible. The formulation further provides an out-of-sample extension to deal with new unseen test data.

Suykens [24] later formulated kernel PCA in the Restricted Kernel Machines (RKM) framework, which preserves the advantages of the previous formulation. The primal and dual models are formed by means of conjugate feature duality, and give an expression in terms of visible and hidden layers respectively, in analogy with Restricted Boltzmann Machines (RBM) [9]. The dual problem is equivalent to the LS-SVM formulation (and hence the original formulation) up to a parameter. Furthermore, it is shown how multiple RKMs can be coupled to form a Deep RKM, which combines deep learning with kernel-based methods.

Given data $\{\mathbf{x}_k\}_{k=1}^{N} \subset \mathbb{R}^{d}$, the primal formulation of KPCA in the RKM framework is as follows:

$$\min_{\mathbf{w}, h_k} \; \frac{\eta}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{k=1}^{N} \varphi(\mathbf{x}_k)^T\mathbf{w}\, h_k + \frac{\lambda}{2}\sum_{k=1}^{N} h_k^2 \tag{1}$$

for $k = 1, \ldots, N$. The feature map $\varphi(\cdot): \mathbb{R}^{d} \to \mathbb{R}^{d_h}$ maps the input data to a high-dimensional (possibly infinite-dimensional) feature space. $\lambda$ and $\eta$ are positive regularization constants and the hidden features $h_k$ correspond to the projected values. The dual problem related to this primal formulation is:

$$\frac{1}{\eta}\,\boldsymbol{\Omega}\,\mathbf{h} = \lambda\,\mathbf{h} \tag{2}$$

where $\mathbf{h} = [h_1; \ldots; h_N]$ and $\boldsymbol{\Omega} \in \mathbb{R}^{N \times N}$ is a centered kernel matrix defined as

$$\Omega_{kl} = \left(\varphi(\mathbf{x}_k) - \hat{\boldsymbol{\mu}}\right)^T \left(\varphi(\mathbf{x}_l) - \hat{\boldsymbol{\mu}}\right), \quad k, l = 1, \ldots, N \tag{3}$$

with $\hat{\boldsymbol{\mu}} = (1/N)\sum_{k=1}^{N} \varphi(\mathbf{x}_k)$. The feature map $\varphi(\cdot)$ is usually not explicitly defined, but rather implicitly through a positive definite kernel function $K: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$. Based on Mercer's condition [20] we can formulate the kernel function as $K(\mathbf{x}_k, \mathbf{x}_l) = \varphi(\mathbf{x}_k)^T \varphi(\mathbf{x}_l)$.

Every eigenvalue-eigenvector pair $(\lambda, \mathbf{h})$ can be seen as a candidate solution of Eq. (1). The first principal component, i.e. the direction of maximal variance in the feature space, is determined by the eigenvector corresponding to the highest eigenvalue of $\frac{1}{\eta}\boldsymbol{\Omega}$. The maximum number of components that can be extracted equals the number of datapoints $N$. For an unseen test point $\mathbf{x}$, the projection into the subspace spanned by the jth principal component, i.e. the score variable $\hat{e}(\mathbf{x})^{(j)}$, can be obtained as

$$\hat{e}(\mathbf{x})^{(j)} = \frac{1}{\eta}\,\boldsymbol{\Omega}_{test}\,\mathbf{h}^{(j)} \tag{4}$$

where $\mathbf{h}^{(j)}$ is the eigenvector corresponding to the jth largest eigenvalue $\lambda$ and $\boldsymbol{\Omega}_{test}$ is the centered test kernel matrix calculated through the kernel function $K(\mathbf{x}_k, \mathbf{x}) = \varphi(\mathbf{x}_k)^T \varphi(\mathbf{x})$ for all $k = 1, \ldots, N$. If KPCA is used to perform dimensionality reduction, the new dimension of the data equals the number of selected components $n_p$.
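As an illustration of Eqs. (2)-(4), the sketch below computes the centered kernel matrix from an uncentered kernel matrix, solves the dual eigenvalue problem and projects test points. It is a minimal NumPy sketch under stated assumptions: the function names, the RBF bandwidth `sigma2` and the unit-norm scaling of the eigenvectors returned by `numpy.linalg.eigh` are choices made here for illustration and are not prescribed by the text.

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 * sigma2)) for all rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma2))

def center_train(K):
    """Centered kernel matrix Omega of Eq. (3), computed from K alone."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one

def center_test(K_test, K):
    """Centered test kernel matrix; K_test has shape (N_test, N)."""
    N = K.shape[0]
    one_t = np.full((K_test.shape[0], N), 1.0 / N)
    one = np.full((N, N), 1.0 / N)
    return K_test - one_t @ K - K_test @ one + one_t @ K @ one

def kpca_fit(K, n_components, eta=1.0):
    """Solve (1/eta) * Omega h = lambda h (Eq. 2), keeping the leading pairs."""
    lam, H = np.linalg.eigh(center_train(K) / eta)   # eigenvalues ascending
    order = np.argsort(lam)[::-1][:n_components]
    return lam[order], H[:, order]

def kpca_scores(K_test, K, H, eta=1.0):
    """Score variables of Eq. (4), one column per principal component."""
    return center_test(K_test, K) @ H / eta

# Toy usage on random data (assumption: 20 training and 5 test points in R^3).
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(20, 3)), rng.normal(size=(5, 3))
K = rbf_kernel(X_train, X_train)
lam, H = kpca_fit(K, n_components=2)
scores = kpca_scores(rbf_kernel(X_test, X_train), K, H)
```

Projecting the training points themselves returns $\lambda^{(j)}\mathbf{h}^{(j)}$, which is a convenient sanity check of the centering.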

3 Multi-view Kernel Principal Component Analysis

In this section we formulate a KPCA model for data described through different representations, or views. Instead of coupling the different views pairwise, we formulate an overall model so that higher-order correlations between the different views are also considered.

3.1 KPCA-ADD: Adding Kernel Matrices

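A first model, called KPCA-ADD, is formed by adding up the different KPCA objectives and assuming that all views share the same hidden features $\mathbf{h}$. Let $V$ be the number of views. Given data $\{\mathbf{x}_k^{[v]}\}_{k=1}^{N} \subset \mathbb{R}^{d^{[v]}}$, the primal formulation is stated as follows:

$$\min_{\mathbf{w}^{[v]}, h_k} \; \frac{\eta}{2}\sum_{v=1}^{V} \mathbf{w}^{[v]T}\mathbf{w}^{[v]} - \sum_{v=1}^{V}\sum_{k=1}^{N} \varphi^{[v]}(\mathbf{x}_k^{[v]})^T \mathbf{w}^{[v]}\, h_k + \frac{\lambda}{2}\sum_{k=1}^{N} h_k^2 \tag{5}$$

The stationary points of this objective function, denoted as $J$, in the primal formulation are characterized by:

$$\begin{cases} \dfrac{\partial J}{\partial h_k} = 0 \;\rightarrow\; \lambda h_k = \sum_{v=1}^{V} \mathbf{w}^{[v]T} \varphi^{[v]}(\mathbf{x}_k^{[v]}), \\[2mm] \dfrac{\partial J}{\partial \mathbf{w}^{[v]}} = 0 \;\rightarrow\; \mathbf{w}^{[v]} = \dfrac{1}{\eta}\sum_{k=1}^{N} \varphi^{[v]}(\mathbf{x}_k^{[v]})\, h_k, \end{cases} \tag{6}$$

where $k = 1, \ldots, N$ and $v = 1, \ldots, V$. By eliminating the weights $\mathbf{w}^{[v]}$, the dual formulation is obtained:

$$\frac{1}{\eta}\left(\boldsymbol{\Omega}^{[1]} + \ldots + \boldsymbol{\Omega}^{[V]}\right)\mathbf{h} = \lambda\,\mathbf{h} \tag{7}$$

where $\boldsymbol{\Omega}^{[v]}$ is the centered kernel matrix corresponding to view $v$, defined as $\Omega_{kl}^{[v]} = \left(\varphi^{[v]}(\mathbf{x}_k^{[v]}) - \hat{\boldsymbol{\mu}}^{[v]}\right)^T \left(\varphi^{[v]}(\mathbf{x}_l^{[v]}) - \hat{\boldsymbol{\mu}}^{[v]}\right)$ for $k, l = 1, \ldots, N$. Notice that this coupling results in adding up the kernel matrices belonging to the different views. The score variables corresponding to a test point $\mathbf{x}$ can be calculated by:

$$\hat{e}(\mathbf{x})^{(j)} = \frac{1}{\eta}\sum_{v=1}^{V} \boldsymbol{\Omega}_{test}^{[v]}\,\mathbf{h}^{(j)}. \tag{8}$$

4 Including Tensor Learning in Multi-view KPCA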

Even though in the KPCA-ADD formulation the views are coupled by the shared hidden features, there is still a separate model weight vector $\mathbf{w}^{[v]} \in \mathbb{R}^{d_h^{[v]}}$ for each view $v$. In order to introduce more coupling, a model tensor $\mathcal{W} \in \mathbb{R}^{d_h^{[1]} \times \ldots \times d_h^{[V]}}$ is presented. By using a tensor comprised of the weights of all views, instead of coupling them pairwise, it becomes possible to model higher-order correlations.

4.1 KPCA-PROD: Product of Kernel Matrices

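The introduction of a model tensor $\mathcal{W}$ leads to the KPCA-PROD model, where the primal formulation is given by:

$$\min_{\mathcal{W}, h_k} \; \frac{\eta}{2}\langle \mathcal{W}, \mathcal{W} \rangle - \sum_{k=1}^{N} \langle \Phi^{(k)}, \mathcal{W} \rangle\, h_k + \frac{\lambda}{2}\sum_{k=1}^{N} h_k^2 \tag{9}$$

where $\langle \cdot, \cdot \rangle$ is the tensor inner product defined as

$$\langle \mathcal{A}, \mathcal{B} \rangle := \sum_{i_1=1}^{I_1} \cdots \sum_{i_M=1}^{I_M} \mathcal{A}_{i_1 \cdots i_M}\, \mathcal{B}_{i_1 \cdots i_M} \tag{10}$$

for two $M$-th order tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times \ldots \times I_M}$. The rank-1 tensor $\Phi^{(k)} \in \mathbb{R}^{d_h^{[1]} \times \ldots \times d_h^{[V]}}$ is composed by the outer product of the feature maps of all views, i.e. $\Phi^{(k)} = \varphi^{[1]}(\mathbf{x}_k^{[1]}) \otimes \ldots \otimes \varphi^{[V]}(\mathbf{x}_k^{[V]})$. The stationary points of the objective function $J$ in the primal formulation are characterized by:

$$\begin{cases} \dfrac{\partial J}{\partial h_k} = 0 \;\rightarrow\; \lambda h_k = \langle \Phi^{(k)}, \mathcal{W} \rangle = \sum_{i_1=1}^{d_h^{[1]}} \cdots \sum_{i_V=1}^{d_h^{[V]}} \varphi^{[1]}(\mathbf{x}_k^{[1]})_{i_1} \cdots \varphi^{[V]}(\mathbf{x}_k^{[V]})_{i_V}\, \mathcal{W}_{i_1 \ldots i_V}, \\[2mm] \dfrac{\partial J}{\partial \mathcal{W}_{i_1 \ldots i_V}} = 0 \;\rightarrow\; \mathcal{W}_{i_1 \ldots i_V} = \dfrac{1}{\eta}\sum_{k=1}^{N} \varphi^{[1]}(\mathbf{x}_k^{[1]})_{i_1} \cdots \varphi^{[V]}(\mathbf{x}_k^{[V]})_{i_V}\, h_k, \end{cases} \tag{11}$$

where $k = 1, \ldots, N$ and $i_v = 1, \ldots, d_h^{[v]}$ for $v = 1, \ldots, V$.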
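A property worth checking numerically at this point is that the tensor inner product of Eq. (10), applied to two rank-1 tensors of the form $\Phi^{(k)}$, factorizes into the product of the view-wise inner products: $\langle \Phi^{(k)}, \Phi^{(l)} \rangle = \prod_{v} \varphi^{[v]}(\mathbf{x}_k^{[v]})^T \varphi^{[v]}(\mathbf{x}_l^{[v]})$. The sketch below verifies this with explicit, finite-dimensional feature vectors; the dimensions and random vectors are hypothetical stand-ins, since in practice the feature maps are only available through kernel functions.

```python
import numpy as np
from functools import reduce

def rank1_tensor(vectors):
    """Outer product of a list of 1-D arrays, one array per view."""
    return reduce(np.multiply.outer, vectors)

def tensor_inner(A, B):
    """Tensor inner product of Eq. (10): sum over all matching entries."""
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
dims = (3, 4, 2)                               # hypothetical d_h per view
phi_k = [rng.normal(size=d) for d in dims]     # stand-in for the features of point k
phi_l = [rng.normal(size=d) for d in dims]     # stand-in for the features of point l

lhs = tensor_inner(rank1_tensor(phi_k), rank1_tensor(phi_l))
rhs = float(np.prod([a @ b for a, b in zip(phi_k, phi_l)]))
```

Here `lhs` and `rhs` agree up to rounding, so inner products between the tensors $\Phi^{(k)}$ can be evaluated from the view-specific kernel values without ever forming the tensors.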


By eliminating the weights, the following dual problem is derived:

$$\frac{1}{\eta}\left(\boldsymbol{\Omega}^{[1]} \odot \ldots \odot \boldsymbol{\Omega}^{[V]}\right)\mathbf{h} = \lambda\,\mathbf{h} \tag{12}$$

where $\odot$ denotes the element-wise product. Notice that the dual problem results in element-wise multiplication of the view-specific kernel matrices. The score variable corresponding to an unseen test point $\mathbf{x}$ can hence be calculated by:

$$\hat{e}(\mathbf{x})^{(j)} = \frac{1}{\eta}\left(\bigodot_{v=1}^{V} \boldsymbol{\Omega}_{test}^{[v]}\right)\mathbf{h}^{(j)} \tag{13}$$

4.2 KPCA-ADDPROD

Taking the element-wise product of kernel matrices can have some unwanted results. Take for example kernel matrices comprised of linear kernel functions. An element of such a linear kernel matrix could be negative, indicating a low similarity between two points. By multiplying the elements of the kernel matrices, highly negative values could result in a high positive value for a certain datapoint pair, which would indicate a very high similarity, which is clearly unwanted. Even for kernel matrices comprised of RBF kernel functions, where the values lie between zero and one, a poor view that marks a certain datapoint pair as non-similar, and hence assigns a value close to zero, could influence the final result too harshly. Therefore a last model is proposed, called KPCA-ADDPROD, where the two principles of the previous models are combined. A parameter $\rho$ is added in order to determine the influence of each part. The primal formulation is given by:

$$\min_{\mathcal{W}, \mathbf{w}^{[v]}, h_k} \; \frac{\eta}{2}\langle \mathcal{W}, \mathcal{W} \rangle - \sqrt{\rho}\sum_{k=1}^{N} \langle \Phi^{(k)}, \mathcal{W} \rangle\, h_k + \frac{\lambda}{2}\sum_{k=1}^{N} h_k^2 + \frac{\eta}{2}\sum_{v=1}^{V} \mathbf{w}^{[v]T}\mathbf{w}^{[v]} - \sqrt{1-\rho}\sum_{v=1}^{V}\sum_{k=1}^{N} \varphi^{[v]}(\mathbf{x}_k^{[v]})^T \mathbf{w}^{[v]}\, h_k \tag{14}$$

where $\rho \in [0, 1] \subset \mathbb{R}$. By deriving the stationary points of the objective and eliminating the weights, the following dual problem is obtained:

$$\frac{1}{\eta}\left((1-\rho)\sum_{v=1}^{V} \boldsymbol{\Omega}^{[v]} + \rho \bigodot_{v=1}^{V} \boldsymbol{\Omega}^{[v]}\right)\mathbf{h} = \lambda\,\mathbf{h}. \tag{15}$$

Note that if $\rho = 0$ the model is equivalent to KPCA-ADD, and if $\rho = 1$ it is equivalent to KPCA-PROD.
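The dual problem in Eq. (15) only needs the view-specific centered kernel matrices, so the three models can share one routine. The following is a minimal NumPy sketch of that idea; the function names and the use of centered linear kernels on random toy data are assumptions made for illustration (the experiments in this paper use RBF kernels).

```python
import numpy as np

def centered_linear_kernel(X):
    """Toy stand-in for a view-specific centered kernel matrix (linear kernel)."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def addprod_matrix(omegas, rho):
    """(1 - rho) * sum over views + rho * element-wise product over views."""
    omegas = [np.asarray(o, dtype=float) for o in omegas]
    added = np.sum(omegas, axis=0)
    multiplied = np.prod(omegas, axis=0)
    return (1.0 - rho) * added + rho * multiplied

def multiview_kpca(omegas, rho, n_components, eta=1.0):
    """Leading eigenpairs of the dual problem in Eq. (15)."""
    lam, H = np.linalg.eigh(addprod_matrix(omegas, rho) / eta)
    order = np.argsort(lam)[::-1][:n_components]
    return lam[order], H[:, order]

# Toy usage: three views of the same 15 datapoints (dimensions are arbitrary).
rng = np.random.default_rng(1)
views = [rng.normal(size=(15, d)) for d in (2, 3, 4)]
omegas = [centered_linear_kernel(X) for X in views]
lam_add, H_add = multiview_kpca(omegas, rho=0.0, n_components=3)    # KPCA-ADD
lam_prod, H_prod = multiview_kpca(omegas, rho=1.0, n_components=3)  # KPCA-PROD
lam_mix, H_mix = multiview_kpca(omegas, rho=0.4, n_components=3)    # KPCA-ADDPROD
```

Treating $\rho$ as one more hyperparameter of this routine is what allows a single grid search to cover all three models.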

5 Experiments

This section describes the experiments performed to evaluate the multi-view KPCA models as dimensionality reduction techniques. To assess the performance, the KPCA methods are used as a preprocessing step for clustering, and the clustering accuracy is regarded as the evaluation criterion. Two clustering methods are considered: k-means (KM) [18], a well-known linear clustering algorithm, and Kernel Spectral Clustering (KSC) [1], a nonlinear clustering technique within the LS-SVM framework. To determine the clustering accuracy, the NMI [23] is reported¹. Because KM converges to local optima, its results are averaged over 50 runs. The performance of the proposed multi-view models is compared to the performance on the separate views, both by clustering the views directly and by clustering after KPCA was performed.

Model Selection. The parameter η is set to 1 in all experiments, since this parameter is of most importance when multiple RKMs are stacked to form a deep RKM. The RBF kernel function was used for all experiments, both for the KPCA methods and for KSC. The performance of the (multi-view) KPCA models depends on the (view-specific) kernel parameter and the number of principal components $n_p$. For KPCA-ADDPROD it also depends on the parameter ρ. Both KSC and KM depend on the number of clusters, and KSC also on the kernel parameter. These parameters are tuned through a grid search with 5-fold cross-validation. Since the methods are all unsupervised, the model selection criterion has to be unsupervised as well. Here the Davies-Bouldin index (DB) [5] is used.

Datasets. A brief description of each dataset used is given here:

– Image-caption dataset: A dataset comprised of images, together with their associated captions. We thank the authors of [14] for providing the dataset. Each image-caption pair represents a figure related to sport, aviation or paintball. For each of these categories, 400 records are available. The first two views consist of different features describing the image (HSV colour and image Gabor texture). The third view describes the associated caption text by its term frequencies. Gaussian white noise is added to the first two views.
– YouTube Video dataset: A dataset describing YouTube videos of video gaming, originally proposed by Madani et al. [19]². The videos are described through textual, visual and auditory features. For this paper we selected the textual feature LDA, the visual Motion feature through CIPD [29] and the audio feature MFCC [17] as three views. From each of the seven most occurring labels (excluding the last label, since these datapoints represent videos not belonging to any of the other 30 classes) 300 videos were randomly sampled.
– UCI Ads dataset: This dataset, as described by Kushmerick [15]³, was constructed for the task of predicting whether a certain hyperlink corresponds to an advertisement or not. The features are divided over three views in the same way as was done by Luo et al. [16]. The dataset consists of 2821 instances not corresponding to advertisements, and 458 instances that do.

¹ To calculate the NMI, and hence assess the performance, the labels of the dataset are used. However, notice that they are never used in the training or validation phase of KM, KSC or the proposed multi-view KPCA models.
² http://archive.ics.uci.edu/ml/datasets/youtube+multiview+video+games+dataset.

Results. The results of the performed experiments are depicted in Table 1. The table shows the clustering accuracy found by using the clustering techniques on the views directly, and when KPCA was applied as a dimensionality reduction technique first. It further shows the accuracy when the proposed multi-view KPCA techniques are applied. For the KPCA-ADDPROD method, the optimal value found for ρ is also noted.

Table 1. NMI results, where the proposed methods function as dimensionality reduction methods for KM and KSC. The best performing methods are indicated in bold.

                      Image-caption            YouTube Video            Ads
Method                View 1  View 2  View 3   View 1  View 2  View 3   View 1  View 2  View 3
KM                    0.502   0.301   0.206    0.434   0.200   0.052    0.068   0.028   0.071
KPCA+KM               0.516   0.328   0.412    0.375   0.207   0.065    0.016   0.021   0.047
KPCA-ADD+KM           0.596                    0.273                    0.016
KPCA-PROD+KM          0.154                    0.076                    0.291
KPCA-ADDPROD+KM       0.643 (ρ = 0.4)          0.279 (ρ = 0.2)          0.291 (ρ = 1)
KSC                   0.061   0.107   0.066    0.028   0.025   0.030    0.017   0.077   0.312
KPCA+KSC              0.474   0.330   0.295    0.243   0.167   0.037    0.013   0.094   0.046
KPCA-ADD+KSC          0.520                    0.166                    0.085
KPCA-PROD+KSC         0.031                    0.025                    0.147
KPCA-ADDPROD+KSC      0.568 (ρ = 0.4)          0.248 (ρ = 0.2)          0.147 (ρ = 1)
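For readers who want to reproduce this kind of evaluation, the sketch below computes a normalized mutual information score between ground-truth labels and cluster assignments. It assumes the normalization by the geometric mean of the two entropies, as in Strehl and Ghosh [23]; the function name and the convention of returning 0 when one partition has zero entropy are choices made here, and the exact implementation used for the reported results is not specified in the text.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI = I(U; V) / sqrt(H(U) * H(V)), using natural logarithms."""
    a = np.asarray(labels_true).ravel()
    b = np.asarray(labels_pred).ravel()
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    ai, bi = ai.ravel(), bi.ravel()
    # Contingency table of joint label counts, turned into joint probabilities.
    cont = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(cont, (ai, bi), 1.0)
    p = cont / a.size
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = float(np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])))
    ha = float(-np.sum(pa[pa > 0] * np.log(pa[pa > 0])))
    hb = float(-np.sum(pb[pb > 0] * np.log(pb[pb > 0])))
    if ha == 0.0 or hb == 0.0:
        return 0.0   # convention chosen here for a single-cluster partition
    return mi / np.sqrt(ha * hb)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # statistically independent
```

The score is invariant to a permutation of the cluster labels, which is why it can compare clusterings against class labels without matching label names.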

A first observation is that the performance usually improves when KPCA is used as a dimensionality reduction method before clustering the views separately. This encourages the use of dimensionality reduction on these datasets. A notable exception is the accuracy when using KM on the first view of the YouTube Video dataset. A second observation is that the multi-view KPCA methods improve the clustering accuracy in five out of the six experiments, suggesting the merit of the multi-view techniques independently of the choice of clustering technique. Only for the YouTube Video dataset is the (multi-view) dimensionality reduction unable to improve on the result of applying KM to the first view directly. Another interesting observation is that the optimal ρ found for each dataset is equal for both clustering methods. Since ρ determines the importance of the model tensor, this could be an indication of the number of relevant higher-order correlations in a dataset. For the first two datasets ρ is relatively small. For these two datasets KPCA-ADD outperforms KPCA-PROD considerably, which is to be expected since these two models are special cases of KPCA-ADDPROD with ρ = 0 and ρ = 1 respectively. For the Ads dataset the optimal ρ found equals 1, and hence only the model tensor is taken into account, suggesting a high importance of higher-order correlations.

³ http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.

6 Conclusion

This paper introduced novel Multi-view Kernel Principal Component Analysis methods to perform KPCA when the data is represented by multiple views. Techniques from tensor learning are applied in order to account for higher-order correlations between the views. The paper starts from the primal RKM formulation of KPCA and shows three approaches for a multi-view extension. It is shown that, when assuming shared hidden features, the dual model results in kernel addition. It is further shown that introducing a model tensor, containing the information of all views, results in a kernel product in the dual formulation. Finally, a third method is suggested that combines the two techniques. The gain of these multi-view techniques is shown by using them as a dimensionality reduction step before clustering. Experiments on multiple real-world datasets with two well-known clustering techniques show the improvement obtained by using multiple views. The parameter controlling the importance of the model tensor seems to indicate the importance of the higher-order correlations.

Acknowledgments. Research supported by Research Council KUL: CoE PFV/10/002 (OPTEC), PhD/Postdoc grants Flemish Government; FWO: projects: G0A4917N (Deep restricted kernel machines), G.088114N (Tensor based data similarity), ERC Advanced Grant E-DUALITY (787960).

References 1. Alzate, C., Suykens, J.A.K.: Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Trans. Pattern Anal. Mach. Intell. 32(2), 335–347 (2010) 2. Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: ICML, pp. 1247–1255 (2013) 3. Bekker, A., Shalhon, M., Greenspan, H., Goldberger, J.: Multi-view probabilistic classiﬁcation of breast microcalciﬁcations. IEEE Trans. Med. Imaging 35(2), 645– 653 (2016) 4. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT, pp. 92–100 (1998)

214

L. Houthuys and J. A. K. Suykens

5. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979) 6. Foster, D.P., Kakade, S.M., Zhang, T.: Multi-view dimensionality reduction via canonical correlation analysis. Toyota Technical Institute-Chicago (2008) 7. Greene, D., Cunningham, P.: A matrix factorization approach for integrating multiple data views. In: Buntine, W., Grobelnik, M., Mladeni´c, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS (LNAI), vol. 5781, pp. 423–438. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04180-8 45 8. Han, Y., Wu, F., Tao, D., Shao, J., Zhuang, Y., Jiang, J.: Sparse unsupervised dimensionality reduction for multiple view data. IEEE Trans. Circ. Syst. Video Technol. 22(10), 1485–1496 (2012) 9. Hinton, G.E.: What kind of a graphical model is the brain? In: Proceedings of the 19th International Joint Conference on Artiﬁcial Intelligence, IJCAI 2005, pp. 1765–1775. Morgan Kaufmann Publishers Inc., San Francisco (2005) 10. Hotelling, H.: Relations between two sets of variates. Biometrica 28, 321–377 (1936) 11. Houthuys, L., Langone, R., Suykens, J.A.K.: Multi-view least squares support vector machines classiﬁcation. Neurocomputing 282, 78–88 (2018) 12. Jolliﬀe, I.T.: Principal Component Analysis. Springer, New York (1986). https:// doi.org/10.1007/978-1-4757-1904-8 13. Kidron, E., Schechner, Y.Y., Elad, M.: Pixels that sound. In: CVPR, vol. 1, pp. 88–95 (2005) 14. Kolenda, T., Hansen, L.K., Larsen, J., Winther, O.: Independent component analysis for understanding multimedia content. In: IEEE Workshop on Neural Networks for Signal Processing, vol. 12, pp. 757–766 (2002) 15. Kushmerick, N.: Learning to remove internet advertisem*nts. In: AGENTS 1999, pp. 175–181 (1999) 16. Luo, Y., Tao, D., Ramamohanarao, K., Xu, C., Wen, Y.: Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Trans. Knowl. Data Eng. 27(11), 3111–3124 (2015) 17. 
Lyon, R.F., Rehn, M., Bengio, S., Walters, T.C., Chechik, G.: Sound retrieval and ranking using sparse auditory representations. Neural Comput. 22(9), 2390–2416 (2010) 18. Macqueen, J.: Some methods for classiﬁcation and analysis of multivariate observations. In: Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967) 19. Madani, O., Georg, M., Ross, D.A.: On using nearly-independent feature families for high precision and conﬁdence. Mach. Learn. 92, 457–477 (2013) 20. Mercer, J.: Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. London. Ser. A Contain. Pap. Math. Phys. Character 209, 415–446 (1909) 21. Sch¨ olkopf, B., Smola, A., M¨ uller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998) 22. Signoretto, M., Tran Dinh, Q., De Lathauwer, L., Suykens, J.A.K.: Learning with tensors: a framework based on convex optimization and spectral regularization. Mach. Learn. 94, 303–351 (2014) 23. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002) 24. Suykens, J.A.K.: Deep restricted kernel machines using conjugate feature duality. Neural Comput. 29(8), 2123–2163 (2017) 25. Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares Support Vector Machines. World Scientiﬁc, Singapore (2002)

Tensor Learning in Multi-view Kernel PCA

215

26. Suykens, J.A.K., Van Gestel, T., Vandewalle, J., De Moor, B.: A support vector machine formulation to PCA analysis and its kernel version. IEEE Trans. Neural Netw. 14(2), 447–450 (2003) 27. Wimalawarne, K., Sugiyama, M., Tomioka, R.: Multitask learning meets tensor factorization: Task imputation via convex optimization. In: NIPS, vol. 4, pp. 2825– 2833 (2014) 28. Wozniak, M., Jackowski, K.: Some remarks on chosen methods of classiﬁer fusion ´ Baruque, based on weighted voting. In: Corchado, E., Wu, X., Oja, E., Herrero, A., B. (eds.) HAIS 2009. LNCS (LNAI), vol. 5572, pp. 541–548. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02319-4 65 29. Yang, W., Toderici, G.: Discriminative tag learning on Youtube videos with latent sub-tags. In: CVPR, pp. 3217–3224 (2011)

Reinforcement

ACM: Learning Dynamic Multi-agent Cooperation via Attentional Communication Model

Xue Han1, Hongping Yan1(✉), Junge Zhang2, and Lingfeng Wang2

1 Department of Information Engineering, China University of Geosciences, No. 29 College Road, Haidian District, Beijing, China
[emailprotected], [emailprotected]
2 Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Haidian District, Beijing, China
{jgzhang,lfwang}@nlpr.ia.ac.cn

Abstract. The collaboration of multiple agents is required in many real-world applications, yet it is a challenging task due to partial observability. Communication is a common scheme for resolving this problem. However, most communication protocols are manually specified and cannot capture the dynamic interactions among agents. To address this problem, this paper presents a novel Attentional Communication Model (ACM) to achieve dynamic multi-agent cooperation. First, we propose a new Cooperation-aware Network (CAN) to capture the dynamic interactions, including both dynamic routing and messaging among agents. Second, the CAN is integrated into a Reinforcement Learning (RL) framework to learn the policy of multi-agent cooperation. The approach is evaluated in both discrete and continuous environments and outperforms competing methods.

Keywords: Multi-agent RL · Communication · Cooperation · Attention

1 Introduction

Many real-world applications, such as autonomous vehicle control and resource management systems, can be naturally modeled as multi-agent problems. Many solutions have been proposed to address the multi-agent problem. For example, [2] learns the strategy by regression, using an information framework based on rate distortion theory. However, it adapts poorly to complex decentralized problems. Recently, the field of multi-agent RL has attracted considerable attention [12, 13, 18], since RL is one of the main methods for training a system that learns by itself through interaction with the environment, which is closer to the way humans learn. In practice, RL has been used successfully to solve single-agent problems [16, 19, 20]. Unfortunately, it is difficult to solve multi-agent problems with traditional RL models. One of the major challenges is the instability of the environment. The environment of multi-agent RL depends on the actions of multiple agents and involves the interactions among agents, which implies that the key problem in a multi-agent environment is how to collaborate.

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 219–229, 2018. https://doi.org/10.1007/978-3-030-01421-6_22

Collaboration is an important manifestation of intelligence, making agents appear as a whole rather than as a collection of individuals. Communication is a common scheme for achieving collaboration, and its core is the construction of the communication protocol, which includes routing and messaging. Manually specified communication protocols are still widely applied in RL [5, 10]. Most of them use the action strategy of each agent as the message to stabilize the environment, which cannot adapt to changing environments and strategies. Hoshen constructs the communication routing dynamically, on the grounds that different relationships among the agents lead to distinct influence [9]. However, besides determining the communication routing, distilling the state information into the message is important for multi-agent problems: the action strategy contains a lot of useless information, so it not only consumes communication resources but also distracts the attention of agents, which makes it difficult for the policy to achieve collaboration. To address this problem, we propose an Attentional Communication Model (ACM) that adaptively constructs the communication routing and messaging. For this purpose, we adopt the attention mechanism, which derives from the attention model of the human brain [1, 8]. To introduce ACM for multi-agent collaboration, we construct two networks: the policy network of the agents and the Cooperation-aware Network (CAN). CAN, a two-branch network, dynamically constructs the communication protocol and serves the policy network. The two networks are updated iteratively to obtain collaborating agents, so that ACM can use the information effectively to achieve collaboration.
The main contributions of this paper are as follows:
(1) We propose the CAN, which dynamically calculates the relationships among agents to determine the routing and distills the state information into the message. It not only saves communication resources but also makes better use of the action strategy, so that the agents can obtain smart cooperation strategies and training becomes more stable. Most importantly, CAN dynamically builds the communication protocol to adapt to changing environments and strategies.
(2) The CAN is combined with the policy network built by RL algorithms to construct ACM. After sufficient training, ACM demonstrates an outstanding ability to collaborate in the environment.

2 Related Work

Early approaches to multi-agent interaction involve no communication. Tan experimented with Q-learning using independent agents, but this does not perform well in practice [22]. The reason is that each agent only partially observes the environment and lacks necessary information because of its limited field of view. As the agents keep learning and changing their strategies, the environment is extremely unstable, which makes it difficult for the agents' strategies to collaborate and converge. Another approach is parameter sharing, as in [7, 11]. These methods can draw more samples from the training strategies but still lack the necessary information in partially observable environments, so the resulting strategy is poor and converges slowly. Therefore, recent work mainly focuses on transmitting information through communication to stabilize the environment.

The core of communication is the communication protocol. Some work passes on all the parameters of the policy [5] or simplified information about the training strategy [10]. Both [3] and [21] use the deep Q-network (DQN) to construct agents, but [3] directly transmits the actions, while [21] broadcasts a communication vector that is the mean value of the states. [4, 14, 15] use a more sophisticated actor-critic mechanism to deliver action strategies: all agents in [4] share a single critic; [14] learns a critic for each agent, which is applicable to both cooperative and competitive scenarios; and [15] compares these two settings and adds a coordinator to encode the states and actions. [6] applies the idea to GANs, passing images to other generators. In short, the above work transmits the action strategies of other agents to stabilize the environment and adopts manually predefined communication protocols. However, as the strategy and the environment change, the necessary information changes constantly. In this paper, therefore, the communication protocol is learned dynamically, and both messaging and routing are determined through learning. The useful information of the current state is selected to avoid consuming channel resources and distracting attention. Hoshen's work [9] is most similar to ours. Hoshen proposed VAIN, which uses the attention mechanism to compute the relationships among agents, so VAIN constructs the routing dynamically. The most important difference between VAIN and ACM is that we dynamically build the whole communication protocol, including routing and a more sophisticated message. The benefits of our work are saving communication resources, focusing on the currently useful state information, and ultimately training agents with collaboration capabilities.

3 Approach

Among multi-agent problems, constructing a communication protocol is one of the most effective ways to achieve collaboration. A dynamically built communication protocol can adapt to a changing environment. Therefore, we introduce the Attentional Communication Model (ACM). In the following, we first define the model. Second, we construct the framework and explain how to determine the communication routing. Third, we elaborate on how to dynamically distill the state information into the message. Finally, the routing and messaging are combined to construct the communication protocol.

3.1 Definition

The multi-agent problem is complex and, in reality, usually partially observable. To address this difficulty, it is a good choice to use the Shared Parameters Partially Observable Markov Decision Process (SP-POMDP), a classical approach to multi-agent problems. Inspired by SP-POMDP, our problem model consists of a ten-tuple $\langle A_n, S, O, R, M, U, P, R, X, t \rangle$, in which

$A_n = \{a_1, a_2, \ldots, a_n\}$ is the collection of all agents, and $n$ is the number of agents.
$s_t \in S$ is the state at the current time step $t$.
$O_t^i = \{o_t^i \mid s_t \in S,\; o_t^i = O(s_t, i)\}$ is the observation space of player $i$. The observation function $O : S \times \{1, \ldots, n\} \to \mathbb{R}^d$ specifies each agent's $d$-dimensional view of the state space. For the sake of simplicity we write $o_t = \{o_t^1, o_t^2, \ldots, o_t^n\}$.
$R_t^i \in R$, $R_t^i = \{R_t^{i1}, \ldots, R_t^{in}\}$, are the relationships between agent $i$ and all the other agents at the current time step $t$.
$M_t \in M$ distills the observation information of all agents into the message at the current time step $t$.
$u_t \in U$, $u_t = \{u_t^1, u_t^2, \ldots, u_t^n\}$, is the collection of actions of all agents at the current time step $t$, with $u_t^i = \pi^i(o_t^i, u_{t-1}, R_t^i, M_t)$, where $\pi^i$ is the policy of agent $i$.
$P(s_{t+1} \mid s_t, u_t)$ is the state transition function.
$R : o_t^i \times u_t^i \times u_{t-1} \times R_t^i \times M_t \to \mathbb{R}$ is the reward function of agent $i$; it is always written with its arguments, $R(\cdot)$, to distinguish it from the relationships $R_t^i$.
$X$ stores all samples, i.e. the tuples of $o_t^i$, $u_t^i$, $u_{t-1}$, $R_t^i$, $M_t$ and the corresponding reward.

Fig. 1. The framework of ACM.
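As an illustration of the problem model, the sketch below shows one possible in-memory representation of a single sample stored in X. The class and field names are hypothetical and chosen only to mirror the symbols of the ten-tuple; the paper does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One stored sample for agent i at time step t (illustrative field names)."""
    observation: List[float]       # o_t^i, the agent's d-dimensional view
    action: float                  # u_t^i
    previous_actions: List[float]  # u_{t-1}, the last actions of all agents
    relationships: List[float]     # R_t^i, attention of agent i to every agent
    message: List[float]           # M_t, weights over the observation dimensions
    reward: float                  # value returned by the reward function R(.)


class SampleStore:
    """Minimal container playing the role of X."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []

    def add(self, sample: Sample) -> None:
        self._samples.append(sample)

    def __len__(self) -> int:
        return len(self._samples)


# Hypothetical values for a two-agent, two-dimensional toy case.
store = SampleStore()
store.add(Sample([0.1, 0.4], 1.0, [0.0, 1.0], [0.7, 0.3], [0.6, 0.4], 5.0))
```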

3.2 The Framework

ACM is a multi-agent communication model constructed by combining CAN with the policy network, as shown in Fig. 1. We first train the policy until it converges and then keep it fixed to train CAN. CAN uses the observations and actions of the agents to construct the communication routing and messaging, which are passed to the policy network to obtain the action. The policy network then calculates advantage values for training the parameters of CAN; this step exists only during training. The advantage value is calculated as in Eq. (1), with $\gamma \in [0, 1)$. The trained CAN is then kept fixed and integrated with RL to obtain the final cooperative strategy.

$$
\begin{aligned}
Q(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t) &= \mathbb{E}_{o_{t+1}, u_{t+1}, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t R(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t)\Big] \\
V(o_t^i, u_{t-1}, R_t^i, M_t) &= \mathbb{E}_{u_t, o_{t+1}, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t R(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t)\Big] \\
A(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t) &= Q(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t) - V(o_t^i, u_{t-1}, R_t^i, M_t)
\end{aligned}
\tag{1}
$$
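The advantage in Eq. (1) can be estimated from sampled episodes. The sketch below is a minimal illustration that assumes a Monte Carlo estimate of Q (the discounted return observed from step t onward) and externally supplied value estimates for V; the paper does not state which estimator is used, and the function names and numbers are hypothetical.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = sum_k gamma^k * r_{t+k} for every step t of one episode."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """A_t = Q_t - V_t, with Q_t estimated by the sampled discounted return."""
    q_estimates = discounted_returns(rewards, gamma)
    return [q - v for q, v in zip(q_estimates, values)]


rewards = [0.0, 0.0, 5.0]   # e.g. an evader is caught at the last step
values = [3.0, 4.0, 5.0]    # hypothetical value estimates V_t
adv = advantages(rewards, values, gamma=0.5)
```

A positive entry means the sampled action sequence did better than the value estimate predicted, which is the signal used to push the CAN parameters.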

Fig. 2. Cooperation-aware Network (CAN) architecture diagram. Cooperation-aware Network Routing (CANR) consists of a fully connected layer (the snaking blue line symbolizes the sliding of each filter across the inputs), followed by a softmax output layer with a single output for each valid agent. Cooperation-aware Network Messaging (CANM) includes two fully connected layers, followed by a softmax output layer in which each valid observation dimension has a unique output. Each hidden layer is followed by a sigmoid nonlinearity (that is, $\frac{1}{1+e^{-x}}$). (Color figure online)

CAN is a two-branch network consisting of Cooperation-aware Network Routing (CANR) and Cooperation-aware Network Messaging (CANM). CANR dynamically determines the routing, while CANM dynamically distills the state information into the message. CAN originates from the attention mechanism. The architecture of the network is shown in Fig. 2. Attention is essentially a content-based addressing mechanism, which prepares the subsequent information distillation. CAN uses the classic additive attention mechanism [1]. Hence, the core of the network can be expressed by the following equation:

$$
f_{att}(a_i, a_j) = v^{T}\,\mathrm{sigmoid}\big(W[a_i, a_j] \,\|\, W[o]\big)
\tag{2}
$$

$f_{att}$ is the attention function of CAN, and $v$ and $W$ are the weight matrices of CAN. CANR uses $W[a_i, a_j]$, and CANM uses $W[o]$. Using fully connected layers in CAN facilitates processing and distilling all the information. The inputs of CAN are the observations of all current agents, which are passed to the two branches separately after dimension reduction (Encoder). Inspired by the idea of [9], we propose CANR to determine the routing in a similar way.

CANR. CANR dynamically constructs the routing. Agents have different relationships with one another, and these relationships have different effects. Therefore, we dynamically model the relationships between agents to determine the influence of the other agents on the current agent.

$$
R_t^{ij} = \mathrm{softmax}\big(a_t^i\, a_t^j\big)
\tag{3}
$$

where $R_t^{ij}$ represents the current attention of agent $i$ to agent $j$, $a_t^i = E(o_t^i)$ is the encoded observation of agent $i$, and $E$ denotes the encoding process. CANR constructs the real-time communication routing from the relationships computed above.

3.3 Message

CANM dynamically distills the state information into the message. The purpose of communication is to transfer state information so that the agents' policy can achieve collaboration; the message therefore plays a crucial role. When the dimension of the information is large, the agent cannot handle all of it well. Receiving too much redundant information may distract agents and prevent them from using the state information effectively. Therefore, we propose to distill the message dynamically, based on the attention mechanism. As agents have different information requirements in different periods, we distill the message before each iteration of the agent's policy. CANM borrows from the Trust Region Policy Optimization (TRPO) algorithm of RL. TRPO [17] is a policy gradient algorithm that guarantees monotonic improvement of the policy and is effective for optimizing large-scale nonlinear policies. At each time step, the goal of the TRPO algorithm is to optimize the policy under a constraint:

$$
\begin{aligned}
&\underset{\theta}{\text{maximize}}\;\; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, u \sim q}\left[\frac{\pi_\theta(u \mid s)}{q(u \mid s)}\, A_{\theta_{old}}(s, u)\right] \\
&\text{s.t.}\;\; \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta
\end{aligned}
\tag{4}
$$

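where $q(u \mid s) = \pi_{\theta_{old}}(u \mid s)$ indicates that the existing $\pi_{\theta_{old}}$ is used for importance sampling, $\rho_{\theta_{old}} = \rho_{\pi_{\theta_{old}}}$ is the state visitation frequency defined by $\pi_{\theta_{old}}$, $D_{KL}$ is the KL divergence between the two strategy distributions, and $\delta$ controls the maximum change between two strategies at each time step. $A$ is the advantage value, as in Eq. (1). CANM updates the network parameters as in Eq. (5), according to the advantage value passed by the policy network:

$$
\begin{aligned}
&\underset{\omega}{\text{maximize}}\;\; \mathbb{E}_{o \sim \rho_{\pi_\theta},\, u \sim \pi_\theta}\left[\frac{D}{D_t}\, A_\theta(o_t^i, u_t^i, u_{t-1}, \upsilon_t^i, D_t)\right] \\
&\text{s.t.}\;\; \mathbb{E}_{o \sim \rho_{\pi_\theta}}\left[D_{KL}(D_t \,\|\, D)\right] \le \delta
\end{aligned}
\tag{5}
$$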

where $D = M_\omega$ is the CANM network distribution, $i$ is the index of the agent, and $\upsilon_t^i = R_t^i$ denotes the relationships between agent $i$ and the others at the current time step. The advantage value is still calculated by the policy network, but the parameters $\theta$ of the policy network are now fixed, and the only variables are the parameters of CANM. For the policy network, we add as an input the dynamically distilled message of the current step, in order to measure the impact of the currently selected message on the agent's reward. The goal is to update the parameters of CANM in a direction that increases the reward. Finally, the optimization problem of CANM is given in Eq. (6):

$$
\begin{aligned}
&\underset{\omega}{\text{maximize}}\;\; \mathbb{E}_{o \sim \rho_{\pi_\theta},\, u \sim \pi_\theta}\left[\frac{M_\omega(E(o_t))}{M_{\omega_t}(E(o_t))}\, A_\theta(o_t^i, u_t^i, u_{t-1}, R_t^i, M_{\omega_t})\right] \\
&\text{s.t.}\;\; \mathbb{E}_{o \sim \rho_{\pi_\theta}}\left[D_{KL}\big(M_{\omega_t}(\cdot \mid E(o_t))\,\|\,M_\omega(\cdot \mid E(o_t))\big)\right] \le \delta
\end{aligned}
\tag{6}
$$

where $E(o_t)$ is the input to the CANM network, i.e. the encoded observations of all current agents. We propose using TRPO to update CANM because it ensures monotonic improvement. The CANM iteration algorithm, shown in Algorithm 1, iterates the parameters until convergence under the fixed policy network.

Algorithm 1 CANM Iteration Algorithm
  Initialize $M_0$
  Obtain $\pi_\theta$
  for $t = 0, 1, 2, \ldots$ until convergence do
    Compute all advantage values $A_\theta(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t)$ by $\pi_\theta$
    Solve the constrained optimization problem
      $\eta(M_{t+1}) = \underset{M}{\text{maximize}}\; \big[L_{M_t}(M) - C\, D_{KL}^{\max}(M_t, M)\big]$
      where $C = \frac{4\varepsilon\gamma}{(1-\gamma)^2}$
      and $L_{M_t}(M) = \eta(M_t) + \sum_{o_t} \rho_{M_t} \sum_{u_t^i} \pi_\theta(u_t^i \mid o_t^i, u_{t-1}, R_t^i, M_t)\, A_\theta(o_t^i, u_t^i, u_{t-1}, R_t^i, M_t)$
  end for

Algorithm 2 Attentional Communication Model
  Obtain $o_t$, $u_{t-1}$, $n$   // $o_t$ is the observations of all agents at time step $t$
  $M_t \leftarrow \mathrm{CANM}(E(o_t))$
  for $i = 1$ to $n$ do
    for $j = 1$ to $n$ do
      $R_t^{ij} \leftarrow \mathrm{CANR}(E(o_t))$
      $c_i \leftarrow [c_i;\, [R_t^{ij} \cdot M_t(E(o_t)) \cdot o_t^j];\, [R_t^{ij} \cdot u_{t-1}]]$
    end for
    $u_t \leftarrow \mathrm{Policy}(c_i, o_t^j)$
  end for
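The constrained updates in Eqs. (4) to (6) share one pattern: an importance-weighted surrogate objective is improved while the KL divergence from the current distribution stays within a budget δ. The sketch below illustrates only that accept-or-reject check for small discrete distributions. It is a simplification under stated assumptions: TRPO itself solves the problem with a conjugate-gradient step and a line search, and the function names, probabilities and advantages here are hypothetical.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def surrogate(new_probs: List[float], old_probs: List[float],
              advantages: List[float]) -> float:
    """Sample average of (new / old) * advantage, the importance-weighted objective."""
    terms = [n / o * a for n, o, a in zip(new_probs, old_probs, advantages)]
    return sum(terms) / len(terms)


def accept_update(old_dist: List[float], new_dist: List[float],
                  new_probs: List[float], old_probs: List[float],
                  advantages: List[float], delta: float) -> bool:
    """Accept a candidate only if it improves the surrogate within the KL budget."""
    improves = (surrogate(new_probs, old_probs, advantages)
                > surrogate(old_probs, old_probs, advantages))
    within_trust_region = kl_divergence(old_dist, new_dist) <= delta
    return improves and within_trust_region


old_dist = [0.5, 0.5]            # current distribution over two choices
new_dist = [0.6, 0.4]            # candidate distribution after an update
old_probs = [0.5, 0.5, 0.5]      # probability of each sampled choice (old)
new_probs = [0.6, 0.4, 0.6]      # probability of the same choices (candidate)
sample_adv = [1.0, -1.0, 0.5]    # hypothetical advantages of the samples

accepted = accept_update(old_dist, new_dist, new_probs, old_probs, sample_adv, delta=0.05)
rejected = accept_update(old_dist, new_dist, new_probs, old_probs, sample_adv, delta=0.01)
```

With the larger budget the candidate is accepted; with the smaller one the same candidate is rejected because it moves too far from the current distribution, which is the role δ plays in the constraint.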

3.4 Attentional Communication Model

The final communication is:

$$
c_i = \Big[\big[[R^{i1} \cdot M_\omega(E(o_t)) \cdot o_t^1];\, [R^{i1} \cdot u_{t-1}]\big];\; \ldots;\; \big[[R^{in} \cdot M_\omega(E(o_t)) \cdot o_t^n];\, [R^{in} \cdot u_{t-1}]\big]\Big]
\tag{7}
$$

where $c_i$ represents the communication of agent $i$. The ACM algorithm is given in Algorithm 2. First, the observations of the agents are distilled into the communication message by CANM. Second, the communication routing is determined by CANR. Third, the communication routing and messaging are combined to construct the communication protocol. Finally, the communication is passed to the policy to select the action.
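The four steps above can be sketched for a toy case as follows. This is a minimal illustration under assumptions: the routing weights are obtained with a softmax over given scores, the products in Eq. (7) are read as scalar scaling by the routing weight and element-wise weighting by the message, and the trained CANR and CANM networks are replaced by fixed hypothetical numbers.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_communication(i: int,
                        routing_scores: List[List[float]],
                        message_weights: List[float],
                        observations: List[List[float]],
                        prev_actions: List[float]) -> List[float]:
    """Assemble c_i by concatenating, for every agent j, the routing-weighted
    distilled observation of j and the routing-weighted previous actions."""
    routing = softmax(routing_scores[i])  # stands in for R^{i1}, ..., R^{in}
    c_i: List[float] = []
    for j, obs_j in enumerate(observations):
        distilled = [routing[j] * w * o for w, o in zip(message_weights, obs_j)]
        weighted_actions = [routing[j] * u for u in prev_actions]
        c_i.extend(distilled + weighted_actions)
    return c_i


# Two agents with two-dimensional observations (all values hypothetical).
routing_scores = [[0.0, 0.0], [1.0, 0.0]]   # stand-in for CANR scores
message_weights = [1.0, 0.0]                # stand-in for CANM: keep dimension 0 only
observations = [[2.0, 3.0], [4.0, 5.0]]
prev_actions = [1.0, -1.0]

c_0 = build_communication(0, routing_scores, message_weights, observations, prev_actions)
```

The resulting vector has n · (d + n) entries, and the observation dimension that the message weights set to zero contributes nothing, which is the distillation effect described in Sect. 3.3.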

4 Experiment

In this section, we compare ACM with baseline methods to demonstrate that our model achieves better results than competing approaches. We test two tasks, covering a discrete and a continuous environment. In all experiments, we apply the TRPO algorithm to learn the policy. The number of agents is 10.

4.1 Environment

Pursuit. The state-action space of Pursuit is discrete. The environment contains two types of agents, pursuers and evaders. We train the pursuers, whose goal is to catch the evaders as quickly as possible. In the experiment, the evaders follow a random policy. The agent receives a reward of +5 when the pursuers catch an evader. We also set a shaping reward of 0.01 for encountering an evader to ease exploration. The agents' observations include information about their surroundings, such as the locations of the nearest pursuer and evader.

Coordinating Bipedal Walkers. Multi-walker is a continuous environment. Each walker consists of a pair of legs, and the goal is for multiple agents to deliver a box in a coordinated manner. The reward is −100 when the box drops and 1 for moving forward. The walkers also receive an action penalty defined as the squared norm of the force applied. Each walker observes the terrain, the location of adjacent walkers, and the package.

4.2 Experimental Settings

• Shared Parameters-TRPO (SP-TRPO): Sharing parameters among all agents is the basic form of the TRPO algorithm applied to multi-agent problems.
• Communication All-TRPO (CA-TRPO): The current observations and the last actions of all agents are taken as the message. All agents use the same communication.
• CANR-TRPO¹: The current observations and previous actions of all agents are used as the communication content. CANR dynamically calculates the relationships among agents to determine the routing.
• ACM: The messaging and routing of the agents are constructed dynamically, based on the environment, using ACM.

4.3 Experimental Results Evaluation

In Fig. 3 we compare the performance of our method, ACM, with SP-TRPO, CA-TRPO and CANR-TRPO in different environments. The average reward of ACM is clearly higher than that of the other baselines in both the discrete and the continuous environment. As expected, this indicates that ACM is effective for multi-agent collaboration.

Fig. 3. Average returns for multi-agent policies: (a) Pursuit, (b) Multi-Walker.

It can be seen in Fig. 3(a) that, for the discrete problem, the final results of ACM and CANR-TRPO are not significantly different: because the dimension of the state-action space is small, the agent can handle it well. In contrast, in Fig. 3(b) the results of ACM are much higher than those of CANR-TRPO, which are slightly higher than those of CA-TRPO, indicating that information distillation has a significant impact on the experimental results. Therefore, the proposed ACM is more suitable for continuous problems with a complex state space. The message distillation results are visualized in Fig. 4. It can be seen that the importance of information changes as the strategy is updated. Compared with Pursuit, the differences in the final message distillation of Multi-Walker are greater. In Fig. 4(b), the overall velocity of the agent (dimensions 2, 3 and 4) can be regarded as abstract information compared with the velocity of each joint in each leg of each agent (dimensions 5 to 12). The values of dimensions 2, 3 and 4 show an upward trend, indicating that more attention is paid to the overall speed of the agent. The values of dimensions 5 to 12 decrease, indicating that the attention on the turning point of each joint decreases and that more attention is paid to abstract information. This shows that agents at different stages of training need different information: junior agents may require more specific information, whereas advanced agents may place more emphasis on abstract information and less on specific information.

Fig. 4. Message distillation diagram: (a) Pursuit, (b) Multi-Walker. The horizontal coordinate represents the information dimension. We use different colors to represent the message distillation results.

¹ CANR-TRPO is built using the idea of [9], except that the TRPO algorithm is used here to build the strategy.

5 Conclusions

Recently, single-agent tasks have seen great progress, but multi-agent problems are still beset with difficulties. We develop a new model, ACM, for multi-agent collaboration within the SP-POMDP framework. ACM is a multi-agent attentional communication model used to dynamically build the communication protocol. The experimental results show that the proposed ACM helps the agents collaborate and accelerates their learning.

References

1. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015)
2. Dobbe, R., Fridovich-Keil, D., Tomlin, C.: Fully decentralized policies for multi-agent systems: an information theoretic approach. In: Advances in Neural Information Processing Systems, pp. 2945–2954 (2017)
3. Foerster, J., Assael, Y., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137–2145 (2016)
4. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926 (2017)
5. Foerster, J.N., Chen, R.Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., Mordatch, I.: Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326 (2017)
6. Ghosh, A., Kulharia, V., Namboodiri, V.: Message passing multi-agent gans. arXiv preprint arXiv:1612.01294 (2016)
7. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using deep reinforcement learning. In: Sukthankar, G., Rodriguez-Aguilar, J.A. (eds.) AAMAS 2017. LNCS (LNAI), vol. 10642, pp. 66–83. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71682-4_5
8. Hermann, K.M., et al.: Teaching machines to read and comprehend. In: Advances in Neural Information Processing Systems, pp. 1693–1701 (2015)
9. Hoshen, Y.: Vain: attentional multi-agent predictive modeling. In: Advances in Neural Information Processing Systems, pp. 2698–2708 (2017)
10. Hüttenrauch, M., Šošić, A., Neumann, G.: Learning complex swarm behaviors by exploiting local communication protocols with deep reinforcement learning. arXiv preprint arXiv:1709.07224 (2017)
11. Kurek, M., Jaśkowski, W.: Heterogeneous team deep q-learning in low-dimensional multiagent environments. In: 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE (2016)
12. Lanctot, M., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4191–4204 (2017)
13. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. In: Proceedings of the 16th Conference on Autonomous Agents and Multi-agent Systems, pp. 464–473. International Foundation for Autonomous Agents and Multiagent Systems (2017)
14. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275 (2017)
15. Mao, H., et al.: ACCNet: Actor-coordinator-critic net for “learning-to-communicate” with deep multi-agent reinforcement learning. arXiv preprint arXiv:1706.03235 (2017)
16. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
17. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897 (2015)
18. da Silva, F.L., Glatt, R., Costa, A.H.R.: Simultaneously learning and advising in multi-agent reinforcement learning. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1100–1108. International Foundation for Autonomous Agents and Multiagent Systems (2017)
19. Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
20. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
21. Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with back propagation. In: Advances in Neural Information Processing Systems, pp. 2244–2252 (2016)
22. Tan, M.: Multi-agent reinforcement learning: independent vs. cooperative agents. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337 (1993)

Improving Fuel Economy with LSTM Networks and Reinforcement Learning

Andreas Bougiouklis(✉), Antonis Korkofigkas, and Giorgos Stamou

National Technical University of Athens, Athens, Greece
[emailprotected]

Abstract. This paper presents a system for calculating the optimum velocities and trajectories of an electric vehicle for a specific route. Our objective is to minimize the consumption over a trip without impacting the overall trip time. The system uses a particular segmentation of the route and involves a three-step procedure. In the first step, a neural network is trained on telemetry data to model the consumption of the vehicle based on its velocity and the surface gradient. In the second step, two Q-learning algorithms compute the optimum velocities and the racing line in order to minimize the consumption. In the final step, the computed data is presented to the driver through an interactive application. This system was installed on a light electric vehicle (LEV) and, by adopting the suggested driving strategy, we reduced its consumption by 24.03% with respect to the classic constant-speed control technique.

Keywords: Trajectory optimization · Velocity profile · Racing line · Topographical data · Electric vehicle · LEV · Neural network · LSTM · Reinforcement learning · Q-Learning

1

Introduction

Over the last decade there has been a great eﬀort to reduce fuel consumption and dependence on fossil fuels. Electric cars have limited autonomy due to the poor energy density of current batteries. The need for ever improving autonomy has stimulated research aiming at developing control strategies that exploit the characteristics of a particular terrain [1,2]. These strategies regard the best overall velocities and trajectory that a vehicle has to maintain in order to minimize energy consumption. We will refer to these strategies as trajectory optimization. A plethora of research has been conducted on vehicle trajectory optimization [3–7]. Most of the publications use theoretical analysis and Newtonian physics in order to model the consumption characteristics of a vehicle. In this paper, we present an architecture based on Long Short Temp Memory Networks (LSTM) [8] that model the consumption of a light electric vehicle (LEV). The Neural Network (NN) has been trained with data acquired from the vehicle during the European Shell Eco Marathon. c Springer Nature Switzerland AG 2018 V. K˚ urkov´ a et al. (Eds.): ICANN 2018, LNCS 11140, pp. 230–239, 2018. https://doi.org/10.1007/978-3-030-01421-6_23

Improving Fuel Economy with LSTM Networks and Reinforcement Learning


Researchers have found that real-time speed guidance decreases long-term fuel consumption by up to 47% [6]. For the purpose of this paper we conducted a parametric analysis of a Q-Learning algorithm [9] to approximate the optimum velocity profile for an LEV. The optimum velocity profile is a sequence whose elements are the velocities that a vehicle has to maintain on every part of a specific route in order to minimize consumption. A second Q-Learning algorithm approximates the racing line [10,11], which is the optimum trajectory the driver has to follow on the track.

Because the driver has to remain focused on the road, the calculated data has to be presented in a simple and intuitive way. For that reason, we propose an interactive system which can be mounted on the steering wheel of a vehicle and guides the driver through a simple graphical interface. The completed system was installed on an LEV and was tested on a closed test track during the European Shell Eco Marathon 2017.

2 Onboard Systems

2.1 Vehicle Characteristics

The vehicle under study is a three-wheel light electric vehicle (LEV) equipped with an in-wheel Surface Mounted Permanent Magnet (SMPM) motor with Fractional Slot Concentrated Windings (FSCW) and unequal stator teeth, comprising 16 poles and 18 slots, which is mounted in the rear wheel [7]. The electric motor is driven by a three-phase, two-level bridge voltage source inverter. The system is fed by lithium batteries.

2.2 Data Gathering System

The vehicle is equipped with a telemetry system, which measures the phase current of the motor, and a Global Positioning System (GPS) receiver, which provides the elevation, velocity, latitude and longitude of the vehicle. The sampling rate of the system is 50 Hz. The gathered data is used to train the NN consumption model and the Q-Learning algorithms offline.

2.3 Online Monitoring System

The online monitoring system uses sensors to oversee the dynamics of the vehicle and presents the guiding application to the driver. The exact position of the vehicle is calculated by two Lidar V3 sensors, which measure its distance from the boundaries of the testing track, and one GPS sensor. The current speed of the vehicle is measured with a speedometer, and the exact position of the steering wheel is captured by a linear potentiometer mounted on the steering wheel column. As a computational unit we use a Raspberry Pi 3 with a monitor.

A. Bougiouklis et al.

3 Neural Network Consumption Model

The LSTM NN [8] consumption model uses elevation, E(n), and velocity, V(n), sequences as inputs. The output is the consumed phase current of the electric motor, I(n). For the hidden layers we chose LSTM units with a sigmoid activation function. Finally, the output layer is a single fully connected node. The selected architecture has 50 LSTM nodes in the first layer and 35 LSTM nodes in the second. We arrived at this particular layout by examining all the combinations of nodes between 0 and 100 for a two-layer NN. This architecture minimizes the mean squared error of the training process.

The data acquired from the telemetry system is first passed through a cleaning process. In this stage, we remove data corresponding to negative phase current values. These are invalid elements, as the vehicle does not have regenerative braking and the phase current is therefore never negative. Zero latitude and longitude measurements are also removed; the zero values correspond to loss of signal on the GPS sensor. At a 50 Hz sampling rate the removed values are a very small fraction of the training set, and therefore we did not otherwise alter the sequence of the data. The inputs fed into the network are sequences of 300 elements each, and the output is the sum of the consumed phase current over these measurements.

The training process was conducted with a training data set of 658,500 measurements, which correspond to 13 laps from the Eco Marathon 2016 and 9 laps from the Eco Marathon 2017. The validation set consisted of 36,000 elements from 1 lap on the 2017 track. The neural network was implemented with the Keras library. We used the RMSprop optimizer and early stopping with the patience parameter set to 13 epochs. Finally, the learning rate was set to 0.001. After training, the mean squared error was 0.3972 (Figs. 1 and 2).
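The cleaning and windowing steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the tuple layout of a sample, the function names `clean` and `make_windows`, and the choice of non-overlapping windows are all assumptions, since the paper does not state how consecutive 300-element sequences are cut.

```python
from typing import List, Tuple

# Assumed sample layout: (elevation, velocity, phase_current, latitude, longitude).
Sample = Tuple[float, float, float, float, float]


def clean(samples: List[Sample]) -> List[Sample]:
    """Drop samples with a negative phase current or a lost GPS fix."""
    kept = []
    for elevation, velocity, current, lat, lon in samples:
        if current < 0:
            continue  # invalid: the vehicle has no regenerative braking
        if lat == 0.0 and lon == 0.0:
            continue  # zero coordinates indicate loss of GPS signal
        kept.append((elevation, velocity, current, lat, lon))
    return kept


def make_windows(samples: List[Sample], window: int = 300):
    """Cut cleaned samples into windows (non-overlapping: an assumption).

    Each input is a sequence of (elevation, velocity) pairs and each target
    is the sum of the phase current over the same window.
    """
    inputs, targets = [], []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        inputs.append([(e, v) for e, v, _, _, _ in chunk])
        targets.append(sum(c for _, _, c, _, _ in chunk))
    return inputs, targets
```

With 50 Hz telemetry, a window of 300 samples covers 6 s of driving, so each training target is the current consumed over that interval.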

Fig. 1. The real consumption of the vehicle and the approximation of the NN for an unseen dataset. The vertical lines mark the changes of monotonicity of the consumption curve. The network is able to identify the areas of the track in which the monotonicity of the vehicle's consumption differs. The difference between the consumed current of the test lap and the approximation of the network is 17.48%.


Fig. 2. As a baseline we designed a Multilayer Perceptron (MLP) network [13] with the same architecture as the LSTM NN described above. The approximation of the MLP is presented here.

4 Velocity Optimization

4.1 Q-Learning Algorithm Design

The grade of the driving surface is the most important factor when deciding the fuel-optimal velocity profile [1]. Therefore, to define the states of the environment we examined the elevation data from the GPS system. We segmented the route according to the monotonicity of the elevation: every time the monotonicity changes, a new state begins. The results of this segmentation are presented in Fig. 3b. According to our experiments, a larger set of track segments does not lead to better results.

The second step of the algorithm design concerns the admissible actions, which correspond to admissible values of speed. All velocities that the agent is able to choose from satisfy

$$v_{\min} \le v \le v_{\max} \tag{1}$$

where $v_{\min}$ = 10 km/h (2.778 m/s) and $v_{\max}$ = 45 km/h (12.5 m/s). The policy is initialized with the desired average velocity for every state. The other constants of the algorithm are set as follows. The discount factor is set to 0, as the reward from the next state is not affected by the action taken. The learning rate is set to 0.7. Finally, we assume that the policy has converged when it does not change for 2000 epochs.

4.2 Reward Function

The reward function has the greatest impact on successfully training the agent. It consists of two components. First, it evaluates whether the desired average speed is being maintained. Second, it evaluates whether the policy of the agent is an improvement on the consumption of the previous strategy. These are the two criteria that the algorithm has to meet.

The time reward concerning the average velocity of the vehicle is a constant. Specifically, if the average speed is within the desired margin the reward is 0.5; otherwise it is −0.5. To approximate the average velocity, the algorithm computes the average of the velocities $v_i$ weighted by the length $w_i$ of each state:

$$\mu = \frac{\sum_i w_i \times v_i}{\sum_i w_i} \tag{2}$$

The desired margin is $(m - 1, m + 1)$, where $m$ is the desired average speed.

To approximate the consumption of the vehicle for every velocity profile, the NN described above is used. Every time the agent chooses to maintain a specific speed within the boundaries of a state, the NN calculates the consumed energy for the entire trip. This approximation is then subtracted from the policy's consumption, and the difference $d$ is multiplied by a scaling factor $k$, which keeps the balance between the two rewards. To set the value of $k$ we conducted a statistical analysis of the data and found that the expected value of the difference between the policy consumption and the new approximation is

$$E[d] = 32.369 \tag{3}$$

Thus, in order to balance the two amounts we set $k = 0.02$. With this setting the expected value of the consumption reward is

$$k \times E[d] = 0.647 \tag{4}$$

which is comparable in magnitude to the time reward of 0.5, so the two rewards are balanced. The final reward is equal to the sum of the time reward and the consumption reward.
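The two-part reward described above (Eqs. 2 to 4) can be sketched as follows. The function names are hypothetical, and the consumption values are passed in as plain numbers; in the paper they come from the NN consumption model, which is not reproduced here.

```python
def weighted_average_speed(lengths, speeds):
    """Eq. 2: average of the state velocities weighted by state length."""
    return sum(w * v for w, v in zip(lengths, speeds)) / sum(lengths)


def time_reward(avg_speed, target, margin=1.0):
    """+0.5 if the average speed lies in (target - margin, target + margin), else -0.5."""
    return 0.5 if target - margin < avg_speed < target + margin else -0.5


def consumption_reward(policy_consumption, new_consumption, k=0.02):
    """Scaled difference d between the policy's consumption and the new estimate."""
    return k * (policy_consumption - new_consumption)


def total_reward(lengths, speeds, target, policy_consumption, new_consumption, k=0.02):
    """Sum of the time reward and the consumption reward."""
    mu = weighted_average_speed(lengths, speeds)
    return time_reward(mu, target) + consumption_reward(policy_consumption, new_consumption, k)
```

With k = 0.02 and a difference equal to the reported expectation of 32.369, the consumption term evaluates to about 0.647, which is why it neither dominates nor vanishes next to the ±0.5 time term.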

Fig. 3. (a) The testing track map. (b) Track segmentation based on the elevation data.
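The elevation-based segmentation of Fig. 3b, in which a new state starts whenever the monotonicity of the elevation changes, can be sketched as below. The function name and the `tolerance` parameter (used to ignore small fluctuations in the GPS elevation) are assumptions introduced for illustration and do not appear in the paper.

```python
def segment_by_monotonicity(elevation, tolerance=0.0):
    """Return (start, end) index pairs of maximal monotonic runs of the elevation.

    A change smaller than or equal to `tolerance` is treated as flat and does
    not end a segment. Neighbouring segments share their turning point.
    """
    if not elevation:
        return []
    segments = []
    start = 0
    direction = 0  # +1 climbing, -1 descending, 0 not yet known
    for i in range(1, len(elevation)):
        delta = elevation[i] - elevation[i - 1]
        if abs(delta) <= tolerance:
            continue  # flat step: keep the current direction
        step = 1 if delta > 0 else -1
        if direction != 0 and step != direction:
            segments.append((start, i - 1))  # the run ends at the turning point
            start = i - 1
        direction = step
    segments.append((start, len(elevation) - 1))
    return segments
```

Each returned segment would correspond to one state of the velocity-optimization agent; on real GPS data a nonzero tolerance or prior smoothing is likely needed to avoid a very large number of short segments.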

4.3 Action Selection Strategy

The primary challenge is choosing the agent's action. We used the ε-greedy strategy to balance exploitation and exploration by behaving greedily most of the time, while choosing a uniformly random action with a small probability p:

$$\text{selection strategy} = \begin{cases} \text{random choice of action} & \text{with probability } p \\ \arg\max_{a \in A} Q(s, a) & \text{with probability } 1 - p \end{cases} \tag{5}$$

The probability that we used is

$$p = e^{-n \times \varepsilon} \tag{6}$$

An initial analysis was conducted to establish the most suitable exploration rate. To test the decaying setting, the value of ε in Eq. 6 was set to 0.001, 0.0001, and 0.00001. For the gradually decreasing exploration method, the value ε = 0.0001 was selected for this study, as it provides the best performance over the other values (Table 1).

Table 1. Experimentation with the ε-value

ε-value    Convergence speed    Consumption approximation
0.001      3870                 4.449
0.0001     4970                 4.408
0.00001    8620                 4.428

5 The Racing Line

5.1 Introduction

The racing line is the trajectory that a driver should follow to achieve the best lap time on a given track with a given car. The racing line depends on several factors [10,11], including the track shape, the tire grip and the mass of the vehicle. Besides engine dynamics, the maximum velocity of a vehicle depends on the following parameters:

$$v_{\max} = \sqrt{\frac{F \times r}{m}} \tag{7}$$

where F is the gripping force from the tires, m is the mass of the vehicle and r is the radius of the circle which is tangent to the trajectory of the vehicle. To approximate the racing line we calculate the trajectory that maximizes the radius of the tangent circle.


Fig. 4. (a) The chosen optimum velocity profile for the track of Fig. 3a with a desired average speed of 25 km/h. The actual average velocity of the profile is 25.975 km/h. (b) The elevation profile of the test track (orange) and the velocity profile (blue). (c) Velocity profiles with different average speeds of 24.8, 25.9, 27.1, 28.7 and 29.8 km/h; the same strategy is used for various trip times. (Color figure online)

5.2 Algorithm Design

The trajectory is represented by points of the track. Each element of the trajectory is a state of the environment. The agent is able to move each point across the width of the track. The average width of this specific route is 6 m and we set the movement step to 1 m. All the calculations were conducted with the latitude and longitude coordinates. The policy is initialized with the trajectory which corresponds to the middle of the track. We set the discount factor to 0 and the learning rate to 0.001, and we assumed that the policy has converged when it does not change for 10000 epochs.

5.3 Reward Function

As Eq. 7 shows, the goal of the agent is to approximate the trajectory of the vehicle which maximizes the radius of the tangent circle. For the purpose of this study every circle is defined by three elements of the trajectory (A, B and C), through which it passes. To calculate the radius, we first construct the two line segments which connect these elements. Second, we find the intersection point M of their perpendicular bisectors. Finally, we measure the distance between A and M; this distance is the radius of the tangent circle of the trajectory.
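The radius construction described above amounts to computing the circumscribed circle of three points. A minimal sketch follows; it assumes planar (x, y) coordinates in metres rather than the latitude and longitude values used in the paper, and the function name is hypothetical.

```python
import math


def circumradius(a, b, c):
    """Radius of the circle through points a, b and c, given as (x, y) pairs.

    The centre M is the intersection of the perpendicular bisectors of AB and
    BC; the radius is the distance from A to M. Collinear points have no
    finite circle, so math.inf is returned for them.
    """
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return math.inf  # straight section: infinite radius
    a2 = ax * ax + ay * ay
    b2 = bx * bx + by * by
    c2 = cx * cx + cy * cy
    mx = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    my = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return math.hypot(ax - mx, ay - my)
```

A larger radius means a gentler curve and, by Eq. 7, a higher attainable speed through that part of the track.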


In every iteration one point $P_i$ of the trajectory is moved and the radii of three tangent circles are calculated. Every circle is defined by three successive points. The first circle (a) is defined by the point being moved and the two previous points on the trajectory ($P_{i-1}$, $P_{i-2}$), the second one (b) by $P_i$ and the two points ahead ($P_{i+1}$, $P_{i+2}$), and the third one (c) by $P_{i-1}$, $P_i$, $P_{i+1}$. The reward is equal to the weighted sum of the differences between the radii obtained with the new action and those of the agent's policy:

$$r = (r_a - r_{policy_a}) \times i + (r_b - r_{policy_b}) \times j + (r_c - r_{policy_c}) \times k \tag{8}$$

where $r_a$, $r_b$, $r_c$ are the radii of the new trajectory and $r_{policy_a}$, $r_{policy_b}$, $r_{policy_c}$ those of the policy's trajectory. Furthermore, i, j, k are constant parameters whose values have been set to i = 1, j = 0.1 and k = 0.1. These values have been optimized for this particular track and are the result of experimentation in the range [0, 1].

5.4 Action Selection Strategy

An initial analysis was conducted to establish the most suitable exploration rate for the ε-greedy strategy. The value of ε in Eq. 6 was set to 0.01, 0.001, 0.0001, and 0.00001. For the gradually decreasing exploration method, the value ε = 0.001 was selected for this study, as it provides the best performance over the other values (Table 2 and Fig. 5).

Table 2. Experimentation with the ε-value

ε-value    Convergence speed    % of total state-action pairs visited
0.01       1800                 84.877
0.001      9600                 85.34
0.0001     68000                85.031
0.00001    999996               94.256
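The decaying exploration scheme of Eqs. 5 and 6, used for both Q-Learning agents, can be sketched as follows. The paper does not define n explicitly; the sketch assumes it counts training epochs, and the function names are hypothetical.

```python
import math
import random


def exploration_probability(n, epsilon):
    """Eq. 6: p = exp(-n * epsilon), with n assumed to be the epoch counter."""
    return math.exp(-n * epsilon)


def select_action(q_values, n, epsilon, rng=random):
    """Eq. 5: random action with probability p, greedy action otherwise.

    `q_values` holds Q(s, a) for every action a in the current state, and the
    return value is the index of the chosen action.
    """
    if rng.random() < exploration_probability(n, epsilon):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

A smaller ε makes p decay more slowly, so the agent explores for longer; this matches the trend in the tables, where smaller ε-values need more epochs to converge.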

Fig. 5. The red points indicate the racing line approximation of the algorithm and the green points the middle line of the track. As it is shown the tangent circles of the racing line always have bigger radii, thus the racing line approximation is a better solution than the middle line. (Color ﬁgure online)
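The comparison of radii in Fig. 5 reflects the reward of Eq. 8, which the text describes as a weighted sum of radius differences between the new action and the current policy. A minimal sketch under that reading is given below; the `cap` parameter is an assumption added here so that straight sections with an infinite radius still produce a finite reward, and it is not part of the paper.

```python
def racing_line_reward(new_radii, policy_radii, weights=(1.0, 0.1, 0.1), cap=1000.0):
    """Weighted sum of radius differences for the circles (a), (b) and (c).

    `new_radii` and `policy_radii` are (r_a, r_b, r_c) triples; `weights` are
    the constants (i, j, k) reported in the paper. Radii are clipped to `cap`
    (hypothetical) before the differences are taken.
    """
    return sum(
        w * (min(r_new, cap) - min(r_old, cap))
        for w, r_new, r_old in zip(weights, new_radii, policy_radii)
    )
```

A positive reward means that moving the point enlarged the tangent circles overall, with the circle behind the moved point weighted ten times more than the other two.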

6 Interface

We developed an application which can be mounted on the steering wheel of a vehicle and present in real time the optimum trajectory and the velocity proﬁle. The interface includes a red ball which moves up or down if the vehicle’s cruising speed is too fast or too slow and tends to go right or left if the vehicle does not follow the racing line. The driver can make the appropriate corrections by moving the steering wheel or pushing the throttle. The behavior of the driver is always monitored by the system described in Sect. 2. The completed system is simple and its minimalistic design leaves the concentration of the driver unaﬀected.

Fig. 6. (a) The implementation of the graphical interface on the Raspberry Pi. (b) Illustration of a game scenario.

7 Testing and Results

The presented system was installed on the LEV described in Sect. 2. To test the results of the system and the behavior of the driver while the system is running, we conducted the following experiment during the European Shell Eco Marathon 2017. First, we asked the driver to maintain a constant velocity on the closed track mentioned above, which is 1.7 km in length, as a regular cruise control system would do. Second, we used the suggested system and the driver tried to follow the optimum trajectory while driving with the same desired average velocity. The experiment showed that the driver was able to follow the instructions of the system, and by adopting the suggested driving strategy the total consumption was reduced by 24.03% (Fig. 7).

Fig. 7. Graph 7a shows the generated power from the engine of the LEV when the driver maintains a constant velocity of 25 km/h on the testing track. Graph 7b shows the generated power when the driver uses our system with the velocity profile of Fig. 4a and the racing line of Fig. 6a.

Acknowledgements. We would like to thank the Prometheus research team of the National Technical University of Athens for providing the LEV for the research.

References

1. Kamal, M.A.S., Mukai, M., Murata, J., Kawabe, T.: Ecological vehicle control on roads with up-down slopes. IEEE Trans. Intell. Transp. Syst. 12(3), 783–794 (2011)
2. Lin, Y.-C., Nguyen, H.L.T.: Development of an eco-cruise control system based on digital topographical data. Inventions 1(3), 19 (2016)
3. Gilbert, E.G.: Vehicle cruise: improved fuel economy by periodic control. Automatica 12(2), 159–166 (1976)
4. Yi, Z., Bauer, P.H.: Optimal speed profiles for sustainable driving of electric vehicles. In: IEEE Vehicle Power and Propulsion Conference (VPPC), pp. 1–6, 19–22 (2015)
5. Chang, D.J., Morlok, E.K.: Vehicle speed profiles to minimize work and fuel consumption. J. Transp. Engrg. 131(3), 173–182 (2005)
6. Wu, X., He, X., Yu, G., Harmandayan, A., Wang, Y.: Energy-optimal speed control for electric vehicles on signalized arterials. IEEE Trans. Intell. Transp. Syst. 16(5) (2015)
7. Sivak, M., Schoettle, B.: Eco-driving: Strategic, Tactical, and Operational Decisions of the Driver that Influence Vehicle Fuel Economy. Elsevier Ltd. (2012)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
9. Gamage, H.D., (Brian) Lee, J.: Machine learning approach for self-learning eco-speed control. In: Australasian Transport Research Forum (2016)
10. Sharp, R.S., Casanova, D.: On minimum time vehicle manoeuvring: the theoretical optimal time. Ph.D. thesis, Cranfield University (2000)
11. Braghin, F., Cheli, F., Melzi, S., Sabbioni, E.: Race driver model. Comput. Struct. 86(13–14), 1503–1516 (2008)
12. Medsker, L.R., Jain, L.C.: Recurrent Neural Networks Design and Applications. CRC Press, New York (2001)
13. Mazroua, A.A., Salama, M.M.A., Bartnikas, R.: PD pattern recognition with neural networks using the multilayer perceptron technique. IEEE Trans. Electr. Insulation (1993)

Action Markets in Deep Multi-Agent Reinforcement Learning

Kyrill Schmid(B), Lenz Belzner, Thomas Gabor, and Thomy Phan

Mobile and Distributed Systems Group, LMU Munich, Munich, Germany
{kyrill.schmid,belzner,thomy.phan,thomas.gabor}@ifi.lmu.de
http://www.mobile.ifi.lmu.de

Abstract. Recent work on learning in multi-agent systems (MAS) is concerned with the ability of self-interested agents to learn cooperative behavior. In many settings such as resource allocation tasks the lack of cooperative behavior can be seen as a consequence of wrong incentives. I.e., when agents can not freely exchange their resources then greediness is not uncooperative but only a consequence of reward maximization. In this work, we show how the introduction of markets helps to reduce the negative eﬀects of individual reward maximization. To study the emergence of trading behavior in MAS we use Deep Reinforcement Learning (RL) where agents are self-interested, independent learners represented through Deep Q-Networks (DQNs). Speciﬁcally, we propose Action Traders, referring to agents that can trade their atomic actions in exchange for environmental reward. For empirical evaluation we implemented action trading in the Coin Game – and ﬁnd that trading signiﬁcantly increases social eﬃciency in terms of overall reward compared to agents without action trading.

1 Introduction

The success of combining reinforcement learning (RL) and artificial neural networks (ANNs) in single-agent settings has also renewed interest in multi-agent reinforcement learning (MARL) [8,9,16,20]. In so-called independent learning each agent is represented by a neural network which is trained according to a specific learning rule such as Q-learning [12]. When agents are self-interested the emergent behavior is often suboptimal, as agents learn behavior w.r.t. their individual reward signal. In tasks such as resource allocation problems this leads to first-come, first-served strategies. The resulting allocations from such strategies are in general inefficient. An allocation is said to be inefficient if there is another allocation under which at least one agent has higher reward and all other agents have at least equally high rewards compared to the former allocation. While some work tries to mitigate greedy behavior based on game-theoretic strategies such as Tit-for-Tat [9], we argue that inefficiency can also be seen as a consequence of market failure. Specifically, many settings provide no incentives for agents to increase efficiency. I.e., as long as an agent's best alternative in terms of utility is being greedy, the learned behavior is rational rather than uncooperative. However, individual utility maximization can give rise to efficiency when agents are enabled to incentivize other agents. We call such a mechanism a market for behavior, as it enables agents to trade behavior in exchange for other resources, e.g. environmental reward. In the presence of a behavior market a utility-maximizing agent can invest to stimulate behavior.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 240–249, 2018. https://doi.org/10.1007/978-3-030-01421-6_24

Fig. 1. Two agents competing for a coin: while pure self-interested behavior without trading incentivizes agents to act greedily, the introduction of a market can help to increase both agents' expected value. (Color figure online)

Figure 1 illustrates how the introduction of a behavior market helps to overcome inefficiency in a stylized scenario. Suppose two agents are competing for a coin, where agent 1 (yellow) gains a reward of +1 while agent 2 (blue) gains a reward of +2 from gathering the coin. When each agent has a probability of 0.5 of getting the coin when both step forward, the expected values are 0.5 for agent 1 and 1 for agent 2. As each agent only considers its own reward, there is no incentive for agent 1 to leave the coin to agent 2, although this would maximize the overall outcome. This changes when agents are enabled to exchange reward for behavior. When able to trade, agent 2 could offer agent 1 a reward of +1 if agent 1 steps back. In this case both expected values are 1, which increases the overall reward.

The main contributions of this paper are:
– A definition of action trading as a realization of a behavior market.
– Empirical evidence that the introduction of markets is sufficient in order to increase efficiency in MAS.

The rest of this paper is organized as follows: Sect. 2 gives an overview of related work. Section 3 describes the learning methods. Section 4 introduces action trading. Finally, in Sect. 5 we evaluate action trading in two experiments comparing self-interested agents with and without action trading in a matrix game and the Coin Game.
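The arithmetic of the stylized coin scenario of Fig. 1 can be checked with a few lines of code. This is only an illustrative sketch; the function name and its parameters are hypothetical and chosen to mirror the numbers in the text.

```python
def expected_rewards(value_1, value_2, p_agent_1_wins, price=0.0, trade=False):
    """Expected rewards (agent 1, agent 2) in the two-agent coin scenario.

    Without a trade each agent obtains the coin with its winning probability.
    With a trade agent 1 steps back, agent 2 collects the coin for certain
    and pays `price` to agent 1.
    """
    if trade:
        return price, value_2 - price
    return p_agent_1_wins * value_1, (1 - p_agent_1_wins) * value_2


no_trade = expected_rewards(1, 2, 0.5)
with_trade = expected_rewards(1, 2, 0.5, price=1, trade=True)
```

Here `no_trade` evaluates to (0.5, 1.0) and `with_trade` to (1, 1): agent 1 is strictly better off, agent 2 is no worse off, and the sum of expected rewards rises from 1.5 to 2.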

2 Related Work

Independent and cooperative RL in multi-agent systems has been researched for decades [10,14,21]. Recent successes of both model-free and model-based deep RL extending classical approaches with learned abstractions of state and action spaces [12,17,19] motivated the use of deep RL also in multi-agent domains [1,3].


The tensions of cooperation, competitiveness, self-interest and social welfare have traditionally been researched in the framework of game theory [13]. Game theory has also been a central theoretic tool for studying multi-agent systems [18]. A recent line of research investigates game-theoretic considerations in multi-agent (deep) RL settings, extending the idea of classical games into the setting of sequential decision making under uncertainty [8,9,16,20].

In particular, to bring the concept of social dilemmas closer to real-world problems, the authors of [8] propose sequential social dilemmas (SSDs), where cooperation and competition cannot be seen as an atomic action but are represented through complex policies. In different experiments the authors show how learned behavior depends on the choice of environmental factors such as resource abundance. Through variation of these external properties the authors train different policies and classify these as cooperative or competitive, respectively. In this work we adopt the idea of SSDs with multiple independent agents, each represented through a deep Q-network. Still, in our analysis we do not focus on the emergence of cooperative policies through variation of environmental factors. Instead, we are interested in answering the question whether, in a system of autonomous, self-interested agents, the chance to make economic decisions leads to an efficient allocation of resources and hence increases social welfare.

In [20] the authors demonstrated how cooperative behavior emerges as a function of the rewarding scheme in the classic video game Pong. Agents, represented by autonomous Deep Q-Networks, learned strategies representing cooperation and competition, respectively, through modification of the reward function. In our approach we do not specify the rewarding scheme as a static property of the environment but rather as a changing structure through which agents can express their willingness to cooperate.

To deal with resource allocation in MARL, the authors in [11] propose resource abstraction, where each available resource is assigned to an abstract group. Abstract groups build the basis for new reward functions from which learning agents receive a more informative learning signal. Whereas the building of abstract resource groups, and hence the shaping of rewards, is done at design time, in this work the transformation of reward schemes is part of the learning process.

An approach to carry the successful Prisoner's Dilemma strategy tit-for-tat into complex environments has recently been made by Lerer and Peysakhovich [9]. In their work they construct tit-for-tat agents and show, through experiments and theoretically, their ability to maintain cooperation, while purely reactive training techniques are more likely to result in socially inefficient outcomes. The analysis of reward-trading agents is more interested in emergent properties than in implementing a fixed strategy. We therefore make no other assumption than agents maximizing their own returns.

3 Reinforcement Learning

For the purpose of this work we follow the line of descriptive approaches similar to [8]. Rather than asking what learning rule agents should use, we model each agent as a specific learner and observe the emergent system behavior. In this sense we model agents as independent learners, i.e., agents cannot observe each other but only recognize a changing environment which is the result of the learning of other agents. We apply methods from the framework of reinforcement learning, where it is known that independent learning results in non-stationarity as well as in the violation of the Markov property [4,7]. However, as [8] points out, in the descriptive approach this can be considered a feature rather than a bug, as it is an aspect of the real environment that the model captures.

Reinforcement learning (RL) comprises methods where an agent learns a policy π from repeated interaction with an environment. If multiple agents are involved the problem can be described as a so-called Stochastic Game (SG). Formally, an SG is a tuple $(S, N, A, T, R)$ where: $S$ is a finite set of states, $N$ is a finite set of $I$ players, $A = A_1 \times \dots \times A_I$ describes the joint-action space where $A_i$ is the finite action set of player $i$, $T : S \times A \times S \to \mathbb{R}$ is the transition function and $R = r_1, \dots, r_I$ where $r_i : S \times A \to \mathbb{R}$ is the reward function for player $i$ [4]. An agent's goal is to maximize its expected return, which is

$$R := \sum_{t=1}^{\infty} \gamma^{t-1} r_t$$

where $r_t$ is the reward received at time step $t$. An agent decides which actions to take in a certain state according to a policy π, which is a function $\pi : S \to P(A)$ from states to probability distributions over $A$. Over the course of training the agent is supposed to learn a policy that maximizes the expected return. One way to obtain a policy is to learn the action value function $Q : S \times A \to \mathbb{R}$ that gives the value of an action in a certain state. A popular way to learn the action value function is Q-learning, where an agent $i$ updates its values according to

$$Q_i(s, a) \leftarrow Q_i(s, a) + \alpha \left[ r_i + \gamma \max_{a' \in A_i} Q_i(s', a') - Q_i(s, a) \right]$$

where α is the learning rate and γ is a discount factor. From $Q$ a policy π can be derived by using, e.g., ε-greedy action selection, where with probability $1 - \varepsilon$ the agent selects an action with $\arg\max_{a \in A} Q(s, a)$ and with probability ε the agent selects an action uniformly at random from the available actions. In this work, we model agents as independent Q-learners. Deep RL refers to methods that use deep neural networks as function approximators. In deep multi-agent RL each agent can be represented by a deep Q-network (DQN) [12]. For independent learners, each agent stores a function $Q_i : S \times A_i \to \mathbb{R}$ that approximates the state-action values.
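The Q-learning update rule given above can be sketched for the tabular case as follows. The function name, the dictionary-based table and the default values of `alpha` and `gamma` are illustrative assumptions; the experiments with DQNs replace the table by a neural network.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.001, gamma=0.99):
    """One tabular Q-learning step for a single independent learner.

    `q` maps (state, action) pairs to values, and `actions` is the agent's own
    action set A_i. Returns the updated value Q(state, action).
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Usage: every independent learner keeps its own table.
q_table = defaultdict(float)
```

Because each agent updates only its own table from its own reward, the other agents appear to it as part of a changing environment, which is the source of the non-stationarity mentioned above.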

4 Action Trading

This section formally introduces action trading, which is realized through extending agents' action spaces. The idea of action trading is to let agents exchange environmental reward for atomic actions. Learning then comprises two parts: policies for the original action space of the stochastic game and a trading policy that represents an agent's trading behavior. To keep notation simple we define action trading for the two-agent case, i.e., $N = \{1, 2\}$. For a given stochastic game $(S, N, A, T, R)$, action trading is realized through extending the action spaces $A_1$ and $A_2$ in the following way: $A'_1 = A_1 \times (A_2 \times [0, \dots, N])$ and $A'_2 = A_2 \times (A_1 \times [0, \dots, N])$. I.e., the action spaces $A'_i$ comprise the original actions $a_{orig} \in A_i$ and also trading actions $a_{trade} \in A_j \times [0, \dots, N]$. A trading action $a_{trade}$ is a tuple $(a_{ij}, p)$ that defines the amount of reward $p \in [0, \dots, N]$ that agent $i$ is offering agent $j$ for an action $a_{ij}$. $p$ therefore is the price an agent pays; it is transferred from agent $i$ to agent $j$ if a trade is established. In this work we require a successful trade to satisfy two conditions. First, agent $i$ made an offer to agent $j$ at time step $t$ for action $a$, written as $a_{ij}$. Second, also at time step $t$, agent $j$ actually chose action $a$, written as $a_j$. Thus, a trade will only be established if offer and supply match at the same time step. The resulting rewards at time step $t$ in the two-agent scenario are

$$r_t^1 = R_1 + p_2 \cdot \delta_{21} - p_1 \cdot \delta_{12}$$

for agent 1 and

$$r_t^2 = R_2 + p_1 \cdot \delta_{12} - p_2 \cdot \delta_{21}$$

for agent 2, where $R_i$ represents the original environmental reward and the $\delta_{ij}$ are Boolean values that define successful trades, i.e.,

$$\delta_{ij} = \begin{cases} 1, & \text{if } a_{ij} = a_j, \\ 0, & \text{otherwise.} \end{cases}$$

Fig. 2. Action trading describes a mechanism to oﬀer other agents environmental reward in exchange for speciﬁc actions. Agents therefore choose in addition to their original actions also trading actions. A trade is realized when an oﬀer matches an actual action.

Figure 2 visualizes how action trading is realized. Agents select actions from their original action space and from the trading action space. Trading actions describe agents’ oﬀers towards other agents for speciﬁc actions. Whenever an oﬀer matches an actual performed action a trade is realized i.e., a ﬁxed amount of reward is transferred between the two involved agents.
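The extended action space and the reward transfer of a matched trade can be sketched as follows for the two-agent case. All function names are hypothetical, and prices are taken to be the integers 0 to `max_price`, which is one reading of the interval [0, ..., N] in the definition.

```python
from itertools import product


def extended_action_space(own_actions, other_actions, max_price):
    """Pairs (original action, (requested action of the other agent, price))."""
    prices = range(max_price + 1)
    return [(a, (b, p)) for a, b, p in product(own_actions, other_actions, prices)]


def delta(offered_action, chosen_action):
    """1 if the offer matches the action actually chosen, else 0."""
    return 1 if offered_action == chosen_action else 0


def traded_rewards(env_1, env_2, offer_1, offer_2, action_1, action_2):
    """Rewards of agents 1 and 2 after settling offers made at the same time step.

    `offer_i` is (requested action of the other agent, price p_i); `env_i` is
    the environmental reward R_i.
    """
    (want_1, p1), (want_2, p2) = offer_1, offer_2
    d12 = delta(want_1, action_2)  # agent 1's offer matched by agent 2
    d21 = delta(want_2, action_1)  # agent 2's offer matched by agent 1
    return env_1 + p2 * d21 - p1 * d12, env_2 + p1 * d12 - p2 * d21
```

Since every payment is subtracted from one agent and added to the other, trading leaves the sum of rewards at a given time step unchanged; it can only raise total reward indirectly, by changing which actions the agents learn to choose.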

5 Experiments

In this section, we describe two experiments. The first experiment is an iterated matrix game that has been extended to enable agents to trade actions. The second experiment is the Coin Game, which is used for studying sequential social dilemmas in the recent literature on multi-agent learning [2,9]. In all experiments we compared action traders with self-interested agents. To measure the social outcomes of multi-agent learning, it is necessary to define a metric, as the value function cannot be used as a performance metric like in single-agent RL. To measure efficiency, we use the total sum of rewards obtained by all agents over an episode of duration $T$, also called the Utilitarian metric (U), which is defined by [15]:

$$U = E\left[ \frac{\sum_{i=1}^{N} R^i}{T} \right]$$

where $R^i = \sum_{t=1}^{T} r_t^i$ is the return for agent $i$ for a sequence of rewards $\{r_t^i \mid t = 1, \dots, T\}$ over an episode of duration $T$. For the Coin Game the Utilitarian metric is complemented by the total number of collected coins and the share of correctly collected coins within one episode.

5.1 Iterated Matrix Game

To study the effects of action trading in a simple matrix game, we used a game with pay-offs as given in Fig. 3a. Action trading in the matrix game was realized by extending action spaces $A_i = \{1, 2\}$ to $A_i = \{(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)\}$, i.e., each agent decides what action to take from the original action space in combination with a trading action. The price in terms of reward is fixed at $p = 1$ for all actions. As learning rule we used tabular Q-learning with learning rate $\alpha = 0.001$. For action selection we used an $\epsilon$-greedy policy on the Q-function, with $\epsilon$ decaying from 1.0 to 0.1 over 2500 steps. The results from 100 runs, each comprising 2500 steps, are shown in Fig. 3. Independent learners without trading (blue) start to select the dominating action (1, 1) with high probability, which is reasonable as agent 2 only ever receives reward when choosing action 1. Likewise, agent 1 learns to choose action 1 as a best response to the selection of agent 2. In contrast, independent learners with action trading (green) have decreasing reward for around 1000 steps. Afterwards, overall reward increases steadily.
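The setup can be sketched as a small simulation. The following Python sketch makes several assumptions that the text does not specify: the second component of an extended action is read as the offer (0 meaning no offer), epsilon decays linearly, and the Q-update is stateless without bootstrapping. All function names are illustrative.

```python
import random

# (a1, a2) -> (R1, R2), read from the payoff table in Fig. 3a
PAYOFF = {
    (1, 1): (0.5, 0.5), (1, 2): (4.0, 0.0),
    (2, 1): (0.0, 1.0), (2, 2): (1.0, 0.0),
}
# Extended actions (own action, offer); offer 0 is assumed to mean "no offer"
ACTIONS = [(a, o) for a in (1, 2) for o in (0, 1, 2)]
PRICE = 1.0


def step(ext1, ext2):
    """Environmental payoff plus price transfers for matched offers."""
    (a1, o1), (a2, o2) = ext1, ext2
    r1, r2 = PAYOFF[(a1, a2)]
    if o1 == a2:  # agent 1 pays agent 2
        r1, r2 = r1 - PRICE, r2 + PRICE
    if o2 == a1:  # agent 2 pays agent 1
        r1, r2 = r1 + PRICE, r2 - PRICE
    return r1, r2


def epsilon(t, steps=2500, start=1.0, end=0.1):
    """Linear decay (assumed schedule) from `start` to `end`."""
    return start + (end - start) * min(t, steps) / steps


def run(steps=2500, alpha=0.001, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(2)]  # one Q-table per agent
    total = 0.0
    for t in range(steps):
        choice = []
        for agent in range(2):
            if rng.random() < epsilon(t):
                choice.append(rng.randrange(len(ACTIONS)))
            else:
                choice.append(max(range(len(ACTIONS)), key=q[agent].__getitem__))
        rewards = step(ACTIONS[choice[0]], ACTIONS[choice[1]])
        for agent in range(2):
            idx = choice[agent]
            q[agent][idx] += alpha * (rewards[agent] - q[agent][idx])
        total += sum(rewards)
    return q, total / steps
```

In this sketch the joint action (1, 2) yields the highest overall reward of 4.0, so an offer from agent 1 for agent 2's action 2 is the trade that can make the socially best outcome attractive to agent 2.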

(a) Payoffs (entries: agent 1 / agent 2)

                     Agent 2: 1     Agent 2: 2
    Agent 1: 1       0.5 / 0.5      4.0 / 0.0
    Agent 1: 2       0.0 / 1.0      1.0 / 0.0

(b) Overall reward [plot: reward over steps for trading and no-trading agents]

Fig. 3. 100 runs of the iterated matrix game with payoffs as given in the table (left). Whereas non-trading agents (blue) fail to find a global optimum, agents with action trading (green) eventually learn to maximize overall and individual reward. (Color figure online)

5.2 The Coin Game

To study the eﬀects of action trading in a problem with sequential decision making we adopt the Coin Game ﬁrst proposed in [9]. The Coin Game is a 2-dimensional grid world that includes 2 agents (blue and yellow) and their respective coins. The task is to collect the coins and agents get a reward of +1


for collecting coins of any color. However, whenever an agent collects a coin whose color differs from its own, the other agent loses two points. To evaluate the performance of action trading for $n > 2$ we also tested an extended version of the Coin Game comprising 4 agents. The 4-agent Coin Game works in the same way, i.e., agents have their associated coins and impose costs on a fellow agent whenever they collect a differently colored coin.

From the perspective of this work, the Coin Game can be seen as a task where resources (coins) need to be allocated to agents. When efficiency is measured as overall reward, it would be best if agents only collected their own coins, to prevent imposing costs on the other agent. As a consequence, an agent has an incentive to pay the other agent for not collecting the paying agent's coins. Consider the situation when agent 1 (yellow) is about to collect the blue coin. This will bring agent 1 a reward of +1 and −2 for agent 2 (blue). Consequently, agent 2 would be willing to pay a price $p \leq 3$ to agent 1 in exchange for the coin.

Action spaces $A_i$ in the Coin Game have four actions: $A_i = \{\text{North}, \text{South}, \text{East}, \text{West}\}$. To reduce the trading options for agents at any step, we decided to define a single tradeable action StepBack, which is any action that increases the distance between an agent and the current coin. The trading decision an agent has to make is whether to offer another agent the fixed price $p$ in exchange for a StepBack action. That is, each agent $i$ chooses actions from $A_i \times \prod_{j \neq i} s_j$, where $A_i$ describes the original action space of agent $i$ and $s_j = \{0, 1\}$ describes the binary choice to trade with any other agent $j$.

Learning. Agents in the Coin Game were represented as deep Q-Networks (DQNs). During learning, exploration was encouraged by using a linear Boltzmann policy, defined by $\pi(s) = \arg\max_a (V_a)$, where $V_a$ is sampled from

$$V_a \sim \frac{\exp(q_t(a)/\tau)}{\sum_{i=1}^{n} \exp(q_t(i)/\tau)}$$

for each $a \in A$. All agents updated their policies from a stored batch of transitions $\{(s, a, r_i, s')_t : t = 1, \ldots, T\}$ [6]. For the Coin Game experiments, the batch size was limited to 50k transitions, where older transitions are discarded after inserting new transitions. The network was trained with the Adam optimization algorithm with a learning rate of 1e−3 [5]. Coin Game episodes lasted for 100 steps and after 25 episodes we logged 50 test episodes. The discount rate $\gamma$ was 0.99.

Modeling trade in the Coin Game required setting a couple of trading-related parameters. The first is the price $p$ for an action $a$. In our experiments, we set $p = 1.25$, as it exceeds an agent's profit from collecting a coin and is less than what the designated owner of the coin would lose if the other agent collected the coin. The second parameter of interest is the trading budget $m$, i.e., the available budget until the current coin is collected. We experimented with different budgets and chose $m$ to be 2.5, which allowed for a maximum of 2 trades when $p = 1.25$. A third critical question was whether agents should be allowed to accumulate wealth over steps or even episodes. Although this seems an interesting aspect, we decided not to let agents gather their earnings and leave the analysis of such a setting for future work.
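The StepBack action and the trading budget can be illustrated with a short Python sketch. It assumes Manhattan distance on an unbounded grid with (x, y) coordinates, which the text does not specify; `is_step_back` and `TradingBudget` are hypothetical names introduced only for this illustration.

```python
# Grid moves as (dx, dy); the coordinate convention is an assumption.
MOVES = {"North": (0, -1), "South": (0, 1), "East": (1, 0), "West": (-1, 0)}


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def is_step_back(agent_pos, coin_pos, action):
    """True if `action` increases the agent's distance to the current coin."""
    dx, dy = MOVES[action]
    new_pos = (agent_pos[0] + dx, agent_pos[1] + dy)
    return manhattan(new_pos, coin_pos) > manhattan(agent_pos, coin_pos)


class TradingBudget:
    """Budget m that is available until the current coin is collected."""

    def __init__(self, budget=2.5, price=1.25):
        self.budget = budget
        self.price = price
        self.remaining = budget

    def try_pay(self):
        """Pay the fixed price for one trade if the budget allows it."""
        if self.remaining >= self.price:
            self.remaining -= self.price
            return True
        return False

    def reset(self):
        """Called when the current coin has been collected."""
        self.remaining = self.budget
```

With m = 2.5 and p = 1.25, `try_pay` succeeds twice and then fails until `reset` is called, which corresponds to the maximum of 2 trades per coin stated above.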


[Figure 4 panels. Upper row (2 agents): (a) Rewards, (b) Share collected coins, (c) Number of trades. Lower row (4 agents): (d) Overall reward, (e) Share collected coins, (f) Number of trades.]

Fig. 4. Coin Game results for 2 agents (upper row) and 4 agents (lower row). Results comprise 1000 (2 agents) and 10000 (4 agents) episodes and show mean values and conﬁdence intervals from 80 runs for 2 agents and 10 runs for 4 agents. Each plot shows results for agents with action trading (green) and without trading (blue). Action traders show increasing individual and overall rewards (left column) along with an increasing share of correctly collected coins (middle column). The number of trades (third column) decreases after a steep rise during the early learning period (best viewed in color). (Color ﬁgure online)

Results. Figure 4 shows Coin Game results for 2 agents (upper row) and 4 agents (lower row), respectively. Experiments involve agents without trading (blue) and with trading (green) for 80 runs (2 agents) and 10 runs (4 agents), where runs last for 1000 episodes (2 agents) and 10000 episodes (4 agents). Shaded areas show 0.95 confidence intervals. The left column shows the overall reward and the individual rewards in the 2-agent setting. While non-trading agents' reward never increases, action traders manage to increase individual and overall reward. This comes from an increasing share of correctly collected coins (middle column). The number of trades sharply increases during the first 200 episodes and continuously decreases afterwards.

6 Discussion

Action trading in the iterated matrix game outperformed purely selfish agents. Nevertheless, prices for actions were given at design time, which raises the question of whether agents are able to find prices on their own. The results from the Coin Game clearly confirm that action trading effectively increases social welfare, measured through the overall increase of reward for all agents. It also shows that a given number of available resources (coins) is allocated more efficiently, as the proportion of correctly collected coins also


constantly increases. This is the consequence of agents' trading activity, which increases sharply at early learning phases and is kept at a high level afterwards. In learning to trade, agents realize Pareto improvements and empirically confirm the first fundamental theorem of welfare economics, according to which competitive markets will tend towards Pareto efficiency. From the experiments we realized that the trading budget is a critical parameter with respect to the problem of interest; its analysis will be left for future work. An interesting point seems to be the slow decrease in the number of trades. This might be caused by an agent speculating for short-term profits by not offering a trade in the hope that the other agent might perform the expected action anyway. This could cause distrust, which threatens future trades. We recognize that trading actions in MARL presumes that a trade can be controlled, i.e., agents cannot cheat on each other by making offers which they do not honor afterwards. While this seems like a strong assumption, it appears less restrictive from a practical point of view. The only extension with respect to the environment is that agents' rewards need to include the net earnings that were realized by their trading activity. That is, the environment adopts the role of a neutral auctioneer that matches supply and offer and returns the resulting rewards for each agent.

References

1. Foerster, J., Assael, Y.M., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137–2145 (2016)
2. Foerster, J.N., Chen, R.Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., Mordatch, I.: Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326 (2017)
3. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multiagent control using deep reinforcement learning. In: Proceedings of the Adaptive and Learning Agents Workshop (AAMAS 2017) (2017)
4. Hernandez-Leal, P., Kaisers, M., Baarslag, T., de Cote, E.M.: A survey of learning in multiagent environments: dealing with non-stationarity. arXiv preprint arXiv:1707.09183 (2017)
5. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
6. Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Wiering, M., van Otterlo, M. (eds.) Reinforcement Learning, pp. 45–73. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27645-3_2
7. Laurent, G.J., Matignon, L., Fort-Piat, L., et al.: The world of independent learners is not Markovian. Int. J. Knowl. Based Intell. Eng. Syst. 15(1), 55–64 (2011)
8. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473. International Foundation for Autonomous Agents and Multiagent Systems (2017)
9. Lerer, A., Peysakhovich, A.: Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 (2017)
10. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, vol. 157, pp. 157–163 (1994)
11. Malialis, K., Devlin, S., Kudenko, D.: Resource abstraction for reinforcement learning in multiagent congestion problems. In: Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, pp. 503–511. International Foundation for Autonomous Agents and Multiagent Systems (2016)
12. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
13. Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge (1994)
14. Panait, L., Luke, S.: Cooperative multi-agent learning: the state of the art. Auton. Agents Multi Agent Syst. 11(3), 387–434 (2005)
15. Perolat, J., Leibo, J.Z., Zambaldi, V., Beattie, C., Tuyls, K., Graepel, T.: A multi-agent reinforcement learning model of common-pool resource appropriation. arXiv preprint arXiv:1707.06600 (2017)
16. Peysakhovich, A., Lerer, A.: Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865 (2017)
17. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
18. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2008)
19. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
20. Tampuu, A., et al.: Multiagent cooperation and competition with deep reinforcement learning. PloS One 12(4), e0172395 (2017)
21. Tan, M.: Multi-agent reinforcement learning: independent vs. cooperative agents. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337 (1993)

Continuous-Time Spike-Based Reinforcement Learning for Working Memory Tasks

Marios Karamanis, Davide Zambrano, and Sander Bohté

CWI, Machine Learning Group, Amsterdam, The Netherlands
{marios,davide,sbohte}@cwi.nl

Abstract. As the brain purportedly employs on-policy reinforcement learning compatible with SARSA learning, and most interesting cognitive tasks require some form of memory while taking place in continuous-time, recent work has developed plausible reinforcement learning schemes that are compatible with these requirements. What is lacking is a formulation of both computation and learning in terms of spiking neurons. Such a formulation creates a closer mapping to biology and also expresses such learning in terms of asynchronous and sparse neural computation. We present a spiking neural network with memory that learns cognitive tasks in continuous time. Learning is implemented in a biologically plausible manner using the AuGMEnT framework, and we show how separate spiking forward and feedback networks suffice for learning the tasks just as fast as the analog CT-AuGMEnT counterpart, while computing efficiently using very few spikes: 1–20 Hz on average.

Keywords: Reinforcement learning · Working memory · Spiking neurons

1 Introduction

Reinforcement Learning [17] describes how animals can learn to act effectively given sparse and possibly delayed rewards from their environment. For many tasks, optimal action selection requires some form of memory: the shortest path to a parked car relies on remembering where the car was parked, and understanding text requires the integration of information over the length of the sentence, if not from earlier paragraphs. For event-based and discrete-time optimization problems, Reinforcement Learning has been used to successfully train deep [11,16] and recurrent neural networks [1]. For working memory tasks, [1] demonstrated that LSTMs can be trained with the RL Advantage Learning algorithm, but this type of "off-policy" RL based on error-backpropagation is considered biologically implausible given the preponderance of evidence for "on-policy" RL like SARSA [12]. How animals can learn such tasks with SARSA-like RL and neural network models has been the topic of much research in neuroscience, with implications also in fields like deep learning and neuromorphics.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 250–262, 2018. https://doi.org/10.1007/978-3-030-01421-6_25


Recent work [15,20] has suggested how working memory tasks can be learned in neural network models equipped with memory neurons, where memory neurons learn which stimuli need to be remembered for later use; learning is then made local and plausible using feedback connections [13]. While standard RL is formulated in an event-based manner, that is, framed in terms of state-changes, animals operate in a continuous-time setting, and Zambrano et al. showed in [20] that a continuous-time version of AuGMEnT (CT-AuGMEnT) can be realized using an action selection mechanism that integrates evidence, drawing inspiration from the brain's basal ganglia structures, combined with a separate feedback network for learning.

Missing so far is a model of biologically plausible RL based on spiking neurons: here we present such a model, and we show how learning can in fact be based on the (sparse) relative timing of spikes. We show how the CT-AuGMEnT framework can be extended to asynchronous and sparsely active spiking neural networks. Recent work has shown how spiking neurons can be used to compute convolutional neural networks [5,18] and compute control [7]; RL versions are lacking. We turn to adaptive spiking neurons [2] and develop two spike-based approaches. In the first, spikes carry approximations of both forward and feedback signals, and the CT-AuGMEnT-derived learning mechanism uses these signal approximations. In the second, we develop spike-triggered learning by exploiting the fact that the dynamics of the tasks are much slower than the timescale of timesteps in the simulation, so that CT-AuGMEnT weight updates can be approximated by sparse sampling of the learning components. Spike-triggered learning then uses the asynchronous nature of adaptive spike-coding, where changes in signals elicit more spikes in the network, and hence higher precision sampling.

We show how these approaches can be applied to two standard RL working memory tasks (T-Maze and Saccade-anti-Saccade), and find that networks trained with both spike-based learning methods successfully and efficiently learn the tasks. When using spike-based learning, we find that very low firing rates in the network suffice, where the spike-triggered learning approach requires only slightly higher firing rates, as can be expected since so very few learning events take place. Together, we demonstrate that spiking neural networks can learn cognitive tasks and are capable of online learning using sparse spike-triggered learning.

2 CT-AuGMEnT

In [14,15], AuGMEnT was developed as an artiﬁcial neural network (ANN) implementation of the on-policy SARSA reinforcement learning algorithm for solving Markov Decision Processes (MDPs) that require learnable working memory to construct Markov states in the hidden layer of the neural network model. AuGMEnT implements a biologically plausible local learning rule based on four factors: attentional feedback, forward activation, the local derivative of the transfer function, and a global neuromodulatory scalar value that signals the temporal diﬀerence error (TD-error) δ (Fig. 1a). This learning rule is local and enables the learning of XOR-like non-linear function mappings in multi-layer networks [15].


[Figure 1: network diagrams, panels (a) and (b). Layers from input to output: Sensory Layer (Instantaneous, On, and Off units), Association Layer (with working memory units, synaptic tags, and sTrace), Q Layer, and Z Layer (motor layer, with exploration extra-current). The most inhibited Z unit is the selected action. Panel (b) adds the feedback network.]

Fig. 1. (a) The CT-AuGMEnT architecture. The feedforward layers include memory units in the association layer (diamonds) to compute Q-values in the Q-layer. The Q-values are integrated in the action-selecting Z-layer, where the most inhibited action is selected at any point in time. Feedback from the (sole) selected action induces tags and traces on the synapses, which in combination with the TD-error (δ) determine changes in synaptic weights. (b) In continuous-time, the feedback activity from the selected action is carried by a separate feedback network with its own weights (orange network). (Color figure online)

In [19,20] the CT-AuGMEnT framework was developed as an extension of AuGMEnT to include a realistic notion of continuous-time, introducing a dynamic action selection system and demonstrating an explicit feedback network with layer-wise delays and separately learned feedforward and feedback weights. The inclusion of an action selection system decouples the typical timescale of actions from the time resolution of the simulation, allowing for continuous-time on-policy SARSA learning. The resulting network is depicted in Fig. 1b.

As described in [20], the CT-AuGMEnT network comprises four layers (Fig. 1a, b): a sensory input layer, a hidden "association" layer, a Q-layer, and an action layer Z. In the sensory layer, instantaneous units directly represent the stimulus intensity $x(t)$, and transient "on/off" units represent positive and negative changes in stimulus intensity, $x^+(t)$ ("on") and $x^-(t)$ ("off"):

$$x^+(t) = \frac{1}{dt}\,[x(t) - x(t - dt)]^+, \qquad x^-(t) = \frac{1}{dt}\,[x(t - dt) - x(t)]^+, \tag{1}$$

where $[\cdot]^+$ is a thresholding operation returning 0 for negative inputs. The hidden layer is comprised of regular units and memory units, where the instantaneous units $i$ connect to the regular units $j$ via connections $v_{ij}^R$ and the transient units $l$ connect to the memory units $m$ via connections $v_{lm}^M$. Activations are then computed as:

$$a_j^R(t) = \sum_i v_{ij}^R \, x_i(t), \qquad y_j^R(t) = f(a_j^R(t)), \tag{2}$$

$$a_m^M(t) = a_m^M(t - dt) + \sum_l v_{lm}^M \, x_l(t), \qquad y_m^M(t) = f(a_m^M(t)), \tag{3}$$

where $f(\cdot)$ denotes the neuron's transfer function, here the standard sigmoid transfer function; for brevity of notation, $x_l(t) = [x^+(t) \; x^-(t)]$. The third layer


is connected to the association layer via connections $w_{mk}^M$ and $w_{jk}^R$, computing Q-values for every possible action $k$ in the current state $s$, $q_k(t)$:

$$q_k(t) = \sum_m w_{mk}^M \, y_m^M(t) + \sum_j w_{jk}^R \, y_j^R(t). \tag{4}$$
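One time step of the feedforward pass in Eqs. (1)–(4) can be sketched in plain Python as follows. The list-of-lists weight layout (rows indexed by the presynaptic unit) and the function names are assumptions made for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def on_off(x_now, x_prev, dt):
    """Transient units of Eq. (1): rectified rates of change of the input."""
    on = [max(0.0, n - p) / dt for n, p in zip(x_now, x_prev)]
    off = [max(0.0, p - n) / dt for n, p in zip(x_now, x_prev)]
    return on, off


def forward(x_now, x_prev, a_mem, v_reg, v_mem, w_reg, w_mem, dt):
    """Return (Q-values, new memory activations) for one time step."""
    on, off = on_off(x_now, x_prev, dt)
    x_trans = on + off  # x_l = [x+ x-]
    n_reg, n_mem, n_act = len(v_reg[0]), len(v_mem[0]), len(w_reg[0])
    # Eq. (2): regular units driven by the instantaneous units
    a_reg = [sum(v_reg[i][j] * x_now[i] for i in range(len(x_now)))
             for j in range(n_reg)]
    y_reg = [sigmoid(a) for a in a_reg]
    # Eq. (3): memory units integrate the transient input over time
    a_mem_new = [a_mem[m] + sum(v_mem[l][m] * x_trans[l]
                                for l in range(len(x_trans)))
                 for m in range(n_mem)]
    y_mem = [sigmoid(a) for a in a_mem_new]
    # Eq. (4): Q-values for every action k
    q = [sum(w_mem[m][k] * y_mem[m] for m in range(n_mem))
         + sum(w_reg[j][k] * y_reg[j] for j in range(n_reg))
         for k in range(n_act)]
    return q, a_mem_new
```

Because the memory activation is carried over from the previous step, a stimulus onset keeps influencing the Q-values after the input has stopped changing, which is the working-memory property the architecture relies on.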

The Z-layer, modeled after action selection in the basal ganglia [8], implements an action-selection model based on competition between possible actions by connecting the Z-layer to the Q-layer with off-center on-surround connectivity: each q-unit inhibits its corresponding Z-unit and excites all other Z-neurons (Fig. 1a, top). The input to a Z-layer unit $u_i$ is thus:

$$u_i(t) = -w^- q_i(t) + w^+ \sum_{j \neq i}^{n} q_j(t), \tag{5}$$

where we set $w^-/w^+ = \nu$, with $\nu$ the number of possible actions in the task; the activation of the Z units can then be modeled as a leaky integrator:

$$\dot{a}_i(t) = -\rho\,(a_i(t) - u_i(t)), \tag{6}$$

where $\rho$ is a rate constant that determines how fast equilibrium is reached. The Z-layer output $y_i(t)$ is bounded using the sigmoid activation function:

$$y_i(t) = \sigma(a_i(t)). \tag{7}$$
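A minimal Python sketch of Eqs. (5)–(7) follows, using forward-Euler integration and Q-values held constant during integration. The values of `w_plus`, `rho`, `dt`, and `steps` are illustrative assumptions, as is the choice to return the unit with the lowest output as the most inhibited one.

```python
import math


def z_input(q, w_minus, w_plus):
    """Eq. (5): each q-unit inhibits its own Z-unit and excites the others."""
    total = sum(q)
    return [-w_minus * q_i + w_plus * (total - q_i) for q_i in q]


def select_action(q, w_plus=1.0, rho=10.0, dt=0.01, steps=200):
    """Integrate Eq. (6) with Euler steps and return (selected action, outputs)."""
    nu = len(q)
    w_minus = nu * w_plus  # w- / w+ = nu
    u = z_input(q, w_minus, w_plus)
    a = [0.0] * nu
    for _ in range(steps):
        a = [a_i + dt * (-rho * (a_i - u_i)) for a_i, u_i in zip(a, u)]
    y = [1.0 / (1.0 + math.exp(-a_i)) for a_i in a]  # Eq. (7)
    # The selected action is the most inhibited unit, i.e. the lowest output.
    return min(range(nu), key=y.__getitem__), y
```

With positive weights the unit with the largest Q-value receives the smallest input, so in this sketch the most inhibited unit coincides with the greedy action.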

The Q-layer thus determines the degree of inhibition in the Z-layer, where, somewhat counterintuitively, the selected action is the one that receives the most inhibition. Exploration is implemented as the addition of an external current to the explorative action unit in Eq. (5) [20].

Learning: In the CT-AuGMEnT network, network plasticity is modulated by two factors: a global neuromodulatory signal and an attentional feedback signal. At every time-step, the Z-unit corresponding to the winning action $a$ creates synaptic tags (equivalent to eligibility traces) by sending feedback activity to earlier processing levels. Tags in the Q-layer decay and are updated as:

$$Tag_{jk}(t + dt) = -\frac{1}{\phi}\, Tag_{jk}(t) + dt\,[y_j(t)\, z_k(t)], \tag{8}$$

with $z_k = 1$ for the selected action and $z_k = 0$ for the other actions. The association units that provided strong input to the winning action $a$ thus also receive the strongest feedback. Tags, mimicking eligibility traces, on connections between regular units and instantaneous units are equivalently computed as:

$$Tag_{ij}(t + dt) = -\frac{1}{\phi}\, Tag_{ij}(t) + dt\,[x_i(t)\, f'(a_j^R(t))\, w_{kj}^R], \tag{9}$$

where $f'(\cdot)$ denotes the local derivative of the transfer function $f$, and the feedforward connections $w_{jk}^R$ and the feedback connections $w_{kj}^R$ may have different


Fig. 2. ASN-based neural coding. Input spikes (red ticks) induce a smoothed activation $S(t)$ in the post-synaptic neurons. The neuron emits spikes (blue ticks) when the input activation exceeds a variable threshold $\theta(t)$, and a refractory response scaled by the momentary adaptation is subtracted from the activation at the time of spiking. The resulting total refractory response $\hat{S}(t)$ approximates the rectified activation $[S(t)]^+$. At the next target neuron, the emitted spike-train induces an (unweighted) activation $y(t)$; the transfer function (inset) describes the average relationship between the activation $S(t)$ and the target activation $y(t)$. (Color figure online)

strength [13]. Synaptic traces between sensory units $l$ and memory cells $m$ enable the proper learning of working memory:

$$sTrace_{lm}(t + dt) = sTrace_{lm}(t) + dt\,[x_l(t)],$$
$$Tag_{lm}(t + dt) = -\frac{1}{\phi}\, Tag_{lm}(t) + dt\,[sTrace_{lm}(t)\, f'(a_m^M(t))\, w_{km}^M]. \tag{10}$$

To implement on-policy SARSA temporal difference (TD) learning [17], the predicted outcome $q_a(T - 1)$ is compared to the sum of the reward $r(t)$ and the discounted action-value $q_a(T)$ of the unit $a$ that wins the competition at time $T$, resulting in a TD error $\delta(T) = r + \gamma q_a(T) - q_a(T - 1)$. For continuous-time TD learning, [20] gives the following TD error:

$$\delta(t) = r(t) + \frac{1}{dt}\left[\left(1 - \frac{dt}{\tau}\right) q_a(t) - q_a(t - dt)\right]. \tag{11}$$

With learning rate $\beta$, weight updates are then defined as:

$$v_{ij}(t + dt) = v_{ij}(t) + dt\,[\beta\,\delta(t)\,Tag_{ij}(t)],$$
$$v_{lm}(t + dt) = v_{lm}(t) + dt\,[\beta\,\delta(t)\,Tag_{lm}(t)],$$
$$w_{jk}(t + dt) = w_{jk}(t) + dt\,[\beta\,\delta(t)\,Tag_{jk}(t)]. \tag{12}$$
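The TD error and the tag-gated weight update can be sketched in Python as below. The sketch assumes that the factor 1/dt in the continuous-time TD error applies to the value difference only, and it takes the tags as given inputs; the function names are illustrative.

```python
def td_error(r, q_now, q_prev, dt, tau):
    """Continuous-time TD error in the form of Eq. (11), as read here."""
    return r + ((1.0 - dt / tau) * q_now - q_prev) / dt


def update_weights(weights, tags, delta, beta, dt):
    """Eq. (12): every weight moves along its tag, scaled by beta * delta * dt."""
    return [[w + dt * beta * delta * tag for w, tag in zip(w_row, tag_row)]
            for w_row, tag_row in zip(weights, tags)]


# Example with the T-Maze values for dt and tau quoted later in the text.
delta = td_error(r=1.0, q_now=0.5, q_prev=0.5, dt=0.01, tau=0.5)
new_w = update_weights([[0.0, 1.0]], [[1.0, 0.0]], delta=2.0, beta=0.5, dt=0.1)
```

Synapses with a zero tag are left unchanged, so the global TD error only affects connections that recently contributed to the selected action.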

3 Adaptive Spiking Neurons

Adaptive Spiking Neurons (ASNs) [2] are a variant of standard Leaky-Integrate-and-Fire spiking neurons incorporating a fast multiplicative adaptation mechanism, where the fast adaptation limits the neuron's asymptotic firing rate. The ASN includes spike-triggered adaptation and a dynamical threshold that allows it to match neural responses while maintaining a high coding efficiency. As illustrated in Fig. 2, adaptive spike-based neural coding is described as a Spike Response Model (SRM) [6], where the input to a neuron $j$ is computed as a sum of spike-triggered post-synaptic currents (PSCs) from pre-synaptic input neurons $i$. The total PSC, $I(t)$, is computed as a sum over spike-triggered (normalized) kernels $\kappa(t_s^i - t)$, each weighted by synaptic efficacies $w_{ij}$:

$$I(t) = \sum_i \sum_{t_s^i} w_{ij}\, \kappa(t_s^i - t), \tag{13}$$

where $t_s^i$ denotes the timing of spikes from input neuron $i$. A normalized exponential filter $\phi(t)$ is applied to $I(t)$ to obtain the neuron's activation $S(t)$:

$$S(t) = (\phi * I)(t). \tag{14}$$

In the SRM formulation [2], the membrane potential of the neuron is obtained as the neuron's activation $S(t)$ from which the total refractory response $\hat{S}(t)$ is subtracted, where $\hat{S}(t)$ is computed as the sum of spike-triggered refractory response kernels $\eta(t)$, each scaled by the (variable) value of the neuron's threshold at the time of spiking ($\theta(t_j)$); $\hat{S}(t)$ then approximates the rectified $S(t)$: $[S(t)]^+$. A spike is emitted by neuron $j$ at time $t$ whenever $S(t) - \hat{S}(t) > \theta(t)$, and the membrane potential is reset by subtracting a scaled refractory kernel $\eta(t)$, which is then added to the total refractory response $\hat{S}(t)$. Spike-triggered adaptation is incorporated into the model by multiplicatively increasing the variable threshold $\theta(t)$ with a decaying kernel $\gamma(t)$ at the time of spiking, and by controlling the speed of the firing rate adaptation using the multiplicative parameter $m_f$:

$$\theta(t) = \theta_0 + \sum_{t_s} m_f\, \theta(t_s)\, \gamma(t_s - t), \qquad \hat{S}(t) = \sum_{t_s} \theta(t_s)\, \eta(t_s - t). \tag{15}$$

We set the PSC kernel as equal to the refractory response kernel $\eta(t)$, and model this kernel and the threshold kernel $\gamma(t)$ as decaying exponentials with corresponding time-constants $\tau_\eta$, $\tau_\gamma$, as is the membrane filter $\phi(t)$ ($\tau_\phi$):

$$\kappa(t) = \eta(t) = \exp\left(\frac{t_s - t}{\tau_\eta}\right), \tag{16}$$

$$\gamma(t) = \exp\left(\frac{t_s - t}{\tau_\gamma}\right), \qquad \phi(t) = \phi_0 \exp\left(\frac{t_s - t}{\tau_\phi}\right), \tag{17}$$

where the timing of outgoing spikes is denoted by $t_s$ and $\theta_0$ is the resting threshold.
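The spiking and adaptation dynamics can be illustrated with a discrete-time Python sketch for a constant activation S. Because the kernels are exponentials, the sums over past spikes are updated recursively. The values chosen for `m_f`, `tau_eta`, `tau_gamma`, and the 1 ms step are assumptions made for illustration only, and at most one spike per time step is emitted.

```python
import math


def simulate_asn(s, steps=500, dt=1.0, theta0=0.1, m_f=0.1,
                 tau_eta=50.0, tau_gamma=50.0):
    """Simulate one ASN driven by a constant activation `s` (times in ms).

    Returns the spike times and the trace of the refractory response S_hat.
    """
    s_hat = 0.0   # total refractory response
    adapt = 0.0   # adaptive part of the threshold, theta = theta0 + adapt
    decay_eta = math.exp(-dt / tau_eta)
    decay_gamma = math.exp(-dt / tau_gamma)
    spikes, s_hat_trace = [], []
    for n in range(steps):
        s_hat *= decay_eta        # exponential refractory kernel
        adapt *= decay_gamma      # exponential threshold kernel
        theta = theta0 + adapt
        if s - s_hat > theta:     # spike condition
            spikes.append(n * dt)
            s_hat += theta            # refractory kernel scaled by threshold
            adapt += m_f * theta      # multiplicative threshold adaptation
        s_hat_trace.append(s_hat)
    return spikes, s_hat_trace


spikes, trace = simulate_asn(1.0)
print(len(spikes), round(trace[-1], 3))
```

In this sketch the refractory response stays below the constant activation and tracks it from below, which illustrates how the emitted spikes encode an approximation of the rectified activation.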


Given a fixed input current $I(t)$ resulting in a fixed activation $S(t)$, the emitted spike-train from the post-synaptic neuron has an (unweighted) fixed-size impact $y(t)$ on the next target neuron. We characterize the relationship between activation $S(t)$ and target impact $y(t)$ as the effective ASN transfer-function (inset of Fig. 2); this function has a half-sigmoid-like shape and can be either computed analytically for particular parameter choices (e.g., [18]) or approximated. For the analog spike-like network in Sect. 4, we approximate the shape of this transfer-function with the positive rectified tanh() function: tanhP().
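As a small illustration, the positive rectified tanh can be written in Python as follows; the name `tanh_p` is chosen here for the sketch.

```python
import math


def tanh_p(x):
    """Positive rectified tanh: zero for negative inputs, tanh(x) otherwise."""
    return max(0.0, math.tanh(x))


print([round(tanh_p(x), 3) for x in (-1.0, 0.0, 0.5, 2.0)])
```

The function is zero for negative activations and saturates towards 1 for large positive ones, which gives the half-sigmoid-like shape described above.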

[Figure 3: panels (a) and (b). Panel (a) shows the Sensory Layer (Instantaneous, On, and Off units), Association Layer, Q Layer, and Z Layer.]

Fig. 3. (a) Spiking CT-AuGMEnT. Indicated by the half-sigmoid graphs are the neurons that are set to have tanhP() as transfer functions (in the analog rectified network), which are substituted by ASN neurons in the spiking network versions. Ticks along network connections indicate which part of the network "spikes". (b) Spike-based and spike-triggered learning: spike-based learning uses the analog global $\delta$ and local $y'(t)$ signals and those derived from feedforward spikes, $x(t)$, and feedback spikes, $z(t)$; spike-triggered learning considers those signals only at spike times $t_{s,n}$.

4 Spike-Based CT-AuGMEnT

Analog Rectified CT-AuGMEnT. To convert the CT-AuGMEnT network to a spiking neural network, we replace the analog neurons by ASN models. The main obstacle here is that ASNs effectively have a rectified half-sigmoid-like transfer function, as illustrated in Fig. 2. The CT-AuGMEnT network uses sigmoidal transfer-functions for the feedforward stage, and linear neurons for the Q-layer and the feedback network [20]. While, for instance, [10,13] suggest that there is some flexibility with regard to the feedback network, we create an analog network where the neurons in the feedforward Sensory and Association layers use the tanhP() transfer-function, as does the feedback network from the Q-layer projecting to the Association layer (illustrated in Fig. 3a). We train this network on the tasks to ascertain the feasibility of training spike-based networks with rectified half-sigmoid-like transfer functions.


Spike-Based Learning. Spiking-AuGMEnT incorporates ASNs in the feedback-learning network to include spike-based learning. Inspecting the learning rules (8)–(12), we see that four terms are involved in updating a synapse between a neuron $i$ and $j$: the feedforward activation $x_i(t)$, the TD-error $\delta(t)$, the gradient of the transfer function $f'(a_i(t))$, and, for the hidden layer neurons $j$, the feedback activity from the winning action $k$, $z_k(t)$. In the spiking-AuGMEnT formulation, we use ASNs in both the forward and the feedback network, also while training the network. The feedforward and feedback activations $x_i(t)$ and $z_k(t)$ are both computed as a sum of spike-triggered kernels, corresponding to $S(t)$ in the ASN model. Reformulating CT-AuGMEnT, we denote the spiking neurons of spiking-AuGMEnT with $s$ and we use the same subscripts as in the analog CT-AuGMEnT. Instantaneous and transient units emit spikes to the regular and memory spiking neurons, respectively:

$$a_{\phi j}^R(t_s) = \sum_{t_s} \sum_i v_{ij}^R\, x_i(t_s) * \phi(t_s), \qquad s_j^R(t_s) = f(a_{\phi j}^R(t_s)), \tag{18}$$

$$a_{\phi m}^M(t_s) = a_m^M(t_s - dt) + \sum_{t_s} \sum_l v_{lm}^M\, x_l(t_s) * \phi(t_s), \qquad s_m^M(t_s) = f(a_{\phi m}^M(t_s)), \tag{19}$$

where $t_s$ is the time of outgoing spikes, $f$ is the effective transfer function and $\phi(t)$ an exponential decay filter. As before, the Q-layer is fully connected to the association layer and the values are updated when there are input spikes:

$$q_k(t_s) = \sum_{t_s} \left[ \sum_m w_{mk}^M\, \sigma_m^M(t_s) + \sum_j w_{jk}^R\, \sigma_j^R(t_s) \right]. \tag{20}$$

Equivalently to the analog network, the Z-layer involves the action mechanism and determines the amount of inhibition an action receives. Note that now the transfer-function is implicit. The spiking neurons in the feedback network are defined as:

$$a_{\phi k}^Z(t_s) = \sum_{t_s} \sum_k z_k(t_s) * \phi(t_s), \tag{21}$$

$$s_{kj}^R(t_s) = f\left( \sum_{t_s} \sum_k w_{kj}^R(t_s)\, a_{\phi k}^Z(t_s) \right), \qquad s_{kj}^M(t_s) = f\left( \sum_{t_s} \sum_k w_{kj}^M(t_s)\, a_{\phi k}^Z(t_s) \right). \tag{22}$$

Equations (8)–(10) and (12) are reformulated accordingly, where we approximate the local gradient of the transfer-function as the derivative of the positive part of the tanh-function: $\tanh P' = \max(0, 1 - \tanh^2)$. While this is a rough approximation, we find it works well in practice. Tags between the association layer and the Q-layer are then defined as:

$$Tag_{jk}(t + dt) = -\frac{1}{\phi}\, Tag_{jk}(t) + dt\,[y_j(t)\, a_{\phi k}^Z(t_s)]. \tag{23}$$


For tags that are formed between the sensory layer and the association layer:

$$Tag_{ij}(t + dt) = -\frac{1}{\phi}\, Tag_{ij}(t) + dt\,[x_i(t_s)\, \tanh P'(a_{\phi j}^R(t_s))\, s_{kj}^R(t_s)], \tag{24}$$

$$sTrace_{lm}(t + dt) = sTrace_{lm}(t) + dt\,[x_l(t_s)],$$
$$Tag_{lm}(t + dt) = -\frac{1}{\phi}\, Tag_{lm}(t) + dt\,[sTrace_{lm}(t)\, \tanh P'(a_{\phi m}^M(t_s))\, s_{kj}(t)]. \tag{25}$$

In the spike-based learning process the weights are again updated by (12), where the TD-error $\delta(t)$ is still an analog broadcast signal. In both tasks the initial weights are positive and uniformly distributed, motivated by the rectified-positive nature of the spike-based feedback network (22) (Fig. 3a).

[Figure 4. Panel (a): T-Maze with agent, road sign, corridor, T-junction and goal. Panel (b): Saccade/Antisaccade trial timeline with the stages Empty (1,000 ms), Fixation (2,000 ms), Cue presentation (1,000 ms), Delay (1,000 ms) and Go, for the conditions Pro L, Pro R, Anti L and Anti R.]

Fig. 4. Tasks. (a) T-Maze task, (b) Saccade-anti-Saccade task. See text for explanation.

Spike-Triggered Learning. In the spiking-AuGMenT formulation, each weight is updated every dt, even though the tasks evolve on substantially longer timescales (milliseconds versus hundreds of milliseconds): a sparser sampling approach to learning should suffice. Rather than fixed-interval learning, we here propose to exploit the asynchronous nature of adaptive spike-coding: we only update the weights when a neuron receives or emits a spike (illustrated in Fig. 3b). The benefit of this sampling scheme is that with adaptive neural coding, the spike-rate increases when there is a large change in signal, thus allowing for more frequent and more precise sampling when needed. In more detail, whenever a neuron emits a spike we update the weights; otherwise the learning process pauses. Here, we denote with n the number of the current learning update. Hence, the rule for the update of the weights is:

\[ \begin{aligned} v_{ij}(t_{s,n+1}) &= v_{ij}(t_{s,n}) + \delta t\,\big[\beta\,\delta(t)\, Tag_{ij}(t_{s,n})\big], \\ v_{lm}(t_{s,n+1}) &= v_{lm}(t_{s,n}) + \delta t\,\big[\beta\,\delta(t)\, Tag_{lm}(t_{s,n})\big], \\ w_{jk}(t_{s,n+1}) &= w_{jk}(t_{s,n}) + \delta t\,\big[\beta\,\delta(t)\, Tag_{jk}(t_{s,n})\big], \end{aligned} \tag{26} \]


where δt equals the time between two successive spikes: δt = ts,n+1 − ts,n (note that here each neuron updates only for its “own” spikes ts,n ).
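The update rule (26) can be illustrated with a short sketch. It is only a minimal reading of the rule, not the authors' implementation: the array layout (presynaptic rows, postsynaptic columns), the function name `spike_triggered_update` and the bookkeeping array `last_spike` are assumptions made for illustration.

```python
import numpy as np

def spike_triggered_update(v, tag, td_error, t_now, last_spike, spiked, beta=0.02):
    """One spike-triggered weight update in the spirit of Eq. (26).

    v, tag     : arrays of shape (n_pre, n_post), weights and their tags
    td_error   : scalar TD-error delta(t), broadcast to all synapses
    t_now      : current time (same unit as last_spike)
    last_spike : shape (n_post,), time of each neuron's previous update
    spiked     : boolean mask of shape (n_post,), neurons spiking at t_now
    Layout and names are illustrative assumptions, not taken from the paper.
    """
    v = v.copy()
    last_spike = last_spike.copy()
    # delta_t is the time since the neuron's own previous spike
    delta_t = t_now - last_spike[spiked]
    # only the columns of neurons that spiked are updated; the rest pause
    v[:, spiked] += delta_t * beta * td_error * tag[:, spiked]
    last_spike[spiked] = t_now
    return v, last_spike
```

Neurons that do not spike keep both their weights and their `last_spike` entry, so a long silent period is accounted for by a correspondingly larger δt at the next spike.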

5

Results

We demonstrate the spike-based CT-AuGMenT model of Fig. 1 on two working memory tasks: the T-Maze task from the machine learning literature [1,14] and the Saccade/Antisaccade task from the neuroscience literature (both as in [20]).

The T-Maze task is a working memory task where information that is presented at the start of the maze has to be maintained to make optimal decisions at the end of the corridor. The agent can choose actions to move in directions N, E, S, W; the corridor length N scales the task difficulty. The same details for corridor representation, reward and time-out conditions as in [20] were applied. For the simulations, we gave each network at most 10,000 trials to learn the task. Convergence was determined by checking for 90% optimal choices, as in [20], for each condition. The parameters of the network for the T-Maze task are: β = 0.02, λ = 0.3, γ = 0.9, ε = 0.025, τ = 0.5 and corridor length N = 10. The ASNs use fixed values for θ0 = 0.1 and τφ = 2.5 ms. The network is updated at time increments of dt = 0.01, equivalent to 10 ms. The network consists of 24 neurons: a sensory layer with 9 input neurons (3 instantaneous and 6 transient units), an association layer with 7 neurons (4 memory neurons and 3 regular neurons), and, matching the number of possible actions, an output layer and an action layer with 4 neurons each. Weights between the sensory, association and Q-layers are randomly initialized from the uniform distribution U[0, 0.25].

In the Saccade/Antisaccade (SaS) task, the agent has to learn that the color of the fixation mark determines the strategy. Every trial started with an empty screen, shown for one second. Then a fixation mark was shown, either black or white, indicating that a pro- or anti-saccade was required. The model had to fixate within ten seconds, otherwise the trial was terminated without reward. If the model fixated for two consecutive seconds, we presented a cue on the left or the right side of the screen for one second and gave the fixation reward r_fix. This was followed by a memory delay of two seconds during which only the fixation point was visible. At the end of the memory delay the fixation mark turned off. To collect the final reward r_fin, the model had to make an eye movement to the remembered location of the cue in the pro-saccade condition and to the opposite location on anti-saccade trials. The trial was aborted if the model failed to respond within eight seconds. The maximum number of trials the model is allowed for learning the task is set to 35,000. As in the implementation in [20], we kept the same temporal sequence of events, and we updated the network at an increased rate of dt = 0.01 (corresponding to 10 ms per time step). The chosen parameters for the simulation are: β = 0.01, λ = 0.2, γ = 0.9, ε = 0.025, τ = 0.5, θ0 = 0.1 and τφ = 2.5 ms. The weights are again initialized from the uniform distribution U[0, 0.25]. In this task the network comprises 26 neurons, with 12 neurons in the sensory layer (4 instantaneous and 8 transient units), 8 neurons in the association layer (4 memory and 4 regular units), and 3 neurons each in the output and action layers.


Fig. 5. Top row: the convergence rate over the average firing rate (Hz) for the two tasks. In the T-Maze task we used τγ = [50, 150, 450, 1000, 1750] ms and τη = [150, 450, 1000, 1750, 2500] ms. In the SaS task we have τγ = 50 ms fixed and τη = [100, 150, 200, 250, 300] ms. Bottom row: the average number of trials for each model and task, for the spiking networks that match the analog network's convergence rate.

In both tasks, the spiking neuron time-constants τγ, τη are varied to generate spiking neurons with varying asymptotic activation rates. We tested 50 randomly initialized networks for each set of τη and τγ. At the end of each learning phase we set β = ε = 0 to validate the convergence. We plot the results for both tasks in Fig. 5, both in terms of the convergence rate of the networks (top row) and the number of trials required for learning the tasks (bottom row). We find that both spiking methods, spike-based and spike-triggered CT-AuGMenT, are able to learn the tasks with convergence rates similar to those of CT-AuGMenT [19,20] and the analog rectified version (dashed line) for sufficiently high firing rates. We also compare the average number of trials needed for those spiking networks where the convergence rate matches the analog network (bottom row): we find that for all three learning models, the networks need a similar number of trials to converge. We note also that for both tasks, a majority of networks still converge even for very low average firing rates ( [...]

[...] 0, for all non-empty $T \subseteq D$.
2. $w_D(i) > 0$, for all $i \in D$.
3. $w_{D \cup \{i\}}(i) < 0$, for all $i \notin D$.

Here we observe that a dominant set is a maximal, internally coherent subset. The three conditions guarantee that a dominant set has large internal similarities and small external similarities, and enable a dominant set to be regarded as a cluster. The work in [15] detects a dominant set with the replicator dynamics. Specifically, with $x \in \mathbb{R}^n$ denoting the weight vector of the n data, the weights are updated iteratively by

\[ x_i^{[k+1]} = x_i^{[k]}\, \frac{(A x^{[k]})_i}{x^{[k]\top} A\, x^{[k]}}. \tag{5} \]

At convergence, the data whose weights are larger than a threshold form a dominant set. In addition, it is shown that the weight of data i equals $\frac{w_D(i)}{W(D)}$, reflecting the relationship of i with the other data in D. Specifically, a large weight indicates that the data has a large similarity with the other data.
Furthermore, [2] proposes the infection and immunization dynamics (InImDyn) to improve the computation efficiency. The InImDyn calculates data weights by

\[ x^{(t+1)} = \theta_{F(x^{(t)})}\big(x^{(t)}\big)\,\big[ F(x^{(t)}) - x^{(t)} \big] + x^{(t)}, \tag{6} \]

A Target Dominant Sets Clustering Algorithm


where F is used to calculate the most infective strategy y for x, and θ represents the minimum share of y to make (1 − θ)x + θy immune to y. For space reasons, the details of this dynamics are omitted in this paper. After one dominant set (cluster) is obtained, the included data are removed. Then the next cluster is detected in the remaining unclustered data. By repeating this process we are able to accomplish the clustering.

[Figure 1. Panel (a), Clustering result: NMI versus σ. Panel (b), Number of Clusters: number of clusters (logarithmic axis) versus σ. Curves are shown for the Aggregation, Compound, Pathbased, D31, R15, Flame, Wine and Iris datasets.]

Fig. 1. DSets clustering results and obtained number of clusters with diﬀerent σ’s.
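The replicator dynamics of Eq. (5) and the peel-off procedure described above are compact enough to sketch. The following is a minimal illustration rather than the implementation used in the experiments: the uniform starting point, the stopping tolerance `tol` and the function name are assumptions.

```python
import numpy as np

def replicator_dynamics(A, max_iter=1000, tol=1e-8):
    """Iterate Eq. (5) on a similarity matrix A with zero diagonal.

    Starts from the barycenter of the simplex (an assumed initialization)
    and stops when the L1 change between iterates falls below tol.
    """
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        Ax = A @ x
        denom = x @ Ax              # x^T A x
        if denom <= 0:              # degenerate case: no similarity left
            break
        x_new = x * Ax / denom      # component-wise update of Eq. (5)
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return x
```

Data whose final weight exceeds a small threshold would then be taken as one dominant set, removed, and the iteration repeated on the remaining data.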

3

Our Approach

The DSets algorithm has some interesting properties and has been applied in diverse tasks. However, with data represented as feature vectors, the parameter σ affects the clustering results significantly. In this section we discuss this problem and present target dominant sets clustering as a solution.

3.1

Problems

The DSets algorithm detects clusters based only on the pairwise similarity matrix. With data represented as feature vectors, we use s(x, y) = exp(−d(x, y)/σ) to estimate data similarity, which introduces the parameter σ. We study the impact of σ on clustering results below. The eight datasets used in the experiments are Aggregation [8], Compound [20], Pathbased [4], D31 [18], R15 [18], Flame [7], and the Wine and Iris datasets from the UCI machine learning repository. The parameter σ is tested with the values σ = αd, where d denotes the mean of pairwise distances and α takes the values 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 30, · · · , 100. We use NMI (Normalized Mutual Information) to evaluate the clustering results, which are reported in Fig. 1(a). Evidently, the clustering results on all eight datasets are influenced significantly by σ. To find the reason for this influence, we further look at the numbers of clusters obtained with different σ's. As shown in Fig. 1(b), the numbers of clusters decrease or remain unchanged as σ increases.
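To make the role of σ concrete, the similarity matrix can be built as in the sketch below. It assumes Euclidean distance, a mean taken over distinct pairs, and a zero diagonal; none of these choices is fixed by the text above, and the function name is hypothetical.

```python
import numpy as np

def similarity_matrix(X, alpha):
    """Build s(x, y) = exp(-d(x, y) / sigma) with sigma = alpha * mean distance.

    X     : array of shape (n, n_features)
    alpha : multiplier on the mean pairwise distance
    Euclidean d, mean over distinct pairs, and zero self-similarity are
    assumptions made for this illustration.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n = len(X)
    mean_dist = dist.sum() / (n * (n - 1))   # diagonal contributes zero
    sigma = alpha * mean_dist
    A = np.exp(-dist / sigma)
    np.fill_diagonal(A, 0.0)
    return A, sigma
```

A small α pushes most off-diagonal entries towards zero, so each point is similar only to its closest neighbors, while a large α pushes all entries towards one. This is the mechanism behind the trend in Fig. 1(b).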


J. Hou et al.

This means that σ influences the number of clusters, the cluster sizes and, in turn, the clustering results. We discuss the observations from Fig. 1 as follows. By definition, a dominant set is a maximal, internally coherent subset, and only data with large internal similarities can be grouped into one dominant set. With a small σ, s(x, y) = exp(−d(x, y)/σ) generates small similarity values, and each data point has large similarities with only a limited number of nearest neighbors, resulting in many small clusters. With a large σ, the similarity values are large and we obtain correspondingly large clusters. This explains why the numbers of clusters decrease or remain unchanged as σ increases in Fig. 1(b). As the cluster sizes grow with σ, the clusters are smaller than the real ones at first, and gradually become larger than the real ones. Consequently, increasing σ improves the clustering results at first, and then degrades them when σ is too large, as shown in Fig. 1(a). The experiments above imply that σ needs to be tuned to obtain the desired clustering result and number of clusters. If the number of clusters is specified beforehand and we also intend to make use of the special properties of dominant sets, we have to try different σ's to obtain the specified number of clusters. In general, this means a large computation load.

3.2

Target Dominant Set Extraction

In order to generate a specified number of clusters with the DSets algorithm efficiently, we present a target dominant sets algorithm. We detect cluster centers and treat them as seeds in the first step, and then extract clusters containing the seeds. As each extracted cluster contains a specified data point, i.e., the seed, we call the obtained cluster (dominant set) a target cluster (dominant set). These two steps are described in detail below. The first step is to detect cluster centers. While there are different methods for this task, in our implementation we adopt the one proposed in [16]. Each data point is represented by its local density ρ and the distance δ to the nearest data point of larger local density. By regarding local density peaks as cluster centers, it is found that both ρ and δ are large for cluster centers, whereas either ρ or δ is small for the other data. Based on this difference, we sort the data according to their γ = ρδ, and those with the largest γ's are selected as the cluster centers. Because the original DSets algorithm detects clusters sequentially, we do not know in advance which data will be included when extracting each cluster. In order to extract a target cluster containing a specified data point, the game dynamics must be modified to serve this purpose [10]. With InImDyn, the weights of data are updated iteratively according to Eq. (6). Even if we assign the initial weight of the seed to be 1, it is still possible that the seed is assigned a zero weight, which means it is not in the obtained cluster. Therefore we need a different weight updating


method. In the iteration, each state $x^{(t)}$ can be regarded as an approximation of the final weight vector, and the final weights are shown to be equal to

\[ x_i^D = \begin{cases} \dfrac{w_D(i)}{W(D)}, & \text{if } i \in D, \\ 0, & \text{otherwise.} \end{cases} \tag{7} \]

Consequently, each time the most infective strategy y is selected with the F function in Eq. (6), we can use Eq. (7) to update the weights, instead of Eq. (6). In this way, once a data point is selected, it will never be assigned a zero weight, and it will stay in the obtained cluster. This guarantees that the obtained cluster contains the seed, since the seed is the first selected data point. However, the recursive form of $w_D(i)$ in Eq. (2) means a large computation load, especially if the cluster size is large. We therefore explore an approximation of $w_D(i)$ to improve the computation efficiency. As discussed in Sect. 2, $w_D(i)$ measures the relationship between $\Phi(i, D \setminus \{i\})$ and $\Phi(D \setminus \{i\})$. We make use of this relationship to estimate $w_D(i)$ as

\[ w_D(i) = \begin{cases} 1, & \text{if } |D| = 1, \\ \sum_{j \in D \setminus \{i\}} a_{ij}, & \text{if } |D| = 2, \\ \dfrac{\Phi(i, D \setminus \{i\})}{\Phi(D \setminus \{i\})}, & \text{otherwise.} \end{cases} \tag{8} \]

Given a dataset, we first detect the cluster centers. For each cluster center, we then extract the target cluster containing the cluster center. In this process, it is possible that some data are grouped into more than one cluster. We make use of the data weights to solve this problem and obtain the final result. As a large data weight means a large probability that a data point belongs to a cluster, we compare the weights assigned to a data point by each cluster, and group the data point into the cluster where it is assigned the largest weight. As a special type of dominant sets clustering, our approach is proposed to eliminate the impact of σ and generate a specified number of clusters with the DSets algorithm. While non-parametric similarity measures, e.g., cosine, can also be used to estimate data similarity, the work in [14] indicates that non-parametric measures usually generate unsatisfactory results. In addition, with non-parametric similarity measures we are not guaranteed to obtain the specified number of clusters.

3.3

Improvement Measure
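Since the improvement discussed in this subsection relies on the seeds found in the first step of Sect. 3.2, a sketch of that step may help. It follows the ρ, δ, γ = ρδ description, but the Gaussian density kernel and the cutoff `d_c` are assumptions borrowed from common practice with [16], and both function names are hypothetical. The second function illustrates the largest-weight rule for data claimed by several target clusters.

```python
import numpy as np

def density_peak_seeds(dist, n_clusters, d_c):
    """Select seeds as the points with the largest gamma = rho * delta.

    dist : (n, n) pairwise distance matrix
    d_c  : density cutoff (an assumed, user-chosen parameter)
    """
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0   # exclude the point itself
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        # distance to the nearest denser point; the densest point gets
        # its largest distance instead
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    gamma = rho * delta
    return np.argsort(gamma)[::-1][:n_clusters]

def assign_by_weight(weights):
    """Resolve overlaps: weights has shape (n_clusters, n_points).

    Each point goes to the cluster giving it the largest weight; points
    with no positive weight in any cluster are labelled -1.
    """
    labels = weights.argmax(axis=0)
    labels[weights.max(axis=0) <= 0] = -1
    return labels
```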

The DSets algorithm is computationally expensive in comparison with some other algorithms, e.g., k-means, DBSCAN and NCuts. The running time comparison of these four algorithms is shown in Table 1, where σ = 30d is adopted for the best average result for the DSets algorithm. It is evident that on all the datasets except for Iris, the running times of the DSets algorithm are much longer than those of the other algorithms. Even on the Iris dataset, only the DBSCAN algorithm consumes more running time than the DSets algorithm.


Table 1. Running time (ms) comparison of different clustering algorithms.
Rows: DSets, k-means, NCuts, DBSCAN. Columns: Aggregation, Compound, Pathbased, D31, R15, Flame, Wine, Iris.
[Cell layout not recoverable; values in extraction order: 473.9, 132.5, 66.9, 4.5, 2.2, 1.8, 100.2, 12377.9, 1275.0, 51.3, 193.0, 56.6, 29.5, 3701.5, 46.4, 25.3, 16.2, 487.0, 52.9, 19.7, 1.2, 2.0, 1.5, 113.8, 19.3, 12.6, 10.5, 108.4, 10.3, 42.5, 9.3, 8.9.]

As our approach is based on dominant set extraction, it is also afflicted by the large computation load. In our opinion, the reason for the large computation load of the DSets algorithm is twofold. First, the clusters are obtained sequentially and each cluster is extracted by updating the data weights iteratively. This means a large number of iterations and leads to a large computation load, especially if there is a large number of clusters. Second, each cluster is detected in all the unclustered data, although the data in a cluster usually correspond to a small subset of the unclustered data. Considering that extracting clusters by updating the data weights is inherent in the DSets algorithm, we choose to explore measures to reduce the computation load based on the second reason. Since one cluster usually corresponds to a subset of the unclustered data, one natural solution is to extract a cluster within a part, instead of all, of the unclustered data. However, with the original DSets algorithm it is not clear which data will be included in a cluster before the cluster is obtained. In this case, it is not possible to reduce the computation load by extracting a cluster in a subset of the unclustered data. Fortunately, in our algorithm the cluster centers are detected in the first step, and the clusters are then extracted to include these cluster centers. As a cluster center and the data farthest from it are unlikely to be in the same cluster, it is not necessary to extract the target cluster in all the unclustered data. Instead, we can safely discard the farthest data and work with only the nearest neighbors of the cluster centers. As the amount of data used in the calculation is reduced, the computation load is expected to decrease correspondingly. Furthermore, since the major memory load is caused by the pairwise data similarity matrix, and the matrix size is the square of the number of data, the memory load can be reduced significantly.
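The restriction to the nearest neighbors of a seed amounts to slicing the similarity matrix, as sketched below. The names `nearest_subset` and `submatrix` and the parameter `keep_ratio` are hypothetical. The sketch only shows the bookkeeping, not the cluster extraction itself.

```python
import numpy as np

def nearest_subset(dist_to_center, keep_ratio):
    """Indices of the keep_ratio fraction of data nearest to one seed."""
    n_keep = max(1, int(round(keep_ratio * len(dist_to_center))))
    return np.sort(np.argsort(dist_to_center)[:n_keep])

def submatrix(A, idx):
    """Similarity matrix restricted to the retained data."""
    return A[np.ix_(idx, idx)]
```

Because the matrix has n² entries, keeping a fraction r of the data leaves a fraction r² of the entries: keeping half of the data leaves a quarter of the matrix, and keeping 10% leaves 1%.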

4

Experiments

In this part, we first illustrate the effect of the improvement measure presented in Sect. 3 in reducing the computation load. Then we compare the running time and clustering results of our approach with those of the original DSets algorithm based on tuning of the parameter σ.

4.1

Eﬀect in Reducing Computation Load

In Sect. 3 we show that it is possible to reduce the computation load by discarding the data farthest from the cluster centers. In order to test to which degree the


farthest data can be discarded without degrading the clustering results, the data are sorted by their distances to the cluster center. Then different percentages of the farthest data are discarded and the corresponding clustering results are recorded in Fig. 2(a).

[Figure 2. Panel (a), Clustering result: NMI versus percentage of discarded data. Panel (b), Running time: running time in ms (logarithmic axis) versus percentage of discarded data. Curves are shown for the Aggregation, Compound, Pathbased, D31, R15, Flame, Wine and Iris datasets.]

Fig. 2. Clustering results and running time with respect to diﬀerent percentage of discarded data.

It can be observed from Fig. 2(a) that on all the datasets except for Iris, at least 50% of the farthest data can be discarded without degrading the clustering results. With the D31 and R15 datasets in particular, we can safely discard up to 90% of the farthest data. In our opinion, the reason for this observation is that these two datasets contain 31 and 15 clusters respectively, and all the clusters in one dataset have the same amount of data. In this case, when extracting one cluster, the contained data are often less than 10% of the data. Therefore we can remove up to 90% of the data without influencing the clustering results. In contrast, with the Iris dataset, discarding even 10 percent of the data results in an evident decrease in the clustering accuracy. One possible reason is that on this dataset the features are extracted from iris flowers and Euclidean distance is not suitable for measuring the difference. On the other datasets, the number of clusters ranges from 2 to 7, and we can safely discard 50 to 60 percent of the farthest data when extracting target clusters. This means that the new matrix size is about 25% of the original one or smaller, indicating a significant reduction in memory load. Intuitively, reducing the data used in the computation will reduce the computation load. Similar to the last experiment, we show the running time with respect to different percentages of discarded data in Fig. 2(b). With all the datasets, the computation load is evidently reduced as the data used in the computation are reduced. It is worth noting that in our implementation we discard the data farthest from the cluster centers among all the data, instead of among the unclustered data. This helps to determine a fixed ratio for discarding the farthest data. Otherwise, as clusters are extracted, the amount of unclustered data becomes smaller and smaller, and it is difficult to find such a fixed ratio.


4.2

Comparison

To obtain a specified number of clusters with the DSets algorithm, we present the target dominant sets algorithm. Since Fig. 1(b) indicates that we can also achieve this purpose by selecting a proper σ, we compare the running time of these two methods. With our approach, we use the original version and no data are discarded to reduce the computation load. With DSets, we first use σ = d to obtain the clustering results. If the obtained number of clusters is smaller than the real one, we find the σ which generates the real number of clusters by bisection in [0, d]. Otherwise, we continue to test 10d, 20d, · · · and also determine the σ by bisection. Here we call this algorithm with parameter tuning parameter-tuned DSets (PT-DSets) for ease of expression. The running time of our approach and PT-DSets is shown in Table 2, where it is evident that on the majority of datasets our approach is much more efficient than parameter-tuned DSets.

Table 2. Running time (seconds) comparison between our algorithm and PT-DSets.

            Aggregation  Compound  Pathbased  D31     R15   Flame  Wine  Iris
PT-DSets    4.80         1.05      1.01       128.39  2.44  0.64   0.28  0.24
Ours        2.75         0.37      0.17       17.39   0.41  0.11   0.14  0.20

Table 3. Clustering results comparison between our algorithm and PT-DSets.

            Aggregation  Compound  Pathbased  D31   R15   Flame  Wine  Iris
PT-DSets    0.90         0.75      0.39       0.87  0.97  0.12   0.49  0.60
Ours        0.85         0.79      0.53       0.91  0.96  0.87   0.60  0.76

Finally, we compare the clustering accuracy of our approach and PT-DSets in Table 3. Our algorithm outperforms PT-DSets on 6 out of the 8 datasets, and is slightly outperformed by the latter on the other two datasets. This observation indicates that our approach performs better than PT-DSets in both clustering accuracy and computation efficiency.
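For reference, the parameter-tuning baseline of Sect. 4.2 can be sketched as a search over σ. The callable `count_clusters`, the cap `max_mult` on the multiples of d that are tried, and the iteration limit are assumptions for illustration. The sketch relies only on the observation from Fig. 1(b) that the number of clusters does not increase with σ.

```python
def tune_sigma(count_clusters, target, d_mean, max_mult=100, max_steps=60):
    """Search for a sigma that yields `target` clusters.

    count_clusters : hypothetical callable, sigma -> number of clusters
                     obtained by DSets with that sigma
    d_mean         : mean pairwise distance d
    Returns a suitable sigma, or None if none is found.
    """
    n = count_clusters(d_mean)
    if n == target:
        return d_mean
    if n < target:
        lo, hi = 0.0, d_mean            # too few clusters: shrink sigma
    else:
        lo, hi = d_mean, None           # too many clusters: try 10d, 20d, ...
        for mult in range(10, max_mult + 1, 10):
            sigma = mult * d_mean
            n = count_clusters(sigma)
            if n == target:
                return sigma
            if n < target:
                hi = sigma
                break
            lo = sigma
        if hi is None:
            return None
    for _ in range(max_steps):          # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        n = count_clusters(mid)
        if n == target:
            return mid
        if n > target:
            lo = mid
        else:
            hi = mid
    return None
```

Every call to `count_clusters` is a full DSets run, which is why this baseline costs considerably more than a single run of target extraction in Table 2.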

5

Conclusions

We present a target dominant sets algorithm to obtain a speciﬁed number of clusters with the dominant sets algorithm eﬃciently. In the ﬁrst step cluster centers are determined based on the local density relationship among the data. Then we extract the target clusters around the cluster centers, which is based on a revised infection and immunization dynamics. We further show that the computation and memory load of our approach can be reduced signiﬁcantly without degrading the clustering results by discarding the farthest data to the cluster centers. Experiments on some datasets indicate that our algorithm outperforms the dominant sets algorithm with parameter tuning in both clustering accuracy and computation eﬃciency.


Acknowledgement. This work is supported by the National Natural Science Foundation of China under Grant No. 61473045, and the Natural Science Foundation of Liaoning Province under Grant No. 20170540013 and No. 20170540005.

References
1. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
2. Bulo, S.R., Pelillo, M., Bomze, I.M.: Graph-based quadratic optimization: a fast evolutionary approach. Comput. Vis. Image Underst. 115(7), 984–995 (2011)
3. Bulo, S.R., Torsello, A., Pelillo, M.: A game-theoretic approach to partial clique enumeration. Image Vis. Comput. 27(7), 911–922 (2009)
4. Chang, H., Yeung, D.Y.: Robust path-based spectral clustering. Pattern Recogn. 41(1), 191–203 (2008)
5. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 790–799 (1995)
6. Ester, M., Kriegel, H.P., Sander, J., Xu, X.W.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining, pp. 226–231 (1996)
7. Fu, L., Medico, E.: Flame, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 8(1), 1–17 (2007)
8. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data 1(1), 1–30 (2007)
9. Hamid, R., Maddi, S., Johnson, A.Y., Bobick, A.F., Essa, I.A., Isbell, C.: A novel sequence representation for unsupervised analysis of human activities. Artif. Intell. 173, 1221–1244 (2009)
10. Hou, J., Xu, E., Chi, L., Xia, Q., Qi, N.: Dominant sets and target clique extraction. In: International Conference on Pattern Recognition, pp. 1831–1834 (2012)
11. Hou, J., Gao, H., Li, X.: DSets-DBSCAN: a parameter-free clustering algorithm. IEEE Trans. Image Process. 25(7), 3182–3193 (2016)
12. Hou, J., Gao, H., Li, X.: Feature combination via clustering. IEEE Trans. Neural Netw. Learn. Syst. 29(4), 896–907 (2018)
13. Hou, J., Pelillo, M.: A simple feature combination method based on dominant sets. Pattern Recogn. 46(11), 3129–3139 (2013)
14. Hou, J., Xia, Q., Qi, N.: Experimental study on dominant sets clustering. IET Comput. Vis. 9(2), 208–215 (2015)
15. Pavan, M., Pelillo, M.: Dominant sets and pairwise clustering. IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 167–172 (2007)
16. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014)
17. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
18. Veenman, C.J., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
19. Yang, X.W., Liu, H.R., Latecki, L.J.: Contour-based object detection as dominant set computation. Pattern Recogn. 45, 1927–1936 (2012)
20. Zahn, C.T.: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. 20(1), 68–86 (1971)
21. Zhu, X., Loy, C.C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1450–1457 (2014)

Input Pattern Complexity Determines Specialist and Generalist Populations in Drosophila Neural Network

Aaron Montero, Jessica Lopez-Hazas, and Francisco B. Rodriguez

Grupo de Neurocomputación Biológica, Dpto. de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, 28049 Madrid, Spain

Abstract. Neural heterogeneity has been reported as beneficial for information processing in neural networks. An example of this heterogeneity can be observed in the neural responses to stimuli, which divide the neurons into two populations: specialists and generalists. It has been observed in the neural network of the locust olfactory system that a balance of these two neural populations is crucial for achieving correct pattern recognition. However, these results may not be generalizable to other biological neural networks. Therefore, we took advantage of a recent biological study about the Drosophila connectome to study the balance of these two neural populations in its neural network. We conclude that the balance between specialists and generalists also occurs in Drosophila. This balancing process does not affect the neural network connectivity, since specialist and generalist neurons are not differentiable by the number of incoming connections.

Keywords: Pattern recognition · Bio-inspired neural networks · Neural computation · Supervised learning · Connectivity · Specialist neuron · Generalist neuron · Neural variability · Olfactory system

1

Introduction

In a recently published study [7], we observed that in a neural network that simulated the locust olfactory system, pattern recognition was influenced by the balance of specialist and generalist neurons. These neurons are defined by their responses to different stimuli: specialists respond to few of them and generalists to a wide range of them. Because of this, it is suggested that specialists are essential for discrimination, while generalists extract common features [15]. However, these results may not be generalizable to other insects, so we have taken advantage of a recent and extensive study on Drosophila [3] to test our results using a computational model that simulates its olfactory system. Some of the differences between the locust neural network and the Drosophila one are the number of neurons in the antennal lobe (AL) (∼1,000 [8] vs ∼250 [3]), the number of Kenyon cells (KCs) in the mushroom body (MB) (∼50,000 [8] vs ∼2,500 [3]) and the connection probability between the AL and the KCs (∼0.2 [6,8] vs ∼0.01 [3]), see Fig. 1. The study of specialism/generalism in Drosophila will not only serve to strengthen our results; we will also use the Drosophila data to analyze the connectome obtained by the balance of these two types of neurons.

The computational model used for this study on Drosophila is a single-hidden-layer neural network (Fig. 1) which represents the AL in its input layer, the KCs in the hidden layer and the MB output neurons (MBONs) in the output layer. AL and KCs are connected by a non-specific connectivity matrix [5] that increases the separability between different encoded stimuli. On the other hand, the connectivity matrix that links KCs with MBONs is subjected to a learning process that can be emulated by using Hebbian rules [2]. Finally, the information received by the MBONs is subject to a process of lateral inhibition, which is similar to the winner-take-all principle [9].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 296–303, 2018. https://doi.org/10.1007/978-3-030-01421-6_29

Fig. 1. Neural network model. Panel (a) shows the biological structure of the olfactory system of Drosophila. When the olfactory receptors react to odor plumes, the olfactory receptor neurons (ORNs) send the odor information to the AL through a fan-in connectivity network. The AL encodes this information and relays it via fan-out connections to the MB. Inside the MB, the odor information is received by the KCs, which are responsible for increasing its separability. Finally, the KCs send the stimulus signal to the MBONs, which are responsible for its final classification by means of convergent synaptic connections. Panel (b) shows the computational model used, which is a single-hidden-layer neural network with the AL as input layer (X), the KCs as hidden layer (Y) and the MBONs as output layer (Z). The connectivity matrices C and W link AL to KCs and KCs to MBONs, respectively. The thresholds or biases for the hidden and output layers are θj and εl, respectively.

In this computational model, we introduced Gaussian patterns for analyzing input data of diﬀerent complexities, since this complexity can be easily controlled through overlap between classes of these patterns. On the other hand,


to analyze the role of specialist and generalist neurons in the KC layer, we calculated the classification success for different combinations of these neurons. We started with a network with only generalist neurons and moved to a network with only specialists, going through several intermediate states. The results obtained by this process are consistent with those previously obtained for the locust [7]. Furthermore, we noted that the number of incoming connections to specialist and generalist neurons is similar. This suggests that the neural sensitivity of KCs seems to be due only to the spatial distribution of stimuli in the AL layer.

2 Methods

2.1 Gaussian Patterns

The AL encodes the olfactory stimuli so that a specific odorant stimulates specific glomeruli [10]. This activity can propagate to the rest of the glomeruli according to the intensity of the stimulus [10]. This behavior has led us to simulate the odor patterns as Gaussians. The specific glomeruli are represented by the expected value of the Gaussians, and the activity propagated because of the stimulus intensity is represented by their standard deviation. Varying the standard deviation of these Gaussians determines the degree of overlap between pattern classes and, therefore, their complexity level. Fig. 2 shows one example pattern for each of the 10 classes used, at 3 different complexity levels. These patterns are defined in a two-dimensional space: the X-axis defines the spatial location of AL neurons and the Y-axis shows their neural activity. Finally, since the neural response to a stimulus is not always identical, we added noise to the activity generated by stimuli.
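The pattern construction described above can be illustrated with a short Python sketch. It is not the authors' code: the evenly spaced class centres, the additive Gaussian noise, the clipping at zero and all default values (`sigma`, `noise`, `seed`) are assumptions made only for illustration. The layer size (250 AL neurons) and the 100 patterns per class follow the numbers given in the text.

```python
import numpy as np

def gaussian_patterns(n_classes=10, n_al=250, per_class=100,
                      sigma=10.0, noise=0.05, seed=0):
    """Return (X, labels): noisy Gaussian activity profiles over the AL axis.

    sigma controls the width of each profile and hence the overlap
    between classes; larger sigma means a harder problem.
    """
    rng = np.random.default_rng(seed)
    positions = np.arange(n_al)
    # Assumption: class centres are spread evenly along the AL axis.
    centres = (np.arange(n_classes) + 0.5) * n_al / n_classes
    X, labels = [], []
    for c, mu in enumerate(centres):
        clean = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)
        # Assumption: trial-to-trial variability is additive Gaussian noise.
        noisy = clean + noise * rng.standard_normal((per_class, n_al))
        X.append(np.clip(noisy, 0.0, None))  # activity cannot be negative
        labels.append(np.full(per_class, c))
    return np.vstack(X), np.concatenate(labels)
```

Calling `gaussian_patterns(sigma=5.0)` and `gaussian_patterns(sigma=40.0)` would give a low-overlap and a high-overlap data set, respectively, under these assumptions.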

Fig. 2. Gaussian patterns. This ﬁgure shows an example of Gaussian pattern for each of the 10 classes used and 3 diﬀerent complexities. The Gaussian patterns represent the AL neurons by the X-axis and their activity by the Y -axis. The variation of the standard deviation of these Gaussians determines the overlap degree and, therefore, their complexity level. Furthermore, we added noise to the activity generated by stimuli, since the neural response to them can change.

2.2 Neural Network and Neuron Model

The network model is a single-hidden-layer neural network that retains the most relevant structural properties of the insect olfactory system [4,6,7] (see Fig. 1). The input layer represents the AL, the hidden layer represents the KCs and the output layer is composed of MBON populations, between which there is lateral inhibition. The dimensions of this network are based on those of Drosophila: 250 input neurons [13], 2500 hidden neurons [14] and 100 output neurons (which are divided into populations of 10 neurons, one for each of the 10 pattern classes). These three layers are connected by the matrices C and W, which are initialized at the beginning of each learning process. The connectivity matrix C is established randomly by independent Bernoulli processes: each connection exists with probability pc (pc = 0.01 in Drosophila [3]) and is absent with probability 1 − pc [4]. This non-specific connectivity matrix reflects the individual connection variability of insects of the same species [5]. On the other hand, the matrix W is initialized by a random matrix N0, because its weights will be gradually strengthened or weakened using supervised Hebbian learning [7]. According to the learning rules, if the hidden layer neuron yj has fired and the output neuron zl should fire according to the output target, then the connection between these neurons (wlj) is reinforced with a probability p+. If the output neuron should not fire, the connection is weakened with the same probability. Conversely, if the hidden layer neuron yj has not fired and the output neuron zl should fire, then the connection between these neurons is weakened with a probability p−. The values chosen for these Hebbian probabilities were p+ = 1 and p− = 0.05 because of their good learning performance [4,7].
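A minimal Python sketch of the two ingredients just described, the Bernoulli connectivity matrix and one supervised Hebbian update, might look as follows. The function names, the fixed weight step `dw` and the clipping of weights at zero are assumptions introduced here; the text only states which connections are reinforced or weakened and with which probabilities.

```python
import numpy as np

def init_connectivity(n_kc, n_al, p_c, rng):
    """Binary matrix C: each AL-to-KC connection exists with probability p_c."""
    return (rng.random((n_kc, n_al)) < p_c).astype(np.int8)

def hebbian_step(W, y, target, rng, p_plus=1.0, p_minus=0.05, dw=1.0):
    """One stochastic update of W (shape: n_out x n_kc) for a single pattern.

    y      : binary KC activity, shape (n_kc,)
    target : binary desired output activity, shape (n_out,)
    dw     : hypothetical weight step; not specified in the text.
    """
    fired = y.astype(bool)[None, :]        # did KC j fire?
    should = target.astype(bool)[:, None]  # should output l fire?
    r = rng.random(W.shape)
    # KC fired and output should fire: reinforce with probability p_plus.
    strengthen = fired & should & (r < p_plus)
    # KC fired but output should not fire: weaken with probability p_plus.
    # KC silent but output should fire: weaken with probability p_minus.
    weaken = (fired & ~should & (r < p_plus)) | (~fired & should & (r < p_minus))
    # Assumption: weights are kept non-negative.
    return np.clip(W + dw * strengthen - dw * weaken, 0.0, None)
```

With the Drosophila dimensions from the text, `init_connectivity(2500, 250, 0.01, np.random.default_rng(0))` yields a matrix in which roughly 1% of the entries are 1.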
In terms of the neuron model, and taking into account the simple dynamics of KCs (mostly silent, a single spike followed by a reset, with a response produced by the coincidence of concurrent spikes) [8], we chose the McCulloch-Pitts model. This neuron model changes slightly for the MBONs given the lateral inhibition present in them [12]. Hence, the equations for the KCs and MBONs are as follows:

$$y_j = \varphi\Big(\sum_{i=1}^{N_{AL}} c_{ji} x_i - \theta_j\Big), \quad j = 1, \ldots, N_{KC}, \tag{1}$$

$$z_l = \varphi\Big(\sum_{j=1}^{N_{KC}} w_{lj} y_j - \frac{1}{N_{class}} \sum_{k=1}^{N_{class}} \sum_{j=1}^{N_{KC}} w_{kj} y_j - \varepsilon_l\Big), \quad l = 1, \ldots, N_{class}, \tag{2}$$

where xi, yj and zl are activation states for an input neuron, a hidden neuron and a group of MBONs specialized in a certain pattern class, respectively. The input and hidden layers are linked by the weights cji, and the hidden and output layers by the weights wlj. The neural thresholds (biases) for the hidden and output layers are θj and εl. Finally, the Heaviside activation function ϕ is 0 when its argument is negative or 0, and 1 otherwise. In the case of MBONs, where we used the winner-take-all concept [9], the activation function ϕ is only 1 for the winner MBON group.

Finally, we used different thresholds for KCs (heterogeneous thresholds) and the same threshold for all MBONs (homogeneous threshold). The reasons for using heterogeneous thresholds in KCs are that they exist in this kind of neuron [8] and that their use in neural networks can improve pattern recognition [6]. To select different threshold values for KCs, we used the concept of limit threshold in the training phase [6,7]. The limit threshold is the neural activity of a neuron generated by a given stimulus and, therefore, the minimum threshold value for which the neuron will not react to it. We extracted the distribution of limit thresholds for each KC and used these values to make the KCs react randomly to a percentage of patterns in order to introduce a greater differentiation between specialists and generalists in the KC layer [7]. However, this variability was not needed in the MBON layer: a homogeneous threshold for all neurons is enough because of the learning process in the matrix W and, furthermore, there are no records in biology of heterogeneous thresholds in these neurons.

2.3 Selection Criteria of Specialist and Generalist Neurons

Specialist neurons respond selectively to stimuli, while generalists code for multiple stimuli [1]. Based on this definition, we can assume the extreme case in which specialists respond only to one odorant class and generalists respond to all of them (10 pattern classes in our case). However, in a previous study [7], we observed that the computational model worked better when we did not exclude intermediate sensitivities (the neural sensitivity being the number of different stimuli to which a neuron responds). Therefore, we decided to divide neural sensitivities equally between specialist and generalist neurons. Specialists are those with a neural sensitivity from 1 to 5 and generalists are those with a neural sensitivity from 6 to 10. Once specialists and generalists had been defined, we made two sets, one for each type of neuron. These sets are used to create a new KC layer with the same dimensions as the original but with the percentages of these two types of neurons that we choose. To observe their impact on the classification success, the KC layer starts with all generalist neurons, which are gradually replaced by specialist neurons. This balancing process allows us to estimate which combination is the most suitable for pattern recognition.
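The selection criterion can be sketched in Python as follows. The KC activation follows Eq. (1) with the Heaviside function; how a KC is counted as "responding" to a class is not detailed in this section, so the sketch assumes that responding to at least one pattern of a class is enough. All function names are hypothetical.

```python
import numpy as np

def kc_responses(X, C, theta):
    """Eq. (1): binary KC activity for every pattern, shape (n_patterns, n_kc).

    The Heaviside function is 1 only for strictly positive arguments.
    """
    return (X @ C.T - theta > 0).astype(np.int8)

def neural_sensitivity(Y, labels, n_classes):
    """Number of classes each KC responds to.

    Assumption: a KC 'responds' to a class if it fires for at least
    one pattern of that class.
    """
    responds = np.stack([Y[labels == c].any(axis=0) for c in range(n_classes)])
    return responds.sum(axis=0)

def split_populations(sensitivity, n_classes=10):
    """Indices of specialists (sensitivity 1..5) and generalists (6..10)."""
    half = n_classes // 2
    specialists = np.flatnonzero((sensitivity >= 1) & (sensitivity <= half))
    generalists = np.flatnonzero(sensitivity > half)
    return specialists, generalists
```

KCs with sensitivity 0 never fire and fall into neither set under this sketch; a new KC layer with a chosen specialist percentage could then be assembled by sampling indices from the two returned sets.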

3 Results

The following results are averages over 10 simulations with 5-fold cross-validation and supervised Hebbian learning, for a total of 1000 Gaussian patterns (100 for each of the 10 pattern classes).

3.1 Balance of Specialists and Generalists

In Fig. 3, we can see that when the overlap is less than 28%, the maximum success is achieved for all combinations of specialist and generalist neurons in the KCs. Once this percentage of overlap has been exceeded, the maximum classification success rate is only achieved for a specific balance of these neurons. This balance initially requires a small number of specialists (10–20%), but for overlaps greater than 70% this number increases quickly. As a result of this growth, the neural network of Drosophila ultimately needs only specialists to achieve the highest classification success for input patterns with extremely high overlap (∼90%) and, therefore, high complexity. These results are consistent with those observed for locust [7]. The only notable difference between the two sets of results is that the region of balance between specialists and generalists is larger in Drosophila, and the percentage of specialist neurons required in it is usually lower. This variation may be due to the fact that the connection probability between AL and MB in Drosophila (pc ∼ 0.01) [3] is much lower than the one estimated for the locust (pc ∼ 0.2) [6,8]. Therefore, the amount of odor information transmitted through this lower connectivity will also be lower, and the Drosophila system initially requires a larger number of generalists to offset this loss.

Fig. 3. Relationship between overlap, the required percentage of specialist neurons and classification success. This picture shows the evolution of the classiﬁcation success and the percentage of specialists required to achieve this success based on the overlap between patterns. When the overlap is less than 28%, the maximum success is achieved for all combinations of specialists and generalists in KCs. For an overlap from 28% to 90%, the system requires a balance between these two types of neurons to classify correctly. During this period, the number of specialists required by the system increases quickly. Finally, for overlaps higher than 90%, the classiﬁcation gets worse and the system only needs specialists for improving its performance.

3.2 Neural Sensitivity Independent of the Number of Connections

As we mentioned previously, we have relied on a recent and extensive study on Drosophila [3] to analyze the role of specialist and generalist neurons in its olfactory system. This study differentiates KCs according to their incoming connections, which led us to wonder whether the randomness of the network that connects AL to KCs, matrix C (see Fig. 1), disappears after the balance between specialists and generalists. As shown in panel (a) of Fig. 4, the connectivity distributions of the initial matrix C and the solution matrix C (after the balancing process) are similar. The random structure of the connectivity may survive the balancing process because the connectivity distributions of specialists and generalists are similar, as shown in panel (b) of Fig. 4. Therefore, when we modify the specialist and generalist populations in the KC layer (Subsect. 2.3), we do not affect the number of connections between AL and KCs or how they are distributed. This leads us to think that the neural sensitivity of a neuron is not directly proportional to the number of incoming connections, but is mainly due to the spatial distribution of stimuli in the AL input layer.

Fig. 4. Number of connections to Kenyon cells for initial and solution connectivity matrices and for specialist and generalist neurons. These panels show the mean values for different simulations and overlap degrees; the standard deviations of these values are represented by error bars. Panel (a) shows the connectivity distributions of the initial random matrix and the solution matrix obtained by the optimal balance between specialists and generalists. Panel (b) shows the connectivity distributions of specialist and generalist neurons.

4 Discussion and Conclusions

In a previous study [7], we analyzed computationally what proportion of specialist and generalist neurons was suitable to improve the neural network learning of the olfactory system. We noted that when the complexity of the patterns was low, the system could reach the maximum classification success with almost any ratio of specialists and generalists and, therefore, their roles in pattern recognition were unspecific. For intermediate complexities, the system required a balance between these types of neurons (both were relevant). Finally, when the input complexity was high, the pattern recognition problem was such that only specialist neurons could improve the classification success. However, it was not clear whether these results for the locust olfactory system would generalize to other insects. We therefore decided to study the same question for Drosophila and to analyze the connectome resulting from the balance between specialists and generalists. Using Gaussian patterns with different levels of overlap (complexity) and a Drosophila-inspired neural network, we observed that the results are consistent with those obtained for the locust. Furthermore, the balance between specialist and generalist neurons does not affect the randomness of the connections between AL and MB, in agreement with the biological facts. This is due to the similar number of incoming connections of these two types of neurons, which means that the neural sensitivity of KCs seems to be related only to the spatial distribution of stimuli in the AL. Therefore, the regularization of the ratio of specialists and generalists could be applied in randomized neural networks [11] to improve their classification without removing their randomness.

Acknowledgments. This research was supported by MINECO/FEDER projects TIN2014-54580-R and TIN2017-84452-R (http://www.mineco.gob.es/). We also thank Ramon Huerta for his useful discussions.

References

1. Christensen, T.A.: Making scents out of spatial and temporal codes in specialist and generalist olfactory networks. Chem. Senses 30, 283–284 (2005)
2. Dubnau, J., Grady, L., Kitamoto, T., Tully, T.: Disruption of neurotransmission in drosophila mushroom body blocks retrieval but not acquisition of memory. Nature 411(6836), 476–480 (2001)
3. Eichler, K., et al.: The complete connectome of a learning and memory centre in an insect brain. Nature 548(7666), 175 (2017)
4. Huerta, R., Nowotny, T., Garcia-Sanchez, M., Abarbanel, H.D.I., Rabinovich, M.I.: Learning classification in the olfactory system of insects. Neural Comput. 16, 1601–1640 (2004)
5. Masuda-Nakagawa, L.M., Tanaka, N.K., O'Kane, C.J.: Stereotypic and random patterns of connectivity in the larval mushroom body calyx of Drosophila. Proc. Natl. Acad. Sci. USA 102, 19027–19032 (2005)
6. Montero, A., Huerta, R., Rodriguez, F.B.: Regulation of specialists and generalists by neural variability improves pattern recognition performance. Neurocomputing 151, 69–77 (2015)
7. Montero, A., Huerta, R., Rodriguez, F.B.: Stimulus space complexity determines the ratio of specialist and generalist neurons during pattern recognition. J. Frankl. Inst. 355(5), 2951–2977 (2018)
8. Perez-Orive, J., Mazor, O., Turner, G.C., Cassenaer, S., Wilson, R.I., Laurent, G.: Oscillations and sparsening of odor representations in the mushroom body. Science 297(5580), 359–365 (2002)
9. Rabinovich, M.I., Huerta, R., Volkovskii, A., Abarbanel, H.D., Stopfer, M., Laurent, G.: Dynamical coding of sensory information with competitive networks. J. Physiol. Paris 94(5–6), 465–471 (2000)
10. Rubin, J.E., Katz, L.C.: Optical imaging of odorant representations in the mammalian olfactory bulb. J. Neurophysiol. 23, 449–511 (1999)
11. Scardapane, S., Wang, D.: Randomness in neural networks: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 7(2), e1200 (2017)
12. Schürmann, F.W., Frambach, I., Elekes, K.: Gabaergic synaptic connections in mushroom bodies of insect brains. Acta Biol. Hung. 59, 173–181 (2008)
13. Shen, H.-C., Wei, J.-Y., Chu, S.-Y., Chung, P.-C., Hsu, T.-C., Hung-Hsiang, Y.: Morphogenetic studies of the drosophila DA1 ventral olfactory projection neuron. PloS One 11(5), e0155384 (2016)
14. Turner, G.C., Bazhenov, M., Laurent, G.: Olfactory representations by drosophila mushroom body neurons. J. Neurophysiol. 99, 734–746 (2008)
15. Wilson, R.I., Turner, G.C., Laurent, G.: Transformation of olfactory representations in the drosophila antennal lobe. Science 303(5656), 366–370 (2004)

A Hybrid Planning Strategy Through Learning from Vision for Target-Directed Navigation

Xiaomao Zhou1,2(B), Cornelius Weber2, Chandrakant Bothe2, and Stefan Wermter2

1 College of Automation, Harbin Engineering University, Nantong Street 145, Harbin 150001, China
2 Department of Informatics, University of Hamburg, Vogt-Kölln-Strasse 30, 22527 Hamburg, Germany
{zhou,weber,bothe,wermter}@informatik.uni-hamburg.de
http://www.informatik.uni-hamburg.de/WTM

Abstract. In this paper, we propose a goal-directed navigation system consisting of two planning strategies that both rely on vision but work on different scales. The first one works on a global scale and is responsible for generating spatial trajectories leading to the neighboring area of the target. It is a biologically inspired neural planning and navigation model involving learned representations of place and head-direction (HD) cells, where a planning network is trained to predict the neural activities of these cell representations given selected action signals. Recursive prediction and optimization of the continuous action signals generates goal-directed activation sequences, in which states and action spaces are represented by the population of place-, HD- and motor neuron activities. To compensate for the remaining error from this look-ahead model-based planning, a second planning strategy relies on visual recognition and performs target-driven reaching on a local scale so that the robot can reach the target with a finer accuracy. Experimental results show that by combining these two planning strategies the robot can precisely navigate to a distant target.

Keywords: Navigation · Place cell · Head-direction cell · Vision-recognition

1 Introduction

Studies in neuroscience have revealed that animals' spatial cognition and planning behaviors during navigation involve certain types of location- and direction-sensitive cells in the hippocampus, which support an animal's sense of place and direction [1,2]. More recent studies suggest that these spatially related firing activities also underlie animals' behavioral decisions [3].

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 304–311, 2018.
https://doi.org/10.1007/978-3-030-01421-6_30

Look-Ahead Planning Based on Learned Place and Head-Direction Cells


Most existing approaches for modeling hippocampal cells focus only on how to develop the location- or direction-related firing patterns, while only a few address the computational principle underlying the formation of these firing activities [4]. Slow feature analysis (SFA) [5] addresses this question with an unsupervised learning algorithm that extracts slowly varying features from fast-changing source signals based on the slowness principle. In our previous work, place and HD cells were simultaneously learned from visual inputs using a modified SFA learning algorithm which can develop separate populations of place and HD cell types by restricting their learning to separate phases of spatial exploration [6]. However, there remains the question of how to use the metric information hidden in these cell activities, which are obtained by unsupervised learning, to support a navigation task.

In this paper, based on the learned cell representations, we propose a navigation model that performs forward look-ahead planning and predicts a sequence of neural activities encoding intermediate waypoints from a starting position to a goal position, where the spatial positional state and directional state are represented by the learned place and HD cell representations, respectively. Furthermore, inspired by the biological finding that place cells are able to generate future sequences encoding spatial trajectories towards remembered goals, which demonstrates their predictive role in navigation [7], we propose a model of their functional role in directing spatial behaviors.

Here, we mainly introduce the look-ahead planning, whose architecture is shown in Fig. 1. The front part (visual processing part) consists of two parallel image-processing channels, each with a different network, for the emergence of place and HD cells, respectively. For the unsupervised training and network parameters please refer to our previous work [6]. The latter part (route planning part) is a world model that supports the imaginary planning in goal-directed navigation, where the world state is represented by the ensemble activity of place and HD cells.

Fig. 1. An overview of the system architecture. The immediate response of the trained place or HD cell network to an image resembles the ﬁring activity of place and HD cells at a certain position or to a certain direction where the image is captured. The world model trained based on the learned cell representations is used to support look-ahead planning.


However, such model-based forward planning suffers from significant error accumulation when dealing with long-range predictions. Furthermore, it takes into account only the place cell representations of the target, irrespective of specific visual properties of the target. In many cases, this planning can only lead the robot to the neighboring area of a target, instead of to the precise target position. To solve this problem, we propose a second planning strategy that takes over after the look-ahead planning. Its aim is to recognize the target based on vision and to move directly towards it once it has been recognized.

2 Hybrid Planning Strategy

Based on information learned from vision, the proposed hybrid planning strategy uses two different coordinate systems. The first one is based on space representations which are obtained in an unsupervised way. The second one is based directly on visual representations of the goal. The concept of switching between different planning strategies during navigation can be found in similar work [8,9].

2.1 Model-Based Look-Ahead Planning

For look-ahead planning, we first train a predictive world model network which predicts the subsequent state given the current state and action. The continuous spatial state is represented by the ensemble activity of place and HD cells, and the continuous action determines the change of moving direction during a transition, assuming a forward movement of constant speed. The world model is represented by a multi-layer perceptron (MLP) with 81 inputs (30 place cells + 50 HD cells + 1 rotation angle) and 80 outputs (30 place cells + 50 HD cells). The planning process is based on the recursive use of the fully trained world model, which generates a sequence of neural activations encoding the spatial trajectory from an initial location to a given target location (represented in the same place and HD cell space), together with the corresponding action commands [10]. To generate an optimal route, the planner first constructs a multi-step forward look-ahead probe by sequentially simulating the execution of each command in a given action sequence on a world model chain, as shown in Fig. 2. Then it optimizes the actions recursively in the direction of the desired goal location. The planning trajectory is optimized by modifying the actions via gradient descent to minimize the distance to the goal location. With this approach, routes towards a desired goal are imaginatively explored prior to execution by activating the place cell activities, while the corresponding moving directions along the route are encoded by HD cell activities. For each optimization iteration, the action is updated as follows:

$$\Delta a(t) = -\eta \frac{\partial E_{plan}}{\partial a(t)}, \quad \text{where } E_{plan} = \frac{1}{2} \sum_{k}^{K} \big(S_k^{goal} - S_k^{pred}\big)^2 \tag{1}$$

The state vector S consists of the ensemble firing activity of place and HD cells (K in total), and η is a constant learning rate. The objective is to optimize the action sequence a(t) such that the predicted ending state S^pred is close to the goal state S^goal, which is calculated by the SFA network given the image taken at the target position.
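The update rule in Eq. (1) can be illustrated with a small Python sketch. Several simplifications are assumed here and are not part of the paper: the learned MLP over place and HD cell activities is replaced by a toy kinematic world model whose state is (x, y, heading), the error is measured on the (x, y) part only, and the gradient is estimated by finite differences rather than by back-propagation through the world-model chain. All function names and parameter values are hypothetical.

```python
import numpy as np

def world_model(state, action, step=1.0):
    """Toy stand-in for the learned MLP: turn by `action`, then move forward."""
    x, y, heading = state
    heading = heading + action
    return np.array([x + step * np.cos(heading),
                     y + step * np.sin(heading),
                     heading])

def rollout(state, actions):
    """Chain the world model once per action and return the final state."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan_error(state, actions, goal):
    """E_plan = 1/2 * squared distance between predicted end and goal."""
    end = rollout(state, actions)
    return 0.5 * np.sum((goal - end[:2]) ** 2)

def optimize_actions(state, actions, goal, eta=0.01, iters=500, h=1e-5):
    """Gradient descent on the action sequence (finite-difference gradient)."""
    actions = np.array(actions, dtype=float)  # work on a copy
    for _ in range(iters):
        base = plan_error(state, actions, goal)
        grad = np.zeros_like(actions)
        for t in range(len(actions)):
            bumped = actions.copy()
            bumped[t] += h
            grad[t] = (plan_error(state, bumped, goal) - base) / h
        actions -= eta * grad  # Delta a(t) = -eta * dE_plan/da(t)
    return actions
```

For example, starting at the origin with five zero actions and a goal at (3, 3), the initial plan ends at (5, 0); calling `optimize_actions` bends the rollout towards the goal and lowers `plan_error`. In the paper's setting the same loop would operate on the 80-dimensional cell activity vector instead of this toy state.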

Fig. 2. An overview of the planning architecture. The world model which has been trained based on the learned cell representations is used to support look-ahead planning. Left (inset), the MLP used for one-step prediction. Right, multi-step prediction in the planning phase with feedback of the prediction error.

Note that planning assumes a predefined prediction depth chosen according to the distance to the goal location, while prior information about the optimal depth is not always available. To remove this assumption of the existing model [10], we propose an adaptive-depth approach where the planning starts with a 1-step prediction and incrementally increases the depth until adding one more prediction step would let the ending position of the current plan go beyond the goal location. During depth increase, the previous plan naturally provides a good proposal for the initialization of the next, deeper plan, since the previous plan is already optimized and fails only because of its small prediction depth. This enables the planner to find the best prediction depth towards a goal without any prior information and to optimize the trajectory efficiently.

2.2 Vision-Directed Reaching Based on Target Recognition

While the look-ahead planning can approximately navigate the robot towards the target position, the robot will either overstep or stop short of the target by about one step size and will rarely stop precisely on the target. To solve this problem, we adopt a second planning strategy that is based on object/scene recognition. This goal-directed planning is activated after the robot has executed the plans optimized by the look-ahead planning, at which point the robot is supposed to be close to the target and able to see it. Since a target always refers to particular objects (such as a chair or a computer) or specific scenes (such as a kitchen or a corridor), the robot can recognize the target. After perceiving the target, the robot adjusts its head direction to keep the target in the center of its view and moves towards it.

3 Experiments and Results

3.1 Simulation Experiment for Look-Ahead Planning

To test the look-ahead planning, we first used a simulated robot moving in a RatLab virtual-reality environment, which also generated the visual data for training the place and HD cell networks [11]. RatLab is designed to simulate a virtual rat doing random explorations and allows the user to modify the environmental parameters and movement patterns as needed. We first trained the place and HD cell networks by learning from the visual input with SFA, where the images generated during turning movements are used to train the place cell network, while the HD cell network is mainly trained using images from forward movements [6]. We trained 30 place cells and 50 HD cells whose ensemble activity encodes the spatial position and direction, respectively. Training results are partly shown in Fig. 3.

Fig. 3. Firing patterns of learned place and HD cells to diﬀerent positions or directions. (a) Firing patterns of 9 representative place cells. (b) Polar plots showing the ﬁring patterns of 9 representative HD cells.

For the planning result, Fig. 4(a) and (b) show plans with a fixed depth of 10 and adaptive-depth plans, respectively, where the planning in the place cell space is mapped to the 2D space by finding the position that yields the most similar firing pattern. The prediction depth of 10 for Fig. 4(a) was obtained empirically, and the initial route 0 is gradually optimized towards the desired goal location. The given example shows plans with a fairly good initialization; if the starting route 0 extends in a direction very different from the desired one, the planning may not be successful. This is because a long prediction makes the planning optimization, which is based on back-propagating through a long chain of world models, very difficult. Due to a vanishing gradient, initial segments receive too little correction. The adaptive-depth planning, in contrast, can start with a bad initialization, such as route 0 in Fig. 4(b): the planning starts with a 1-step prediction and is optimized immediately to a better direction through the world model chain, which at this point contains only one model step. This optimized plan then works as a good basis for initializing the next plan with one more step. This explains why the initial part of each route in Fig. 4(b) clusters in a narrow area. The planning depth increases incrementally until an appropriate plan (route 8) to the goal location is found.

To evaluate the look-ahead planning performance over the global area, we fixed the starting position and uniformly sampled 120 positions from the environment as targets. As shown in Fig. 5, the planning performance deteriorates as the distance between the target and the starting position increases. Especially when the target lies in the areas behind the second obstacle, which is far away, planning becomes very difficult and may fail. This might be due to the error accumulated along the long world model chain and to the difficulty of optimizing a plan with many steps by backpropagation.

Fig. 4. The proposed look-ahead trajectories with (a) a ﬁxed depth of 10 steps and (b) an adaptive depth. The solid dots represent the intermediate locations from the starting position to the target position (red star). The dashed line (route 9) represents a route that exceeds the goal. Planning is performed in place- and HD cell representation space and the trajectory based on actions of the plan is shown in x, y- space for visualisation. (Color ﬁgure online)

3.2 Real-World Experiment for Target Object Approaching

As a second step in our hybrid model, we test the vision-based target approaching in a real-world environment with a Turtlebot3 robot in a simple goal reaching task. The robot is placed at a position where the target is in the range of its vision (which refers to the state after executing the look-ahead planning) and its goal is to ﬁnd the target object and move close to it. For detecting and recognizing the target, we used the YOLO network which is fast and can accurately recognize, classify and localize objects [12]. If the robot cannot see the target object at the initial state, it will rotate locally with a constant speed until perceiving and recognizing the object with a certain probability. While trying to keep the target object in the center of the view, the robot moves directly towards it until reaching the threshold distance to the target (Fig. 6).
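The behavior described above (rotate until the target is detected, keep it centered, advance until a threshold distance is reached) can be sketched as a single decision function in Python. This is only an illustration: the detection format, the proportional steering law, the way the distance is obtained and every numeric default are assumptions, and no detector or robot interface is called.

```python
def approach_command(detection, image_width, stop_distance=0.3,
                     search_turn=0.5, forward_speed=0.1, gain=1.0,
                     min_confidence=0.5):
    """Return (linear, angular) velocity for one control step.

    detection: None, or a dict with the hypothetical fields 'confidence',
    'center_x' (pixels) and 'distance' (metres) for the target object.
    Assumed convention: positive angular velocity turns the robot left.
    """
    # No sufficiently confident detection: rotate in place and keep searching.
    if detection is None or detection["confidence"] < min_confidence:
        return 0.0, search_turn
    # Threshold distance reached: stop next to the target.
    if detection["distance"] <= stop_distance:
        return 0.0, 0.0
    # Horizontal offset of the target in [-1, 1]; positive means to the right.
    half = image_width / 2
    offset = (detection["center_x"] - half) / half
    # Advance while steering so that the target moves back to the image centre.
    return forward_speed, -gain * offset
```

In a real system the returned pair would be sent to the robot's velocity controller at every camera frame, with the detection dictionary filled in from the object detector's output.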


Fig. 5. (a) The prediction error of the world model increases with the number of the planning steps. (b) The planning error over the whole environment, where the starting position is ﬁxed (the black dot) and the target is sampled uniformly from the rectangular environment which has a size of 14×10 units and 120 positions are sampled from it. The error value is represented by the color. (Color ﬁgure online)

Fig. 6. Test of the object recognition and target approaching. The robot starts from a neighboring area and needs to reach the target orange. Left: The robot starts without the target in the current view (shown in the red box) and starts rotating. Middle: The robot perceives and recognizes the target and starts moving towards it. Right: The robot reaches the orange and stops just next to it. (Color ﬁgure online)

4 Conclusion and Future Work

We have proposed a navigation system that relies on a hybrid navigation strategy in order to precisely reach a target location. It consists of two planning strategies that work on different distance scales but both rely on vision. The first one is look-ahead planning, which works on a global coordinate system and proposes a spatial trajectory close to the desired goal location. The spatial state is represented by the ensemble activity of place and HD cells, which are modeled by learning directly from visual input based on an unsupervised SFA learning algorithm. The planning network allows looking into the future based on a chain of world model predictions and adaptively proposes optimized prediction steps to the goal location. The second part is a target approaching strategy working on a local scale, which enables object recognition and goal-directed reaching. By combining these two complementary strategies, the robot can move from a random position to a target position with high accuracy using just its vision system. As future work, we will extend the simulated scenario to the physical world, where place and HD cells are modeled on a real robot using its vision sensor and the planning is validated in a challenging dynamic environment.

Acknowledgments. We acknowledge support from the German Research Foundation DFG, project CML (TRR 169) and the EU, project SECURE (No. 642667).

References

1. O’Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford (1978)
2. Taube, J.S., Muller, R.U., Ranck, J.B.: Head-direction cells recorded from the postsubiculum in freely moving rats. I. Description and quantitative analysis. J. Neurosci. 10(2), 420–435 (1990)
3. Wills, T.J., Muessig, L., Cacucci, F.: The development of spatial behaviour and the hippocampal neural representation of space. Phil. Trans. R. Soc. B (2014). https://doi.org/10.1098/rstb.2013.0409
4. Zeno, P.J., Patel, S., Sobh, T.M.: Review of neurobiologically based mobile robot navigation system research performed since 2000. J. Robot. (2016). https://doi.org/10.1155/2016/8637251
5. Franzius, M., Sprekeler, H., Wiskott, L.: Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Comput. Biol. 3(8), e166 (2007)
6. Zhou, X., Weber, C., Wermter, S.: Robot localization and orientation detection based on place cells and head-direction cells. In: Lintas, A., Rovetta, S., Verschure, P.F.M.J., Villa, A.E.P. (eds.) ICANN 2017. LNCS, vol. 10613, pp. 137–145. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68600-4_17
7. Pfeiffer, B.E., Foster, D.J.: Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497(7447), 74–79 (2013)
8. Dollé, L., Sheynikhovich, D., Girard, B., Chavarriaga, R., Guillot, A.: Path planning versus cue responding: a bio-inspired model of switching between navigation strategies. Biol. Cybern. 103(4), 299–317 (2010)
9. Oess, T., Krichmar, J.L., Röhrbein, F.: A computational model for spatial navigation based on reference frames in the hippocampus, retrosplenial cortex, and posterior parietal cortex. Front. Neurorobotics (2017). https://doi.org/10.3389/fnbot.2017.00004
10. Thrun, S., Möller, K., Linden, A.: Planning with an adaptive world model. In: Advances in Neural Information Processing Systems, pp. 450–456 (1991)
11. Schönfeld, F., Wiskott, L.: RatLab: an easy to use tool for place code simulations. Front. Comput. Neurosci. (2013). https://doi.org/10.3389/fncom.2013.00104
12. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

Optimization/Recommendation

Check Regularization: Combining Modularity and Elasticity for Memory Consolidation

Taisuke Kobayashi

Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
https://kbys_t.gitlab.io/en/

Abstract. Catastrophic forgetting, in which old tasks are mostly forgotten when new tasks are learned, is a crucial problem for neural networks in autonomous robots. It arises because backpropagation overwrites all network parameters, and it can therefore be solved by not overwriting the parameters that are important for the old tasks. Regularization methods, represented by elastic weight consolidation, accordingly place globally stable equilibrium points at the optimal parameters for the old tasks. Unfortunately, they try to hold all parameters, even where the regularization is weak. This paper therefore proposes a regularization method, named Check regularization, that consolidates only the parameters important for the tasks and initializes the other parameters in preparation for future tasks. Simulations with two tasks learned sequentially show that the proposed method outperforms the previous method when the interference between the tasks is severe.

Keywords: Continual learning · Reinforcement learning · Locally stable equilibrium point

1 Introduction

Highly versatile robots, such as humanoid robots, are in high demand to perform various tasks on behalf of humans [3,8]. It is, however, difficult to design all of these tasks in advance, so such robots should learn new tasks through their daily activities in the real world, as humans do. Developing such “autonomous robots” is the final goal of this research.

Reinforcement learning (RL) is a methodology that lets an agent learn, through trial-and-error interaction with the environment, the optimal policy, i.e., the policy whose sampled actions maximize the accumulated rewards (the return) from the environment [19]. RL is therefore well suited to controlling the autonomous robots described above. Recently, state-of-the-art RL algorithms have outperformed humans at video and board games [18]. Even on real autonomous robots, they have acquired complicated tasks that could not previously be learned, such as manipulating deformable clothes [20] and picking objects from a pile [12].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 315–325, 2018. https://doi.org/10.1007/978-3-030-01421-6_31

Behind these successes is function approximation by (deep) neural networks (NN or DNN) [9]. Specifically, the policy and value functions in RL are approximated precisely by NNs, even when the state space is extremely large, as with raw images. Note that, in general, methods for stabilizing learning are employed because the convergence of NN parameters is not guaranteed.

However, backpropagation of the gradients of loss (or objective) functions causes a crucial problem, so-called “catastrophic forgetting,” in which old tasks are mostly forgotten when new tasks are learned [2,13]. This problem must be solved for autonomous robots to learn new tasks continually in the real world as humans do. In fields where offline learning is acceptable, such as image recognition, it can easily be sidestepped by preparing huge data sets that cover all the tasks, but storing all data is intractable for a robot since it ultimately requires unbounded storage and memory. The problem can also be sidestepped by preparing a different network for each task, but the cost of switching networks is not acceptable for autonomous robots, which must switch tasks seamlessly.

Three approaches to catastrophic forgetting have mainly been studied (details are given in the next section): (i) data for the old tasks are augmented by a generative model [5,17]; (ii) the network is modularized to avoid interference from the new tasks [1,10,22,23]; and (iii) the parameters are given elasticity toward their optimal values for the old tasks [7,11,14,24].
In approach (i), recent sophisticated DNN-based generative models [6,15] could reliably mitigate catastrophic forgetting, but the generative model itself then suffers from catastrophic forgetting, which must be mitigated in turn. Approach (ii) tackles the cause of catastrophic forgetting, but it basically has no mechanism for keeping the parameters at their optimal values. Approach (iii) has recently been established as a powerful method by designing the elasticity according to the importance of each parameter for the old tasks, but it has no mechanism for selecting only the minimum necessary parameters.

To let approaches (ii) and (iii) compensate for each other, this paper proposes a new regularization method, named Check regularization, which combines the modularization of the network with the elasticization of the parameters. The two, however, give different globally stable equilibrium points and so cannot be combined naively. Hence, a log regularization term is heuristically introduced to derive two locally stable equilibrium points, one corresponding to approach (ii) and one to approach (iii). Every parameter is regularized toward one of the two depending on its importance for the old tasks. That is, the parameters necessary for the tasks are regularized for elasticization, and the unnecessary ones for initialization.

Check regularization was evaluated through RL on two types of tasks in each of three kinds of simulations. Its effectiveness was verified in the simulations where there was strong interference between the tasks. To the best of my knowledge, this is the first study to combine the modularization of the network and the elasticization of the parameters, i.e., approaches (ii) and (iii), as a regularization.

2 Related Work

2.1 Data Augmentation

The most straightforward approach to mitigating catastrophic forgetting is to learn a generative model of the input (and output) data instead of storing all data. The storage and memory costs are constant when the generative model is an NN, such as a variational autoencoder [6] or a generative adversarial network [15] (or their relatives). Pseudo data for the old tasks are generated from the generative model, and mini-batches that mix the generated and observed data are used to learn the parameters [5,17]. The NN then maintains its performance on the old tasks without being biased toward the new tasks. However, preparing a generative model for each task is undesirable in terms of cost, whereas a single NN used to model multiple tasks would itself suffer from catastrophic forgetting.

2.2 Modular Network

Switching between completely different networks for the respective tasks is inappropriate from the viewpoint of control. It is, however, effective to divide a single NN implicitly and to modularize the regions of the NN (i.e., the parameters) used for the respective tasks. Ellefsen et al. [1] employed an evolutionary algorithm to promote such modularization, which did mitigate catastrophic forgetting, although the task performance just after switching deteriorated somewhat. Since complete modularization wastes resources that could be shared among tasks, Velez and Clune [22] developed a diffusion-based neuromodulation, which not only induced task-specific learning but also produced functional parameters for each subtask. Yu et al. [23] also proposed a way to select, depending on the gradients of the loss function, whether parameters should be shared among tasks or specialized for a single task.

Alternatively, an NN can easily be modularized by L1 regularization; in particular, its truncated version [10] is able to keep the parameters for the old tasks. The truncated L1 regularization and its gradient are defined by the following equations, and the gradient is illustrated in Fig. 1(a).

$$L_{L1} = \lambda_{L1} \|\theta\|_1 \tag{1}$$

$$\therefore\; g_{L1} = \begin{cases} \lambda_{L1}\,\mathrm{sign}(\theta) & |\theta| < \text{Threshold} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\lambda_{L1}$ is the magnitude of the regularization and $\theta$ are the parameters of the NN. In this paper, the threshold is set to half of the maximum value among all parameters. $\theta$ is updated as $\theta \leftarrow \theta - g_{L1}$. L1 regularization can be interpreted as having a globally (locally, for the truncated version) stable equilibrium point at $\theta = 0$. Even with the truncated version, however, catastrophic forgetting occurs because the parameters are never fixed.

Fig. 1. Gradients of regularization to mitigate the catastrophic forgetting: (a) the parameters smaller than a threshold converge to 0 for modularization, and the other parameters are no longer regularized so as to keep their values, although they are easily moved by the gradients of the loss function; (b) all parameters converge to $\theta^*$, while the convergence speed and strength depend on $F$.

2.3 Elastic Parameters

If the parameters important for the old tasks can be identified, keeping their values invariant avoids overwriting them. Methods represented by elastic weight consolidation (EWC) [7] regularize the parameters toward their optimal values for the old tasks, $\theta^*$, with the following gradient (see Fig. 1(b)).

$$L_{EWC} = \frac{\lambda_{EWC}}{2} F (\theta - \theta^*)^2 \tag{3}$$

$$\therefore\; g_{EWC} = \lambda_{EWC} F (\theta - \theta^*) \tag{4}$$

where $\lambda_{EWC}$ is the magnitude of the regularization and $F$ is the importance of the parameters, defined in the EWC paper as the diagonal of the Fisher information matrix. Specifically, $\theta^*$ and $F$ correspond to the mean and the precision of a diagonal multivariate Gaussian distribution over $\theta$, respectively. Several relatives have been proposed: incremental moment matching is employed to approximate $\theta^*$ and $F$ [11]; $F$ is defined in a biologically plausible manner [24]; and $L_{EWC}$ is converted from a sum of squared errors to a Kullback-Leibler divergence through variational inference [14]. Since the estimation method is not a target of verification here, this paper simply employs a moving average to estimate $\theta^*$ and $F$.

This design means that the parameters with high precision (small variance) are forced to converge to $\theta^*$, while the other parameters have room to learn the new tasks. Even so, that room is not fully utilized because this approach places globally stable equilibrium points at $\theta^*$. That is, all parameters aim to converge to $\theta^*$ regardless of the magnitude of regularization (i.e., $F$), so the number of parameters used for the old tasks is not minimized.
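The two baseline gradients of this section, Eqs. (2) and (4), can be sketched element-wise as follows. This is a minimal illustration rather than the author's code: the function names are hypothetical, the truncation follows the description in the Fig. 1(a) caption (only parameters below the threshold are shrunk), and taking the threshold from the largest absolute parameter value is an assumption.

```python
def half_max_threshold(theta):
    """Threshold as half of the largest parameter magnitude (use of abs is an assumption)."""
    return 0.5 * max(abs(t) for t in theta)


def truncated_l1_grad(theta, lam_l1, threshold):
    """Truncated L1 gradient: shrink small parameters toward 0, leave large ones alone."""
    def sign(x):
        return (x > 0) - (x < 0)
    return [lam_l1 * sign(t) if abs(t) < threshold else 0.0 for t in theta]


def ewc_grad(theta, theta_star, fisher, lam_ewc):
    """EWC gradient: pull every parameter toward theta_star, scaled by its importance F."""
    return [lam_ewc * f * (t - ts) for t, ts, f in zip(theta, theta_star, fisher)]


def regularization_step(theta, grad):
    """Apply the update theta <- theta - g used in the text."""
    return [t - g for t, g in zip(theta, grad)]
```

With `theta = [0.05, -0.02, 0.4]`, the threshold is 0.2, so only the first two parameters receive an L1 gradient, whereas `ewc_grad` returns a nonzero pull for every parameter that differs from its `theta_star`, however small its `fisher` value.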

3 Check Regularization

3.1 Formulation

As mentioned in the previous section, the modularization of the network and the elasticization of the parameters each let the parameters converge to their own globally (to be exact, locally in the truncated L1 regularization) stable equilibrium points. To achieve both properties, the globally stable equilibrium points should be converted into locally stable ones separated by an appropriate boundary, although this boundary cannot be decided easily (see the left side of Fig. 2). Our proposal, named Check regularization, sets the boundary automatically depending on the mean and the precision of each parameter, as shown on the right side of Fig. 2. The name “Check” comes from the shape of the gradient, which resembles a check mark. It is formulated as follows:

$$L_{Check} = \begin{cases} \lambda_{L1} \|\theta\|_1 & \theta\theta^* < 0 \\ \dfrac{\lambda_{EWC}}{2} F (\theta - \theta^*)^2 & |\theta| > \theta^* \\ \dfrac{\lambda_0}{\kappa} \ln(1 + \kappa|\theta|) + \lambda_1 |\theta| + \dfrac{\lambda_2}{2} \theta^2 & \text{otherwise} \end{cases} \tag{5}$$

where $\lambda_{0,1,2}$ and $\kappa$ are design parameters, which are derived analytically in the next subsection. $\lambda_{L1}$ and $\lambda_{EWC}$ are hyperparameters that take almost the same values as in the original methods. The third case defines the boundary that decides whether a parameter is assigned to modularization or to elasticization.

Fig. 2. Concept of Check regularization: when the modularization of the network and the elasticization of the parameters are combined, the boundary between them is difficult to determine; by adding a log regularization term, the boundary is determined automatically depending on the precision of the parameter.

The gradient of Check regularization is derived as follows:

$$g_{Check} = \begin{cases} \lambda_{L1}\,\mathrm{sign}(\theta) & \theta\theta^* < 0 \\ \lambda_{EWC} F (\theta - \theta^*) & |\theta| > \theta^* \\ \dfrac{\lambda_0\,\mathrm{sign}(\theta)}{1 + \kappa|\theta|} + \lambda_1\,\mathrm{sign}(\theta) + \lambda_2 \theta & \text{otherwise} \end{cases} \tag{6}$$

The first and second cases of the gradient are almost the same as Eqs. (2) and (4), respectively (the difference is the absence of the threshold).

3.2 Derivation of Design Parameters

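Now, $\lambda_{0,1,2}$ and $\kappa$ are designed uniquely so that an appropriate boundary separates the two locally stable equilibrium points. Note that $\theta^*$ is restricted to be positive in this subsection without loss of generality. In addition, since all parameters are independent, $\lambda_{0,1,2}$ and $\kappa$ are derived below for a single parameter $\theta$ (with $\theta^*$ and $F$). First, to make the cases connect naturally, the following three conditions are imposed.

$$\lim_{\theta \to +0} g_{Check} = \lambda_{L1}, \quad \left.\frac{\partial g_{Check}}{\partial \theta}\right|_{\theta=\theta^*} = \lambda_{EWC} F, \quad g_{Check}\big|_{\theta=\theta^*} = 0 \tag{7}$$

Next, an additional design parameter, $\eta$, which represents the boundary explicitly, is introduced so that the boundary lies in $[0, \theta^*]$.

$$g_{Check}\big|_{\theta=(1-\eta)\theta^*} = 0 \tag{8}$$

These conditional equations are still insufficient and leave the derivation ill-posed. Therefore, a constraint that uniquely determines the $\kappa$ giving two intersections of the gradient with the $\theta$ axis is additionally imposed:

$$\kappa = 4\lambda_2 / \lambda_0 \tag{9}$$

From the above five conditional equations, $\lambda_{0,1,2}$, $\kappa$, and $\eta$ are uniquely solved as follows (their derivations are omitted due to the page limitation).

$$\kappa = \frac{1}{\theta^*} \left\{ \frac{\lambda_{EWC} F \theta^* - \lambda_{L1}}{\lambda_{EWC} F \theta^* + \lambda_{L1}} + \sqrt{\left( \frac{\lambda_{EWC} F \theta^* - \lambda_{L1}}{\lambda_{EWC} F \theta^* + \lambda_{L1}} \right)^2 + 3} \right\} \tag{10}$$

$$\eta = \frac{(\kappa\theta^* - 1)(\kappa\theta^* + 3)}{\kappa\theta^* (\kappa\theta^* + 1)} \tag{11}$$

$$\lambda_2 = \begin{cases} \dfrac{\lambda_{L1}}{\theta^*} & \kappa\theta^* = 1 \\ \dfrac{\lambda_{EWC} F (\kappa\theta^* + 1)^2}{(\kappa\theta^* - 1)(\kappa\theta^* + 3)} & \text{otherwise} \end{cases} \tag{12}$$

$$\lambda_0 = 4\lambda_2 / \kappa \tag{13}$$

$$\lambda_1 = \lambda_{L1} - \lambda_0 \tag{14}$$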
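Since the derivation is omitted, the following LaTeX sketch shows one way to recover Eqs. (10)–(14) from conditions (7)–(9) for $\theta^* > 0$. It is a reconstruction under the notation above, with the shorthand $x = \kappa\theta^*$ and $r$ introduced here for convenience, and is not necessarily the route taken by the author.

```latex
\begin{align*}
g(\theta) &= \frac{\lambda_0}{1+\kappa\theta} + \lambda_1 + \lambda_2\theta,
  \qquad 0 \le \theta \le \theta^* \\
% first condition of (7): value at +0
g(+0) = \lambda_{L1}
  &\;\Rightarrow\; \lambda_1 = \lambda_{L1} - \lambda_0 \\
% second condition of (7) with (9), i.e. \lambda_0\kappa = 4\lambda_2
g'(\theta^*) = \lambda_2 - \frac{\lambda_0\kappa}{(1+x)^2} = \lambda_{EWC}F
  &\;\Rightarrow\; \lambda_2 = \frac{\lambda_{EWC}F\,(x+1)^2}{(x-1)(x+3)} \\
% third condition of (7)
g(\theta^*) = \lambda_{L1} + \lambda_2\theta^*\,\frac{x-3}{x+1} = 0
  &\;\Rightarrow\; \lambda_2 = \frac{\lambda_{L1}\,(x+1)}{\theta^*(3-x)} \\
% equate the two expressions for \lambda_2
x^2 - 2rx - 3 = 0,\quad
  r = \frac{\lambda_{EWC}F\theta^* - \lambda_{L1}}{\lambda_{EWC}F\theta^* + \lambda_{L1}}
  &\;\Rightarrow\; x = r + \sqrt{r^2+3} \\
% (1+\kappa\theta)\,g(\theta) is quadratic in \theta; use the product of its roots
(1-\eta)\theta^* \cdot \theta^* = \frac{\lambda_{L1}}{\lambda_2\kappa}
  &\;\Rightarrow\; \eta = \frac{(x-1)(x+3)}{x(x+1)}
\end{align*}
```

Setting $x = 1$ in the third line gives $\lambda_2 = \lambda_{L1}/\theta^*$, which is the first case of Eq. (12); this is the case where $F\theta^* = 0$ and the second expression becomes $0/0$.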

As for $\lambda_2$, two cases are prepared to obtain a numerically stable solution. In addition, a very small amount is added to $\theta^*$ because $\theta^*$ should not be 0 for stable calculation.

Let us confirm that the gradient formed by the derived design parameters changes depending on the precision $F$. Here, the mean $\theta^*$ is fixed to 0.1 since it does not change the properties of the gradient. The gradients $g_{Check}$ with low, middle, and high precisions are depicted in Figs. 3(a)–(c), respectively. As shown in Fig. 3(a), with low precision the boundary (the intersection with the lower value) is very close to $\theta^*$, so the modularization of the network is prioritized. As the precision increases, the boundary moves continuously toward 0, and finally, as in Fig. 3(c), the elasticization of the parameters becomes dominant. In this way, Check regularization decides the boundary between the initialization and the elasticization of the parameters automatically, without any additional hyperparameters.
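The design parameters and the resulting gradient can be checked numerically with a short sketch. It assumes $\theta^* > 0$ as in this subsection, uses Eqs. (10)–(14) in the form that follows from conditions (7)–(9), and is not the author's implementation; the function names, the tolerance `eps`, and the example values are hypothetical.

```python
import math


def check_design_params(theta_star, fisher, lam_l1, lam_ewc, eps=1e-9):
    """Design parameters of Eqs. (10)-(14) for one parameter (theta_star > 0)."""
    a = lam_ewc * fisher * theta_star
    r = (a - lam_l1) / (a + lam_l1)
    x = r + math.sqrt(r * r + 3.0)                 # x = kappa * theta_star, Eq. (10)
    kappa = x / theta_star
    eta = (x - 1.0) * (x + 3.0) / (x * (x + 1.0))  # boundary position, Eq. (11)
    if abs(x - 1.0) < eps:                         # degenerate case of Eq. (12)
        lam2 = lam_l1 / theta_star
    else:
        lam2 = lam_ewc * fisher * (x + 1.0) ** 2 / ((x - 1.0) * (x + 3.0))
    lam0 = 4.0 * lam2 / kappa                      # Eq. (13)
    lam1 = lam_l1 - lam0                           # Eq. (14)
    return {"kappa": kappa, "eta": eta, "lam0": lam0, "lam1": lam1, "lam2": lam2}


def check_grad(theta, theta_star, fisher, lam_l1, lam_ewc):
    """Gradient of Check regularization, Eq. (6), for theta_star > 0."""
    if theta * theta_star < 0:                     # opposite sign: L1 pull toward 0
        return lam_l1 * (1.0 if theta > 0 else -1.0)
    if abs(theta) > theta_star:                    # beyond theta_star: EWC pull
        return lam_ewc * fisher * (theta - theta_star)
    p = check_design_params(theta_star, fisher, lam_l1, lam_ewc)
    # Here 0 <= theta <= theta_star, so sign(theta) is taken as +1.
    return p["lam0"] / (1.0 + p["kappa"] * theta) + p["lam1"] + p["lam2"] * theta
```

For example, with `theta_star = 0.1`, `lam_l1 = 1e-2`, and `lam_ewc = 1.0`, the gradient vanishes at both `theta_star` and `(1 - eta) * theta_star`, and `eta` grows toward 1 as `fisher` increases, which matches the trend described for Figs. 3(a)–(c).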

Fig. 3. Examples of the gradients of Check regularization formed by the design parameters: the intersection of the gradient and θ axis (not θ∗ = 0.1), i.e., the boundary between the modularization of network and the elasticization of parameters, is automatically determined depending on the precision of the parameter, F .

Fig. 4. RL simulation environments: (a) Pendulum aims (i) to keep balance at the top and (ii) to maximize its angular velocity; (b) BallArm aims (i) to bring the tip of the arm close to the ball and (ii) to maximize the velocity of the ball; (c) Acrobot aims (i) to keep balance after swinging up and (ii) to maximize the angular velocity of the root axis.

4 Simulations

4.1 Conditions

The performance of Check regularization is verified in three kinds of RL simulations: (a) Pendulum, (b) BallArm, and (c) Acrobot, shown in Fig. 4. Each environment has two different tasks, which tend to interfere with each other. In the learning procedure, the target task is switched in turn every 300 episodes, and after the third switch, the remaining two are used for evaluation. The score is defined as a weighted mean of the average rewards normalized by the maximum reward, with weights inversely proportional to the episode number; it is therefore affected mainly by the first 50 or so episodes. This procedure is conducted 20 times.

To clarify the adverse effect of the interference between the tasks, reservoir computing [4] is used as the NN. It updates only the readout parameters, so it can be regarded as a linear regression model, which has the advantage that the parameters used for each task are clearly identifiable. The number of parameters is roughly the product of the number of neurons (500 in this paper) and the dimension of the action space ((a) 1, (b) 3, and (c) 2). In addition, experience replay is not applied so that observations from the old tasks are not reused. Instead, an actor-critic algorithm combined with eligibility traces [16,21] enables the current task to be learned efficiently. As baselines, the truncated L1 regularization [10] in Eq. (2) and EWC [7] in Eq. (4) are evaluated in the same manner.

The learning rate is set to 0.01/500 so as to avoid local optima, and the other RL hyperparameters are set to typical values (e.g., the discount rate is 0.99). $\lambda_{EWC}$ is set to $10^{-14}$ since $F$ is large when the learning rate is small. $\lambda_{L1}$ is heuristically set to $10^{-3}$, but for Check regularization only, it is multiplied by 10 since the gradient is small near $\theta^*$.

4.2 Results
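The score defined in Sect. 4.1 can be read as a harmonically weighted mean of normalized rewards. The sketch below is only one possible reading and rests on assumptions not stated in the text: episode k after a task switch is given weight 1/k, and rewards are normalized by a supplied maximum reward. The function names are hypothetical.

```python
def weighted_score(avg_rewards, max_reward):
    """Weighted mean of normalized rewards with weight 1/k for the k-th episode (assumed)."""
    weights = [1.0 / k for k in range(1, len(avg_rewards) + 1)]
    total = sum(weights)
    return sum(w * r / max_reward for w, r in zip(weights, avg_rewards)) / total


def early_weight_share(n_early, n_total):
    """Fraction of the total weight carried by the first n_early episodes."""
    def harmonic(n):
        return sum(1.0 / k for k in range(1, n + 1))
    return harmonic(n_early) / harmonic(n_total)
```

Under this reading, `early_weight_share(50, 300)` is roughly 0.7, so the first 50 of 300 episodes carry most of the weight. The score therefore mainly rewards fast recovery right after a task switch.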

Learning curves and scores are summarized in Figs. 5(a)–(c). The legends additionally give the means and standard deviations of the scores for the respective methods. As can be seen in Fig. 5, Check regularization outperformed both baselines in Acrobot and BallArm, whereas all methods succeeded in avoiding catastrophic forgetting in Pendulum.

Catastrophic forgetting was observed in Acrobot and BallArm, but not in Pendulum, with all methods, and in particular with the truncated L1 regularization. This is due to subtasks within the tasks, e.g., the swing-up motion in Acrobot and the motion of approaching the ball in BallArm, which cause the interference between the tasks. Although EWC mitigated catastrophic forgetting to a certain extent, its learning of the second task was sluggish. Check regularization, in contrast, succeeded in acquiring both tasks. This difference implies the importance of modularizing the network. Note that the elasticity in EWC and Check regularization is probably too strong, as can be deduced from the higher average rewards of the truncated L1 regularization in the last episodes.

Nevertheless, significant differences between Check regularization and the other methods could not be observed, owing to the fixed random network in the reservoir computing. Depending on the network structure, the number of parameters required to learn a task increased, and the parameters left to memorize multiple tasks became insufficient. More trials with fixed random seeds may show the validity of Check regularization statistically.

Fig. 5. Learning curves and scores for the respective environments; the target task was changed at the dashed lines: (a) all methods could avoid the catastrophic forgetting since the tasks hardly interfered with each other in practice; (b) EWC and Check regularization could keep the performance of both tasks in comparison with the truncated L1 regularization, although the performance of the first task seemed to be deteriorated by their elasticity; (c) Check regularization could immediately recover the performance of both tasks from the catastrophic forgetting.

5 Conclusion

This paper proposed a regularization method, named Check regularization, that combines two functions important for mitigating catastrophic forgetting: the modularization of the network and the elasticization of the parameters. In Check regularization, each parameter is given two locally stable equilibrium points, one for each function. The boundary between them is determined automatically according to the precision (and mean) of each parameter. As a result, the parameters necessary for the tasks are fixed and the unnecessary ones are initialized. Indeed, Check regularization outperformed the state-of-the-art method, i.e., EWC, in the RL simulations where the tasks interfered strongly. Future work is to apply the proposed method to curriculum learning in real autonomous robots.


Acknowledgement. This research has been supported by the Kayamori Foundation of Information Science Advancement.

References

1. Ellefsen, K.O., Mouret, J.B., Clune, J.: Neural modularity helps organisms evolve to learn new skills without forgetting old skills. PLoS Comput. Biol. 11(4), e1004128 (2015)
2. French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128–135 (1999)
3. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of Honda humanoid robot. In: IEEE International Conference on Robotics and Automation, vol. 2, pp. 1321–1326. IEEE (1998)
4. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
5. Kamra, N., Gupta, U., Liu, Y.: Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368 (2017)
6. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
7. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017)
8. Kobayashi, T., Aoyama, T., Sekiyama, K., Fukuda, T.: Selection algorithm for locomotion based on the evaluation of falling risk. IEEE Trans. Robot. 31(3), 750–765 (2015)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
10. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. 10, 777–801 (2009)
11. Lee, S.W., Kim, J.H., Jun, J., Ha, J.W., Zhang, B.T.: Overcoming catastrophic forgetting by incremental moment matching. In: Advances in Neural Information Processing Systems, pp. 4655–4665 (2017)
12. Levine, S., Pastor, P., Krizhevsky, A., Quillen, D.: Learning hand-eye coordination for robotic grasping with large-scale data collection. In: Kulić, D., Nakamura, Y., Khatib, O., Venture, G. (eds.) ISER 2016. SPAR, vol. 1, pp. 173–184. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-50115-4_16
13. McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: the sequential learning problem. In: Psychology of Learning and Motivation, vol. 24, pp. 109–165. Elsevier (1989)
14. Nguyen, C.V., Li, Y., Bui, T.D., Turner, R.E.: Variational continual learning. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=BkQqq0gRb
15. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
16. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. In: International Conference on Learning Representations, pp. 1–14 (2016)
17. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: Advances in Neural Information Processing Systems, pp. 2994–3003 (2017)
18. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
19. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
20. Tsurumine, Y., Cui, Y., Uchibe, E., Matsubara, T.: Deep dynamic policy programming for robot control with raw images. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1545–1550 (2017)
21. Van Seijen, H., Mahmood, A.R., Pilarski, P.M., Machado, M.C., Sutton, R.S.: True online temporal-difference learning. J. Mach. Learn. Res. 17(145), 1–40 (2016)
22. Velez, R., Clune, J.: Diffusion-based neuromodulation can eliminate catastrophic forgetting in simple neural networks. PLoS ONE 12(11), e0187736 (2017)
23. Yu, W., Turk, G., Liu, C.K.: Multi-task learning with gradient guided policy specialization. arXiv preprint arXiv:1709.07979 (2017)
24. Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: International Conference on Machine Learning, pp. 3987–3995 (2017)

Con-CNAME: A Contextual Multi-armed Bandit Algorithm for Personalized Recommendations

Xiaofang Zhang1,2, Qian Zhou2, Tieke He1, and Bin Liang2

1 State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China
2 School of Computer Science and Technology, Soochow University, Suzhou, China

Abstract. Reinforcement learning algorithms play an important role today and have been applied to many domains. For example, the personalized recommendation problem can be modelled as a contextual multi-armed bandit problem in reinforcement learning. In this paper, we propose a contextual bandit algorithm based on Contexts and the Chosen Number of Arm with Minimal Estimation, Con-CNAME for short. The continuous exploration and the contexts used in our algorithm address the cold start problem in recommender systems. Furthermore, the Con-CNAME algorithm can still make recommendations in emergency circumstances where contexts suddenly become unavailable. In the experimental evaluation, the reference range of the key parameters and the stability of Con-CNAME are discussed in detail. In addition, the performance of Con-CNAME is compared with that of some classic algorithms. Experimental results show that our algorithm outperforms several bandit algorithms.

Keywords: Recommender systems · Reinforcement learning · Multi-armed bandit · Context-aware

This work is supported in part by the National Key Research and Development Program of China (2016YFC0800805).

1 Introduction

Reinforcement learning (RL) is an important part of machine learning [1]. It has gained much attention in the last decade and can be combined with collaborative filtering, Bayesian networks, etc. for recommendations [2, 3]. In this work, an RL-based contextual Multi-Armed Bandit (MAB) algorithm named Con-CNAME is discussed for personalized recommendation.

The primary goal of recommender systems is to propose one or several items that users might be interested in. Items are, for example, the books, articles or music that the recommender system provides [4, 5]. Recommender systems need to focus on items that raise users’ interest and, at the same time, explore new items to improve users’ satisfaction. This creates an exploration-exploitation dilemma, which is the core of Multi-Armed Bandit (MAB) problems [6]. The payoff of a recommendation is widely measured by the Click-Through Rate (CTR) [7], so the goal of recommendation is to maximize the CTR over all users.

Personalized recommendation services identify the preferences of users and show web content that suits those preferences [8]. Classic collaborative recommender systems may not retain a high CTR if a large number of users or items are new to the system. This issue is referred to as the cold-start problem [9], and in such situations the recommendation task can be modelled as a contextual multi-armed bandit problem [10]. Contextual bandit approaches have already been studied in many fields of recommender systems [11].

We propose a context-aware bandit algorithm that tries to further improve the CTR obtained in personalized recommendations. Recommendations are made based on user feedback and prior information about contexts. The cold-start issue is addressed by continuous exploration and by contexts. Exploration means learning the payoff of new items for a particular user by recommending them. Exploitation means recommending the optimal items based on the payoffs observed so far. Experiments are conducted on the user click log dataset of the Yahoo! Front Page Today Module. The aim of our algorithm is to achieve higher CTRs than some existing bandit approaches.

The rest of the paper is organized as follows. Section 2 describes related work. In Sect. 3, we introduce our algorithm and discuss the influence of its key parameters. Section 4 discusses the experimental results. Section 5 concludes the paper.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 326–336, 2018. https://doi.org/10.1007/978-3-030-01421-6_32

2 Related Work Filtering-based and reinforcement learning methods are two main categories of recommendation algorithms [12]. In this paper, we focus on reinforcement learning methods. Reinforcement learning methods, such as MAB and Markov Decision Processes (MDPs) [13], are widely used in recommender systems. MDP-based approaches model the last k choices of a user as the state and the available items as the action set to maximize the long-run payoff. [14]. MAB-based approaches make recommendations by balancing exploration and exploitation, such as e-greedy [15], softmax [16], EXP3 [17] and UCB1 [6]. The e-greedy is the simplest approach among these context-free approaches, which always has competitive performance and is easy to be extended to various applications. Softmax makes recommendations according to a probability distribution based on user feedbacks. As a complicated variant of softmax, the main idea of EXP3 is to divide the payoff of an item by its chosen probability. UCB1 always recommends the item with the highest upper conﬁdence index. However, UCB1 needs to sweep all items during the initial period, it may be inappropriate for recommender systems whose items are huge. Contexts are considered, aiming at improving the effectiveness of recommendations. In a contextual MAB setting, there is a set of arms available to the algorithm at a time step which is associated with the contextual information vector. Generally, contexts represent the situations of the user when a recommendation is made, such as time, gender and age [18, 19]. Using the previously acquired knowledge and the context at the current time step, the algorithm chooses to show an arm and obtains a reward.


X. Zhang et al.

This reward depends on the contextual features and the chosen arm. The LinUCB algorithm was proposed to solve news article recommendation problems [20]. The Naive III and Linear Bayes approaches define a user group via a set of features that individual users may have in common, but need not have in common [21]. A MAB-based clustering approach constructs an item-cluster tree for recommender systems [22]. The CNAME and Asy-CNAME algorithms are based on the chosen number of the action with minimal estimation and are applied to recommender systems where prior information is unavailable [23]. Specifically, the CNAME algorithm chooses an arm according to an exploration probability that is based on the number of times the arm with the minimal estimation has been chosen. The exploration probability adapts to the practical environment, since this count makes full use of user feedback. To further improve the efficiency of the CNAME algorithm, the Asy-CNAME algorithm is updated in an asynchronous manner.
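To make the context-free strategies surveyed in this section concrete, the following minimal Python sketch shows the selection step of ε-greedy, softmax and UCB1. It is an illustration only: the function names, the `temperature` parameter and the list-based representation of estimates and counts are our assumptions, and EXP3 is omitted for brevity. The UCB1 index follows the standard form given in [6].

```python
import math
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

def softmax_probabilities(estimates, temperature):
    """Boltzmann distribution over arms; a lower temperature is greedier."""
    top = max(estimates)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in estimates]
    total = sum(weights)
    return [w / total for w in weights]

def ucb1(estimates, counts):
    """Play every arm once, then pick the arm with the highest upper confidence index."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # the initial sweep over all arms
    t = sum(counts)
    return max(
        range(len(estimates)),
        key=lambda a: estimates[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )
```

The initial sweep in `ucb1` is the step that becomes costly when the number of items is huge, which is the limitation noted above.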

3 Our Approach

In this section, we present a context-aware bandit approach for personalized recommendations. The approach is based on Contexts and the Chosen Number of Action with Minimal Estimation, and is named Con-CNAME. Almost all multi-armed bandit algorithms use the average reward of an action as its estimate. Inspired by the prior probability of contexts, we put forward another kind of estimate: the chosen probability. Starting from an initial probability distribution over actions, we adjust the chosen probability of every selected action according to the actual user feedback. Specifically, when the reward is 1 after choosing an action, i.e. the recommended article is clicked by the user, the chosen probability of this action is increased; when the reward is 0, i.e. the recommended article is not clicked, the chosen probability of this action is not updated. Combining the chosen probability of actions and the prior probability of contexts, this paper proposes the Con-CNAME algorithm, which introduces a weight $\beta$ to control the influence of the chosen probability and the prior probability. The framework of the Con-CNAME algorithm is shown in Fig. 1.

Here, the prior probability is based on the Naive III algorithm, which defines a user group by a set of features that individual users may have in common [21]. We define

$$\mathrm{clicks}[a][i] = \sum_{t} x_t(i)\, g_t \quad \text{and} \quad \mathrm{selections}[a][i] = \sum_{t} x_t(i) \quad \text{for } a_t = a,$$

where each context $x_t$ contains binary vectors indicating the user's contextual features, such as gender, age and language, and $g_t$ is the user click status (1 if the article obtained a click and 0 otherwise). The article recommended by the Naive III algorithm at trial $t$ is

$$a_t = \arg\max_{a} \sum_{i \neq 0} P_t(a, i), \quad \text{where } P_t(a, i) = \frac{\mathrm{clicks}[a][i]}{\mathrm{selections}[a][i]}.$$

Different from the Naive III algorithm, Con-CNAME combines the prior probability $\sum_{i \neq 0} P_t(a, i)$ and the chosen probability $S_t(a)$ through the weight $\beta$. During exploitation, the article recommended by Con-CNAME at trial $t$ is

$$a_t = \arg\max_{a} \Big( \beta S_t(a) + (1 - \beta) \sum_{i \neq 0} P_t(a, i) \Big).$$

Besides, different from most contextual bandit algorithms, Con-CNAME

Con-CNAME


Fig. 1. The framework of Con-CNAME algorithm

explores randomly according to an exploration probability, which is updated based on user feedback (the classic estimation $Q_t(a)$), as proposed in our context-free algorithm CNAME [23]. The full Con-CNAME algorithm is as follows:


Con-CNAME starts by setting the parameters $w$, $\alpha$ and $\beta$ (Line 1), where $w \in (0,1)$ affects the speed at which the exploration probability changes, the learning rate $\alpha \in (0,1)$ affects the update of the chosen probability, and the weight $\beta$ controls the proportion of chosen probability and prior probability in Values. After initializing the estimations (the classic estimation $Q(a)$ and our proposed chosen probability $S(a)$) and the chosen number $N(a)$ of each action $a$ (Lines 2–6), it initializes the click vector and selection vector for every $a$ not in possibleActions (Lines 7–10). Here, possibleActions is the list of actions (articles) that are available to the user during that particular visit. The elements clicksFeature[a][i] and selectionsFeature[a][i] of clicksFeature[a] and selectionsFeature[a] represent clicks[a][i] and selections[a][i], respectively. It then calculates the prior probability and Values(a) for every action $a$ in possibleActions (Lines 11–14). Con-CNAME iteratively chooses an action to play (i.e. recommends an item, in recommender-system terms) based on the exploration probability (Lines 15–18) and receives a reward $X_{a_t,t}$ (Line 19). The exploration probability $w/(w + m_t^2)$ is adjusted according to $m_t$, the chosen number of the action with minimal estimated payoff. The chosen probability $S_t(a_t)$ is increased only when $X_{a_t,t} > 0$ (Lines 20–22). Finally, the chosen number and the classic estimation are updated at time step $t$ (Lines 23–24).

There are three key points of the Con-CNAME algorithm. Firstly, Values includes the chosen probability (based on user feedback) and the prior probability (based on contexts). Secondly, different from most contextual algorithms, Con-CNAME keeps the exploration process, which also helps to address the cold-start problem. Besides, exploration may bring surprises to users and helps to learn users' interests. Thirdly, since many unexpected situations arise in practice, Con-CNAME can still work normally as the CNAME algorithm if the contextual information suddenly becomes unobtainable.

Similar to our context-free Asy-CNAME algorithm, Con-CNAME can be updated in an asynchronous manner. Asynchronous updating weakens the impact of the user's short-term behavior to a certain extent, which helps to improve the CTR. In addition, asynchronous updating reduces the implementation complexity, which helps to decrease the calculation time.
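The steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The text does not give the exact update rule for the chosen probability or its initial value, so the sketch assumes that $S(a)$ starts at `initial_s` and moves towards 1 at learning rate $\alpha$ after a click. It also assumes that the prior is summed over the features active in the current visit, which is our reading of the sum over $i \neq 0$. The class and method names are hypothetical.

```python
import random
from collections import defaultdict

class ConCNAME:
    """Illustrative Con-CNAME sketch; parts marked ASSUMPTION are not from the paper."""

    def __init__(self, w=0.1, alpha=0.8, beta=0.8, initial_s=0.0, rng=None):
        self.w, self.alpha, self.beta = w, alpha, beta
        self.rng = rng or random.Random()
        self.Q = defaultdict(float)              # classic estimate: average reward
        self.S = defaultdict(lambda: initial_s)  # chosen probability (ASSUMPTION: initial value)
        self.N = defaultdict(int)                # chosen number of each action
        self.clicks = defaultdict(lambda: defaultdict(float))
        self.selections = defaultdict(lambda: defaultdict(float))

    def prior(self, action, features):
        # ASSUMPTION: sum P(a, i) over the features active in this visit.
        total = 0.0
        for i in features:
            shown = self.selections[action][i]
            if shown > 0:
                total += self.clicks[action][i] / shown
        return total

    def value(self, action, features):
        # Values(a) = beta * S(a) + (1 - beta) * prior probability
        return self.beta * self.S[action] + (1.0 - self.beta) * self.prior(action, features)

    def exploration_probability(self, possible_actions):
        # m_t is the chosen number of the action with minimal estimated payoff.
        worst = min(possible_actions, key=lambda a: self.Q[a])
        m_t = self.N[worst]
        return self.w / (self.w + m_t ** 2)

    def choose(self, possible_actions, features):
        if self.rng.random() < self.exploration_probability(possible_actions):
            return self.rng.choice(list(possible_actions))   # explore at random
        return max(possible_actions, key=lambda a: self.value(a, features))

    def update(self, action, features, reward):
        if reward > 0:
            # ASSUMPTION: move S towards 1 at learning rate alpha after a click.
            self.S[action] += self.alpha * (1.0 - self.S[action])
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
        for i in features:
            self.selections[action][i] += 1
            self.clicks[action][i] += reward
```

With no observations, $m_t = 0$ and the exploration probability equals 1, so the sketch explores until the action with the lowest estimate has been chosen at least once. If `features` is empty, the prior term vanishes and the ranking depends on the chosen probability alone, which mirrors the fallback behaviour described in the third key point.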

4 Experimental Evaluation

Evaluating a contextual multi-armed bandit algorithm online has always been a challenging task, mainly due to the limited availability of data. Ideally, the evaluator wants datasets that explicitly contain the data forming the basis of the evaluation, such as changes in users' preferences, demographics, etc. In this section, the widely used user click log dataset of the Yahoo! Front Page Today Module is applied to evaluate the Con-CNAME algorithm. We discuss the influence of the key parameters $\alpha$ and $\beta$, and then provide reference ranges for these two parameters through simulation on the Yahoo! dataset. Furthermore, we compare the performance of our algorithm with other bandit algorithms.

4.1 Yahoo! Front Page Today Module User Click Log Dataset (R6B)

This dataset contains a fraction of the user click log for news articles displayed in the Featured Tab of the Today Module on the Yahoo! Front Page¹. It includes 15 days of data, from October 2 to 16, 2011, and some raw features. There are 28,041,015 user visits to the Today Module on Yahoo!'s Front Page, and each visit is a single line in the data file. For example, a data line is structured as a sequence of tuples as follows:

"1317513293 id-563643 0 |user 1 8 12 13 22 16 18 54 24 26 17 42 19 25 15 61 14 21 |id-552077 |id-555224 |id-555528 |id-559744 |id-559855 |id-560290 |id-560518 |id-560620 |id-563115 |id-563582 |id-563643 |id-563787 |id-563846 |id-563938 |id-564335 |id-564418 |id-564604 |id-565364 |id-565479 |id-565515 |id-565533 |id-565561 |id-565589 |id-565648 |id-565747 |id-565822"

Table 1. Meanings of tuples in the data line of Yahoo! R6B dataset

| Tuple of data line | Tuple's meaning |
|---|---|
| 1317513293 | Timestamp |
| id-563643 | Article ID |
| 0 | Click status |
| user | Start of user's contexts |
| 1 8 12 13 22 16 18 54 24 26 17 42 19 25 15 61 14 21 | User's contexts |
| \|id-552077 \|id-555224 … \|id-565822 | List of recommended articles |
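The record layout in Table 1 can be read with a few lines of Python. The sketch below is a hedged illustration: the function name `parse_r6b_line` and the dictionary keys are our own choices, and the example line is a shortened version of the data line quoted above rather than a complete record.

```python
def parse_r6b_line(line):
    """Split one R6B log line into its fields (layout as in Table 1)."""
    head, *sections = line.strip().split("|")
    timestamp, article_id, click = head.split()
    user_features = []
    candidate_articles = []
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        if tokens[0] == "user":
            # indices of the user's contextual features
            user_features = [int(tok) for tok in tokens[1:]]
        else:
            # every other section names one article available in this visit
            candidate_articles.append(tokens[0])
    return {
        "timestamp": int(timestamp),
        "article_id": article_id,
        "click": int(click),
        "user_features": user_features,
        "candidate_articles": candidate_articles,
    }

# Shortened example line (hypothetical truncation of the line shown above).
example = "1317513293 id-563643 0 |user 1 8 12 13 22 |id-552077 |id-563643 |id-565822"
record = parse_r6b_line(example)
```

In this representation, `record["article_id"]` is the displayed arm, `record["click"]` is the reward, and `record["candidate_articles"]` is the set of arms available for that visit.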

Table 1 shows the meaning of each tuple in a data line. Each timestamp is treated as a unique user. The article ID corresponds to an arm (action) in the multi-armed bandit problem. The click status has two values: 1 if the article is clicked by the user and 0 otherwise. The user's contexts start with the string "user", which is followed by binary vectors. The binary vectors indicate the user's contextual features, such as age and gender. The list of recommended articles contains the articles that are available to the user during this particular visit.

4.2 Influence of Key Parameters

In this part, we study the influence of two parameters of the Con-CNAME algorithm. The learning rate $\alpha$ affects the update of the chosen probability, and the weight $\beta$ controls the proportion of chosen probability and prior probability in Values. The performance of different parameter values, evaluated through CTR, is shown in Table 2. We make recommendations over the first 200000 lines and the first 12200000 lines, respectively. In the experimental setting, we varied both parameters from 0.1 to 0.9 in steps of 0.1 and found that the variations in the results were not pronounced, so we report the final results for 0.2 and 0.8 only. The two values of $\alpha$ and $\beta$ are tested with 0.5 as a benchmark. In detail, $\alpha = 0.2$ means that the

¹ https://webscope.sandbox.yahoo.com

Table 2. Obtained CTR with different parameter values

| Lines | α | β = 0.2 | β = 0.8 |
|---|---|---|---|
| 200000 | 0.2 | 0.0499 | 0.0461 |
| 200000 | 0.8 | 0.0489 | 0.0469 |
| 12200000 | 0.2 | 0.0718 | 0.0719 |
| 12200000 | 0.8 | 0.0719 | 0.0721 |

Fig. 2. CTR obtained over the ﬁrst 87400000 lines of Yahoo! R6B dataset

chosen probability increases slightly each time, while $\alpha = 0.8$ means that it increases greatly each time. $\beta = 0.2$ means that the context has a larger impact on recommendations than the chosen probability, while $\beta = 0.8$ means that the context has a smaller impact. In Table 2, the best results are highlighted in boldface. We can see that the value of the learning rate $\alpha$ has no obvious impact on the CTRs. This is easy to explain: no matter how fast the chosen probability is updated each time, every updated chosen probability increases to the same degree. From the results in Table 2, we can see that when the number of lines is small, little context information is obtained through parameter $\alpha$, and the CTR is influenced mostly by $\beta$. The weight $\beta$ has an important influence on the obtained CTRs: a larger value of $\beta$ brings a lower CTR over the first 200000 lines. When the number of processed lines increases to 12200000, a larger value of $\beta$ is more likely to obtain a higher CTR, but the influence of $\beta$ is relatively weaker. In detail, $\beta = 0.2$ means that the context information contributes significantly; the results for the smaller number of lines imply that the context information is fully exploited, whereas with an increasing number of lines it leads to a local optimum.

In order to confirm the stability of the Con-CNAME algorithm, we make recommendations over the first 87400000 lines of the Yahoo! R6B dataset with $\alpha = 0.8$ and $\beta = 0.8$. Figure 2 shows the CTR obtained by Con-CNAME. At the beginning, the CTR grows fast. The growth then slows down as the number of lines increases, but the CTR still keeps increasing. Hence, the Con-CNAME algorithm can be applied to personalized recommendations.

4.3 Performance Comparison

In this section, we compare the performance of various MAB-based approaches on recommendations for large-scale recommender systems. The Random approach chooses an item at random each time and can be seen as the benchmark for the other approaches. The Most click approach always recommends the article that has obtained the most clicks. The Click rate approach recommends the article with the highest CTR. The Contextual click approach makes a recommendation according to the contextual information of the clicked article. The Naive III and Linear Bayes algorithms are also based on context information. In addition to context-aware approaches, we compare Con-CNAME with our context-free approaches CNAME and Asy-CNAME. The CTR performance of these 9 approaches is summarized in Table 3, and Table 4 shows the relative variance of Con-CNAME over the other approaches, where the best results are highlighted in boldface.

Table 3. Performance in CTR on the Yahoo! R6B dataset

| Algorithm | 2.0 × 10^5 lines | 3.6 × 10^6 lines | 7.2 × 10^6 lines | 1.06 × 10^7 lines | 1.4 × 10^7 lines |
|---|---|---|---|---|---|
| Random | 0.036 | 0.034 | 0.034 | 0.034 | 0.034 |
| Most click | 0.047 | 0.043 | 0.043 | 0.043 | 0.042 |
| Click rate | 0.046 | 0.068 | 0.068 | 0.069 | 0.070 |
| Contextual click | 0.040 | 0.068 | 0.070 | 0.071 | 0.072 |
| Linear Bayes | 0.033 | 0.034 | 0.034 | 0.034 | 0.034 |
| Naive III | 0.047 | 0.066 | 0.067 | 0.068 | 0.069 |
| CNAME | 0.043 | 0.067 | 0.069 | 0.070 | 0.071 |
| Asy-CNAME | 0.044 | 0.068 | 0.069 | 0.070 | 0.072 |
| Con-CNAME | 0.047 | 0.069 | 0.071 | 0.072 | 0.073 |

Table 4. Relative variance of Con-CNAME over other comparison approaches in CTR on the Yahoo! R6B dataset

| Algorithm difference | 2.0 × 10^5 lines | 3.6 × 10^6 lines | 7.2 × 10^6 lines | 1.06 × 10^7 lines | 1.4 × 10^7 lines |
|---|---|---|---|---|---|
| Con-CNAME over Random | 31% | 103% | 109% | 112% | 115% |
| Con-CNAME over Most click | 0% | 60% | 65% | 67% | 74% |
| Con-CNAME over Click rate | 2% | 1% | 4% | 4% | 4% |
| Con-CNAME over Contextual click | 18% | 1% | 1% | 1% | 1% |
| Con-CNAME over Linear Bayes | 42% | 103% | 109% | 112% | 115% |
| Con-CNAME over Naive III | 0% | 5% | 6% | 6% | 6% |
| Con-CNAME over CNAME | 9% | 3% | 3% | 3% | 3% |
| Con-CNAME over Asy-CNAME | 7% | 1% | 3% | 3% | 1% |
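The term "relative variance" is not defined in the text. The entries of Table 4 are, however, consistent with the relative CTR gain $(\mathrm{CTR}_{\text{Con-CNAME}} - \mathrm{CTR}_{\text{other}}) / \mathrm{CTR}_{\text{other}}$ computed from Table 3. The short Python sketch below illustrates this reading for a few entries. The function name is our own, and because Table 3 is rounded to three decimals, recomputed values may differ from Table 4 by about one percentage point.

```python
# CTR values copied from Table 3 for the first and last checkpoints.
ctr_table = {
    "2.0e5": {"Random": 0.036, "Most click": 0.047, "Linear Bayes": 0.033, "Con-CNAME": 0.047},
    "1.4e7": {"Random": 0.034, "Most click": 0.042, "Linear Bayes": 0.034, "Con-CNAME": 0.073},
}

def relative_gain(ours, other):
    """Relative CTR improvement of `ours` over `other`, as a fraction."""
    return (ours - other) / other

for lines, ctrs in ctr_table.items():
    for name, ctr in ctrs.items():
        if name != "Con-CNAME":
            gain = relative_gain(ctrs["Con-CNAME"], ctr)
            print(f"{lines:>6}  Con-CNAME over {name:<12} {gain:6.1%}")
```

For example, at $2.0 \times 10^5$ lines the gain over Random is $(0.047 - 0.036) / 0.036 \approx 0.306$, which rounds to the 31% reported in Table 4.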

As shown in Tables 3 and 4, the Con-CNAME algorithm obtains the highest CTRs over the first 200000 to 14000000 lines with $\alpha = 0.8$ and $\beta = 0.8$. The CTRs obtained by Con-CNAME are significantly higher than those of the Random, Most click and Linear Bayes algorithms, and slightly higher than those of the Click rate, Contextual click and Naive III algorithms. Besides, compared with CNAME and Asy-CNAME, the Con-CNAME algorithm further improves the CTR by using contextual information.

As more data are processed, the CTR obtained by the Most click approach does not continue to increase but shows a downward trend. The Most click algorithm makes recommendations based only on clicks, so the recommended article is always a popular one, and the recommended articles become monotonous or repetitive in content. The Click rate approach makes use of user feedback and obtains higher CTRs than the Most click approach. On the other hand, the Click rate approach obtains a higher CTR than the Contextual click approach and the Naive III algorithm at the beginning of the experiment. As the number of lines increases, the Contextual click approach and the Naive III algorithm learn more about users' interests, so the recommended articles are more likely to match those interests. The CTRs of the Contextual click approach and the Naive III algorithm therefore catch up with and surpass the CTRs of the Click rate approach in the later stages of the experiment.

To sum up, user feedback and contextual information both help to improve the CTR, each with a different emphasis. Making recommendations based on user feedback tends to maximize the short-term reward, which easily falls into a local optimum. Based on contextual information, users' interests can be learned better in the long run as more contexts are observed. The Con-CNAME algorithm combines user feedback and contextual information, and thereby achieves the highest CTRs.

5 Conclusion

In this paper, we study recommender systems based on contextual MAB problems. The Con-CNAME algorithm makes good recommendations by combining user feedback and contextual information. The cold-start problem is addressed by continuous exploration and contexts in our approach. Different from the classic contextual MAB algorithms, our algorithm keeps exploring. Moreover, the Con-CNAME algorithm can still work normally as the CNAME algorithm if the contexts become unobtainable during a sudden emergency. The influence of the key parameters of our algorithm is discussed, and the performance of our algorithm is compared with that of other MAB-based recommendation approaches on the Yahoo! Front Page Today Module user click log dataset. Experimental results show that our algorithm outperforms the other algorithms in terms of CTR. The Con-CNAME algorithm is effective and stable for personalized recommender systems. Although our algorithm achieves significant results, a possible improvement is to update it in an asynchronous manner.

References

1. Sutton, R.S., Barto, A.G.: Introduction to reinforcement learning. Mach. Learn. 16(1), 285–286 (2005)
2. Li, S., Karatzoglou, A., Gentile, C.: Collaborative filtering bandits. In: International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 539–548 (2016)
3. Eghbali, S., Ashtiani, M.H.Z., Ahmadabadi, M.N., et al.: Bandit-based structure learning for bayesian network classifiers. In: International Conference on Neural Information Processing, pp. 349–356 (2012)
4. Resnick, P., Varian, H.R.: Recommender systems. Commun. ACM 40(3), 56–58 (1997)
5. Balabanović, M., Shoham, Y.: Fab: content-based, collaborative recommendation. Commun. ACM 40(3), 66–72 (1997)
6. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2), 235–256 (2002)
7. Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click behavior. In: International Conference on Intelligent User Interfaces, pp. 31–40 (2010)
8. Dhanda, M., Verma, V.: Personalized recommendation approach for academic literature using high-utility itemset mining technique. Progress in Intelligent Computing Techniques: Theory, Practice, and Applications (2018)
9. Schein, A.I., Popescul, A., Ungar, L.H., et al.: Methods and metrics for cold-start recommendations. In: Proceedings of ACM SIGIR Conference on Research & Development in Information Retrieval, vol. 39(5), 253–260 (2002)
10. Mary, J., Gaudel, R., Philippe, P.: Bandits warm-up cold recommender systems. Computer Science (2014)
11. Tang, L., Jiang, Y., Li, L., Li, T.: Ensemble contextual bandits for personalized recommendation. In: RecSys, pp. 73–80 (2014)
12. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749 (2005)
13. Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. J. Mach. Learn. Res. 6(1), 1265–1295 (2005)
14. Ren, Z., Krogh, B.H.: State aggregation in markov decision processes. In: IEEE Conference on Decision and Control, pp. 3819–3824 (2002)
15. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
16. Cesa-Bianchi, N., Fischer, P.: Finite-time regret bounds for the multi-armed bandit problem. In: ICML, pp. 100–108 (1998)
17. Bubeck, S., Slivkins, A.: The best of both worlds: stochastic and adversarial bandits. J. Mach. Learn. Res. 23(42), 1–23 (2012)
18. Adomavicius, G., Tuzhilin, A.: Context-aware recommender systems. In: Recommender Systems Handbook, pp. 191–226 (2015)
19. Adomavicius, G., Sankaranarayanan, R., Sen, S., Tuzhilin, A.: Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. Inf. Syst. 23(1), 103–145 (2005)
20. Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: World Wide Web, pp. 661–670 (2010)
21. Song, L., Tekin, C., Schaar, M.V.D.: Online learning in large-scale contextual recommender systems. IEEE Trans. Serv. Comput. 9(3), 433–445 (2016)
22. Jośe, A.M.H., Vargas, A.M.: Linear bayes policy for learning in contextual-bandits. Expert Syst. Appl. 40(18), 7400–7406 (2013)
23. Zhou, Q., Zhang, X.F, Xu, J., et al.: Large-scale bandit approaches for recommender systems. In: International Conference on Neural Information Processing, pp. 811–821 (2017)

Real-Time Session-Based Recommendations Using LSTM with Neural Embeddings

David Lenz¹(B), Christian Schulze², and Michael Guckert²

¹ Fachbereich Wirtschaftswissenschaften, Justus-Liebig-Universität Gießen, Giessen, Germany
[emailprotected]
² KITE - Kompetenzzentrum für Informationstechnologie, Technische Hochschule Mittelhessen, Friedberg, Germany
{christian.schulze,michael.guckert}@mnd.thm.de

Abstract. Recurrent neural networks have successfully been used as core elements of intelligent recommendation engines in e-commerce platforms. We demonstrate how LSTM networks can be applied to recommend products of interest for a customer, based on the events of the current session only. Inspired by recent advances in natural language processing, our network computes vector space representations (VSR) of available products and uses these representations to derive predictions of user behaviour based on the clickstream of the current session. The experimental results suggest that the Embedding-LSTM is well suited for session-based recommendations, thus offering a promising method for attacking the user cold start problem. A live test gives proof that our LSTM model outperforms a recommendation model created with traditional methods. We also show that providing the learned VSR as features to neighbourhood-based methods leads to improved performance as compared to standard nearest neighbour methods.

Keywords: LSTM · Neural embeddings · Session-based recommendations · Real-time recommendations

1 Introduction

Real-time session-based recommendations are becoming increasingly important for state-of-the-art e-commerce platforms. Recommendation systems predict useful items for users, providing them with a richer experience and consequently increasing the success of the website [10]. Conventional approaches for recommendation systems typically use collaborative filtering (CF) or content-based filtering (CBF).

This work was partially funded by LOEWE HA project PAROT (no. 509/16-21, State Offensive for the Development of Scientific and Economic Excellence).

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 337–348, 2018. https://doi.org/10.1007/978-3-030-01421-6_33


D. Lenz et al.

Whenever a user rejects cookies or has not visited the website before, only information from the current session can be exploited. No historic data about previous purchases is available, and the challenge of creating recommendations in this situation is referred to as the (user) cold-start problem [13]. Session-based recommendations (SBR) view the user as anonymous, and algorithms can only make use of implicit user feedback, since no explicit ratings are available. Nevertheless, SBR plays an increasingly important role for modern e-commerce websites. As collaborative filtering methods rely heavily on historic data, they are not applicable in session-based recommendation settings. The alternative of item-to-item recommendations often has only myopic access to the items clicked and cannot exploit context-dependent preferences of the user. Unlike most of the previously published research in SBR, we focus on deep learning techniques to predict products of interest based on the current browsing session. We use vector space representations (VSR) of the items as input to the network, as has recently been done with words in the context of NLP [9]. Different from typical VSR implementations, which use an additional network for learning, our approach does not require pre-training of product features and can be learned in an end-to-end fashion, i.e. the product embeddings and network weights are learned simultaneously. We show that the Embedding-LSTM model provides more accurate and more diverse recommendations than other frequently used SBR approaches when applied to real-world datasets. In a case study, we demonstrate this by deploying an Embedding-LSTM into a production environment to capture live feedback from users. The ability to test models online gives access to new metrics beyond those mostly used in the literature, e.g. we can evaluate model performance based on how much revenue is generated by the recommendations. From an economic and business point of view, this seems more relevant than merely counting the products an algorithm can correctly recommend. We demonstrate the applicability of the proposed approach with an LSTM that generates higher click counts from a higher number of overall users, sells more products and creates higher overall revenue than the currently implemented association rule model.

2 Related Work

The idea of using RNNs for session-based recommendations has gained much attention recently. [5] demonstrated the general applicability of RNNs in SBR and their improved performance in comparison to widely used approaches. [6] showed how additional information can significantly improve the performance of RNN-based recommendation systems. We improve these approaches with an extended architecture using product embeddings, as done by Barkan and Koenigstein [1], who propose a collaborative filtering method based on neural network embeddings which they call item2vec. The authors use the Word2Vec algorithm to learn product embeddings, similar to our approach. However, they follow the

Embedding LSTM for Session Based Recommendations


original shallow architecture. This means that they discard the sequential information, since they represent the history of events as a vector with only a single time step. They predict the items most similar to the current item based on the cosine similarity between the calculated VSRs. Their results were better than those of the baseline model, item-based collaborative filtering using singular value decomposition. [12] show that increasing the amount of training data through data augmentation techniques improves the performance of RNNs in SBR. [2] uses the item dwell time as an additional indicator of interest in specific items. The authors show that this leads to increased performance compared to the pure sequence-based approach.

3 Session Based Recommendations with Embedding-LSTM

In order to achieve reasonable recommendations, all information contained in a given sequence of events must be processed. Using recurrent neural networks allows us to frame the recommendation problem as a sequence prediction problem. Not only do we take into account the set of previously clicked items, but we also consider the order in which the items appeared, thus explicitly capturing how the preference of a user evolved over time. In Long Short-Term Memory (LSTM) [7] networks, the hidden layer activation is split into multiple interacting computations. Using an elaborate architecture of gated cells, LSTMs can keep information over a series of input steps. LSTMs have successfully been applied to model temporal and sequential aspects of data, e.g. in machine translation [11], and are an adequate tool for analysing sequential user activity.

Learning vector space representations of products makes it possible to capture fine-grained relationships and regularities between products. We use an embedding method to represent products in a continuous high-dimensional vector space in which multiple relations for a single product can be represented. We therefore expect to capture the sequential relationships between items in such item vectors. The VSRs replace the extremely sparse one-hot encoding by a dense vector representation. The reduction of computational complexity compared to a one-hot encoding scheme is one of the benefits of this approach. Additionally, the learned embeddings can be reused as meaningful representations of input data for other machine learning models, e.g. we experimentally demonstrate how the learned VSR can be used as input to nearest neighbour recommendation methods to improve the recommendation performance. Investigating the learned embeddings also reveals more about relationships between products and the reasoning of the model.
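The relation between the sparse one-hot input and the dense VSR can be illustrated with a small pure-Python sketch. It shows that looking up a row of an embedding matrix gives the same vector as multiplying the one-hot encoding by that matrix, while touching only one row instead of all of them. The sizes, the random matrix (a stand-in for learned weights) and the example session are assumptions made for illustration; in the model described here the embedding matrix is learned jointly with the LSTM weights.

```python
import random

def one_hot(index, size):
    """Sparse input encoding: a single 1 at the item's position."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def row_times_matrix(vector, matrix):
    """Compute vector @ matrix for a row vector and a (size x dim) matrix."""
    dim = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(dim)]

rng = random.Random(0)
num_items, embedding_dim = 6, 3   # toy sizes, chosen for illustration only
# Random values stand in for learned embedding weights.
embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(num_items)]

session = [4, 1, 4, 2]            # item indices in click order (hypothetical)
dense_inputs = [embeddings[item] for item in session]   # embedding lookup
via_one_hot = [row_times_matrix(one_hot(item, num_items), embeddings) for item in session]
```

The list `dense_inputs` is the sequence that a recurrent layer would consume, one vector per click.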

4 Experiments

4.1 Datasets

Our dataset contains data collected in three different web-shops, denoted A, B and C, covering a period of roughly 9 months (November 2016 to August 2017). Table 1 details descriptive statistics of the datasets. Users is the number of unique users, Items the number of unique products, and Observations the number of recorded events per web-shop. Avg observations per item shows the average number of interactions between users and items, and Avg daily events is the mean number of events recorded per day.

Table 1. Descriptive statistics

| Data set | A | B | C |
|---|---|---|---|
| User Sessions | 701,773 | 1,296,748 | 4,396,280 |
| Items | 5,937 | 7,829 | 8,999 |
| Observations | 1,679,144 | 3,820,461 | 13,837,585 |
| Avg observations per item | 283 | 488 | 1537 |
| Avg daily events | 6261 | 14,250 | 51,623 |
| Max date | 2017/07/25 | 2017/07/25 | 2017/07/25 |
| Min date | 2016/10/31 | 2016/10/31 | 2016/10/31 |

We evaluate the performance of our model for three different dictionary sizes D ∈ {500, 2000, 5000}. For each D, the D most popular items are identified. Popularity-based pre-filtering is common in practical RS, since discarding unpopular products has negligible effects on the evaluations [6]. The network then has D output nodes in a softmax layer and is trained with a cross-entropy loss function.

4.2 Baseline Algorithms

The following algorithms are commonly used baselines for session-based recommendations [5,6].

– POP: The popularity predictor always recommends the most popular items in the training set.
– Item-KNN (I-KNN): This approach is inspired by traditional content-based filtering methods for session-based recommendations and recommends items similar to the item currently viewed. Similarity is measured based on the co-occurrence of items within sessions and calculated for each pair of products. See [5] for details. During inference, the top k items with the highest similarity to the current item are selected for recommendation. Item-KNN is one of the most commonly used item-to-item solutions in real-world systems [5].
– Embedding-KNN (E-KNN): E-KNN also recommends products based on their similarity. The content-based filtering approach uses learned vector space representations (VSR) extracted from an Embedding-LSTM and reuses them as features. The cosine similarity of two VSRs is used as the similarity measure for the corresponding products.
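Two of the baselines listed above can be sketched in a few lines of Python. The code is an illustration under assumptions, not the implementation used in the experiments. The function names and toy data are hypothetical, and the embedding vectors are hand-picked stand-ins for VSRs that would be extracted from a trained Embedding-LSTM. I-KNN follows the same ranking scheme with a co-occurrence-based similarity (see [5]) and is not shown.

```python
import math
from collections import Counter

def pop_recommend(training_sessions, k):
    """POP: the k most frequent items in the training sessions."""
    counts = Counter(item for session in training_sessions for item in session)
    return [item for item, _ in counts.most_common(k)]

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def eknn_recommend(current_item, embeddings, k):
    """E-KNN: the k items whose vectors are most similar to the current item."""
    query = embeddings[current_item]
    others = [item for item in embeddings if item != current_item]
    others.sort(key=lambda item: cosine(query, embeddings[item]), reverse=True)
    return others[:k]

# Toy data for illustration only.
sessions = [["shirt", "socks"], ["shirt", "shoes"], ["lamp"]]
vectors = {
    "shirt": [1.0, 0.1],
    "socks": [0.9, 0.2],
    "shoes": [0.7, 0.6],
    "lamp": [0.0, 1.0],
}
```

Calling `eknn_recommend("shirt", vectors, 2)` ranks the remaining items by cosine similarity to the vector of the item currently viewed and returns the two closest ones.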

4.3 Metrics

The top-k metric is similar to the sps metric described in [4]. The top-k metric for a single example is one if the next click of a user appears in the top-k recommendations provided by the model, and zero otherwise. Formally, this can be written as shown in Eq. 1.

$$\text{top-}k = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1, & \text{if } y_i \in \hat{y}_{k,i} \\ 0, & \text{else} \end{cases} \tag{1}$$

For a given example $i$, $\hat{y}_{k,i}$ denotes the top $k$ items with the highest predicted probability, $y_i$ is the actual target item, and $n$ is the number of examples. We measure the top-k for k ∈ {1, 10, 20, 30}. For k = 1 the top-k metric equals the accuracy, i.e. the fraction of examples for which the recommendation with the highest probability is indeed the correct label. The reciprocal rank metric measures the position of the relevant item in the list of recommended items. This is important in cases where the order of recommendations is relevant, for example if the lower-ranked items are only visible after scrolling. The mean reciprocal rank (MRR) is the average of the reciprocal ranks over all examples and is calculated as

$$\mathrm{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{rank}_i} \tag{2}$$

with $n$ denoting the number of examples and $\mathrm{rank}_i$ the position in the recommendation list at which the correct item occurred.
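Both metrics are straightforward to compute from ranked recommendation lists. The following Python sketch is a minimal illustration of Eqs. 1 and 2. The function names and toy data are our own, and the treatment of a target that does not appear in the list (a contribution of 0 to the MRR) is an assumption, since the text does not specify it.

```python
def top_k_accuracy(targets, ranked_predictions, k):
    """Fraction of examples whose target is among the first k recommendations (Eq. 1)."""
    hits = sum(1 for y, ranked in zip(targets, ranked_predictions) if y in ranked[:k])
    return hits / len(targets)

def mean_reciprocal_rank(targets, ranked_predictions):
    """Average of 1 / rank of the target item (Eq. 2); ranks start at 1."""
    total = 0.0
    for y, ranked in zip(targets, ranked_predictions):
        if y in ranked:
            total += 1.0 / (ranked.index(y) + 1)
        # ASSUMPTION: a target missing from the list contributes 0.
    return total / len(targets)

# Toy example: three sessions, each with a ranked list of three recommendations.
targets = ["b", "c", "a"]
ranked = [["b", "a", "c"], ["a", "b", "c"], ["b", "c", "d"]]
```

In this toy example the first target is ranked first, the second is ranked third and the third is missing, so the reciprocal ranks are 1, 1/3 and 0.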

5 Results

5.1 Embedding Visualization

Dimensionality reduction techniques such as t-SNE [8] convert high-dimensional embeddings into lower-dimensional data vectors while preserving local and global structures. Figure 1 shows the two-dimensional t-SNE representation of the learned item embeddings for web-shop C with a dictionary size D = 2000, in which points are coloured and annotated according to their product category. Housekeeping, Gardening & Recreation and Living are located close together in the representation; Textile, Underwear and Shoes are interconnected with each other and with Baby & Toddlers products; and Wellbeing is close to Shoes and Housekeeping. This shows that the learned embeddings align with intuition. Moreover, items that are intuitively considered similar from a perspective of taste are close in the embedding space, from which we conclude that the network is able to learn meaningful representations of items that can be used to produce valuable recommendations.


Fig. 1. 2D item embeddings from t-SNE for web-shop C (D = 2000). Two-dimensional representation of the embeddings from web-shop C (D = 2000) obtained with the t-SNE algorithm (perplexity = 12). Items are coloured according to their affinity to a product category.

5.2 Top-K

Table 2 provides the top-k and MRR metrics for our experiments. Results which we discuss in detail are printed in bold type. The table contains the results of the tested algorithms and the difference between LSTM and the second best competitor (column DIFF) for different shops, dictionary sizes D and values of k. LSTM outperforms the baseline algorithms in all data sets, with the POP algorithm being the weakest model throughout. As expected, results deteriorate with increasing dictionary size. Interestingly, I-KNN outperforms E-KNN in terms of plain accuracy (k = 1) in all data sets. However, E-KNN has an edge over I-KNN for all other k ∈ {10, 20, 30}, except in data sets (A, 500) and (A, 5000), where I-KNN is higher for k = 10. A possible interpretation of this effect is that I-KNN learned a more problem-specific solution (high top-1) compared to E-KNN, in which the VSR captured the structure of the problem on a more general level (better performance for all other k). The mean value of the differences between E-KNN and I-KNN is 2.39 pp., which can be interpreted such that using VSR as features in the nearest-neighbour approach improved the recommendations. Therefore, an LSTM does not have to be implemented in a production environment to benefit from its sequential knowledge; simply replacing the features in existing implementations with learned embeddings already improves the results.

Table 2. Metrics: Top-k and MRR

Shop  D     K     LSTM    E-KNN   I-KNN   POP     DIFF
A     500   1     11.18    4.75    7.69    0.62    3.48
A     500   10    38.51   24.59   26.03    6.01   12.48
A     500   20    48.84   33.32   32.93    9.99   15.53
A     500   30    54.96   38.75   38.49   13.80   16.22
A     500   MRR    0.20    0.18    0.08    0.01    0.02
A     2000  1      8.74    2.42    6.52    0.42    2.21
A     2000  10    31.91   22.62   21.87    2.96    9.28
A     2000  20    40.50   30.39   27.34    4.92   10.11
A     2000  30    45.94   35.22   31.79    6.74   10.72
A     2000  MRR    0.16    0.12    0.07    0.00    0.04
A     5000  1      6.54    3.55    6.10    0.24    0.44
A     5000  10    26.10   19.25   20.21    2.25    5.89
A     5000  20    35.31   26.26   25.48    3.87    9.06
A     5000  30    40.84   30.40   29.91    5.35   10.44
A     5000  MRR    0.13    0.12    0.06    0.00    0.01
B     500   1      6.21    4.13    5.02    0.85    1.19
B     500   10    28.54   22.94   17.74    6.71    5.60
B     500   20    39.54   32.57   23.98   11.71    6.98
B     500   30    46.96   38.64   29.41   16.19    8.32
B     500   MRR    0.14    0.11    0.05    0.01    0.03
B     2000  1      5.60    2.92    4.48    0.46    1.11
B     2000  10    23.78   16.67   14.71    3.38    7.11
B     2000  20    32.88   23.64   19.90    5.88    9.24
B     2000  30    38.46   27.80   24.36    8.06   10.66
B     2000  MRR    0.12    0.10    0.05    0.01    0.02
B     5000  1      5.01    2.63    4.40    0.29    0.61
B     5000  10    22.11   15.36   13.49    2.54    6.75
B     5000  20    30.97   21.85   18.14    4.49    9.12
B     5000  30    36.35   25.97   22.28    6.54   10.38
B     5000  MRR    0.11    0.10    0.04    0.00    0.01
C     500   1     10.79    4.46    7.40    0.89    3.39
C     500   10    38.33   30.22   25.41    6.44    8.11
C     500   20    48.59   39.88   31.65   11.51    8.70
C     500   30    54.69   45.43   38.02   15.95    9.26
C     500   MRR    0.20    0.14    0.08    0.01    0.05
C     2000  1      8.47    4.14    5.90    0.68    2.57
C     2000  10    31.12   23.07   19.53    3.15    8.05
C     2000  20    40.83   31.82   25.11    5.39    9.01
C     2000  30    46.73   36.80   30.08    7.11    9.94
C     2000  MRR    0.16    0.13    0.06    0.01    0.03
C     5000  1      7.03    3.83    5.41    0.43    1.62
C     5000  10    27.82   22.12   17.36    2.23    5.70
C     5000  20    36.92   30.17   22.35    3.74    6.75
C     5000  30    42.34   34.98   27.15    5.13    7.37
C     5000  MRR    0.14    0.13    0.05    0.00    0.01

5.3 MRR

The MRR can be translated back to the average position in the list of recommendations by taking 1/MRR. The result is detailed in the boxplot in Fig. 2. In the upper figure all algorithms are shown, while the lower figure leaves out POP to allow for a better comparison between the remaining algorithms. The inner line represents the median value, the edges of the box indicate the upper and lower quartile and the whiskers detail the extreme values. For the LSTM, the median position of the correct recommendation is 7.31, for the E-KNN 8.25, for the I-KNN 16.81 and for POP 198.11. During the live test of the model (next section), the number of recommendations shown at once is eight¹, so theoretically the LSTM would be the only algorithm where users would (given the median value) see the correct recommendation without using the slider. The LSTM also has the lowest uncertainty in recommendation quality, indicated by the smaller overall range of the box plot (4.350 for the LSTM compared to 4.595 for the E-KNN), so its results are the most stable over all datasets.
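The conversion from MRR to an average list position is simply the reciprocal, position = 1/MRR. The short sketch below applies it to the nine LSTM MRR values as printed in Table 2. Since the table reports MRR rounded to two decimals, the resulting median only approximates the value of 7.31 stated above; the variable names are illustrative.

```python
from statistics import median

# LSTM MRR per data set, as printed (rounded) in Table 2:
# (A,500), (A,2000), (A,5000), (B,500), ..., (C,5000)
lstm_mrr = [0.20, 0.16, 0.13, 0.14, 0.12, 0.11, 0.20, 0.16, 0.14]

# Average position of the correct item is the reciprocal of the MRR.
positions = [1.0 / value for value in lstm_mrr]
median_position = median(positions)

print(f"median position from rounded MRR values: {median_position:.2f}")
```

A median position below eight means that, in the median case, the correct item falls within the eight recommendations shown at once.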

Fig. 2. Average position of the correct recommendation over all datasets. The upper image displays all algorithms. In the lower image the POP predictor is not shown. Whiskers represent minimum and maximum values.

To see whether the predictions align with intuition, it is useful to visualize some example predictions. Here we only provide the recommendations from the LSTM model, as this is the model of interest. Results are visualized in Table 3. Each row holds a single example of inputs and predictions. Column Input on the left contains the inputs to the model, with the currently viewed item in column (x_t) and the previously viewed item in (x_{t−1}). Column Predictions contains the predictions, sorted from left to right in descending order of their probabilities, so that the leftmost item is the one most likely to be clicked.

In the first row we assume a user has only clicked on the product ‘cabinet’ before. Without sequential information and with only a single input, this is simply an item-to-item recommendation. The big cabinet in the top position is quite similar to the currently viewed cabinet. In line with intuition, more cabinets and commodes follow. In the second row another piece of furniture is clicked and sequential information is now available. The model accounts for this by adjusting the importance of different products. The big cabinet that was the top prediction in row 1 is again on top, now followed by a commode, which was not among the top 8 recommendations before, as second best recommendation. Evidently, the model found evidence that the commode is an important recommendation given the sequence of previous inputs. In the third row the currently viewed article is again the cabinet, but this time another product has been viewed before. Again the top prediction has not changed, but the previous article influences the order of the recommendations: the list of top recommendations changes, e.g. rank 8 has not been in the recommendation list before at all and rank 4 was not listed in row 2.

¹ This is dependent on the display size. Here we assume a 24 in. monitor.

Table 3. LSTM example predictions from (C, 5000). Columns: Input (x_{t−1}, x_t) and Predictions (Top 1 to Top 8).


In row 4 the previously viewed product is identical to row 3, but the currently viewed one is now a shoe. The model completely ignores the previous item and only recommends items that are similar to the currently viewed product. This might be explained by the fact that shoes ‘score’ strongly on other shoes, so the importance of other shoes for the current session is greater than the importance of more furniture.

5.4 Model Deployment

The best performing model for web shop C with D = 5000 has been deployed into production, giving us the opportunity to capture metrics which are typically not available without user interaction. The model was benchmarked in an A/B test for one week against the currently deployed prediction model, which uses association rule mining (see [3] for details). The baseline model creates predictions by exploiting user history (if available) combined with short-term predictions based on the current browsing session to come up with the final set of recommendations. Intuitively, this should give the baseline an advantage whenever the system can identify a user and access the explicit purchasing history of this user. Rather than predicting the next item to be clicked, the baseline model has been optimized to generate high revenue by increasing the importance of more expensive articles.

Table 4 provides an overview of the metrics for both models, each reported as a sum or an average. The recorded metrics are the number of users who clicked on a recommendation (Users), the number of clicks on recommendations (Clicks), the ratio of clicks per user (Clicks/User), the average price of the clicked items (Clicked Avg price), the total value of the clicked items (Clicked Price), the number of different products sold (Unique Products), the number of total sold products (Sold Quantity), the total revenue generated by the recommendations (Sold Value), the average price of sold products (Sold Avg price) and the average purchase value (Sold Avg Value).

27,290 users saw the LSTM recommendations and 27,836 users saw the baseline recommendations. The LSTM attracted nearly twice as many users (4,390 vs 2,508), who more than doubled the number of overall clicks (9,924 vs 4,495). Additionally, the users spent more time clicking through the recommendations, as can be seen from the higher Clicks/User metric (2.31 vs 1.81). The LSTM clicked price is 486,222 € higher and the average clicked price is 15.24 € above the baseline model. 95 additional unique products and 131 additional total products were sold through the recommendations generated by the LSTM model. This results in 18,131.4 € revenue generated by the model in one week, which is 2,238.7 € extra revenue compared to the currently deployed model. Extrapolated linearly over 52 weeks, the LSTM would generate 942,832.8 € in revenue, an increase of 116,412.4 € over the 826,420.4 € of the baseline².

² From a marketing perspective, an interesting metric is the revenue/click. However, this neglects the cost of running the systems, which is indeed high, so that looking only at the revenue/click does not incorporate all relevant costs and is therefore a skewed metric. Unfortunately, we cannot publish details about the associated cost structures.


Table 4. Live-test results

Metric              Baseline      LSTM
Users               2,508         4,390
Clicks              4,495         9,924
Clicks/User         1.81          2.31
Clicked Price       € 274,350     € 760,572
Clicked Avg price   € 70.26       € 85.50
Unique Products     298           393
Sold Quantity       339           470
Sold Value          € 15,892.7    € 18,131.4
Sold Avg price      € 52.52       € 42.42
Sold Avg Value      € 55.10       € 46.09
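The differences quoted in the text follow directly from Table 4. The sketch below recomputes them; the dictionary keys are illustrative names, and the yearly figure assumes a simple linear extrapolation of the one-week test over 52 weeks, which reproduces the yearly increase stated in the text.

```python
# One-week live-test figures from Table 4 (monetary values in euros).
baseline = {"users": 2508, "clicks": 4495, "unique_products": 298,
            "sold_quantity": 339, "sold_value": 15892.7}
lstm = {"users": 4390, "clicks": 9924, "unique_products": 393,
        "sold_quantity": 470, "sold_value": 18131.4}

# Absolute difference LSTM minus baseline for every metric.
diff = {key: lstm[key] - baseline[key] for key in baseline}

# Relative increase in clicks ("more than doubled").
click_ratio = lstm["clicks"] / baseline["clicks"]

# Assumption: every week of the year behaves like the test week.
extra_revenue_per_year = diff["sold_value"] * 52

print(diff["unique_products"], diff["sold_quantity"])
print(round(diff["sold_value"], 1), round(extra_revenue_per_year, 1))
```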

Interestingly, the average price of each unique product sold is over 10 € higher in the baseline model, and the average purchase value (users can buy several products at once) is 9.01 € higher in the baseline as well. This stands in contrast to the average click price, which is around 15 € higher for the LSTM. A possible explanation might be the explicit usage of user histories by the baseline model. Since known users have already interacted with the company, the initial interaction hurdle might be gone, so providing these users with improved recommendations can lead to a multiplying effect. Furthermore, the baseline model has been optimized to maximize revenue, while the LSTM was optimized to predict the next click. Another interpretation of the differences is that users enjoy the recommendations from the LSTM and curiously click on the products to learn more about them without the intention to actually buy something. As users signal interest by clicking on recommendations, the high click rate suggests that the LSTM architecture learned useful dependencies from the data to provide interesting recommendations.

6 Conclusion

We have demonstrated that recurrent neural networks can successfully be applied as real-time session-based recommendation engines. Our deep learning architecture outperformed standard algorithms in all metrics when applied to practice-relevant datasets. A live test provided evidence for the superiority of Embedding-LSTMs compared to the baseline model. Their recommendations lead to a significantly higher number of users with higher click rates and, in consequence, to an increase in products sold, thus generating a higher overall revenue. Furthermore, we showed the emergence of meaningful vector space representations for the products using an efficient end-to-end training approach. Our architecture enables smart marketing based on machine learning algorithms for a variety of customer-oriented businesses in a scalable way. Future research will focus on further improving the proposed architecture.


References

1. Barkan, O., Koenigstein, N.: Item2Vec: neural item embedding for collaborative filtering. CoRR abs/1603.04259 (2016). http://arxiv.org/abs/1603.04259
2. Dallmann, A., Grimm, A., Pölitz, C., Zoller, D., Hotho, A.: Improving session recommendation with recurrent neural networks by exploiting dwell time. ArXiv e-prints, June 2017
3. Davahri, M.: Kollaborative empfehlungssysteme im e-commerce. Technical report, Technische Hochschule Mittelhessen in cooperation with Dastani Consulting (2016)
4. Devooght, R., Bersini, H.: Collaborative filtering with recurrent neural networks. CoRR abs/1608.07400 (2016). http://arxiv.org/abs/1608.07400
5. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. CoRR abs/1511.06939 (2015). http://arxiv.org/abs/1511.06939
6. Hidasi, B., Quadrana, M., Karatzoglou, A., Tikk, D.: Parallel recurrent neural network architectures for feature-rich session-based recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys 2016, pp. 241–248. ACM, New York (2016). https://doi.org/10.1145/2959100.2959167. http://doi.acm.org/10.1145/2959100.2959167
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
8. van der Maaten, L., Hinton, G.: Visualizing high-dimensional data using t-SNE (2008)
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546 (2013). http://arxiv.org/abs/1310.4546
10. Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.): Recommender Systems Handbook, 1st edn. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-85820-3
11. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. Technical report arXiv:1409.3215 [cs.CL], Google (2014). NIPS 2014
12. Tan, Y.K., Xu, X., Liu, Y.: Improved recurrent neural networks for session-based recommendations. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pp. 17–22. ACM, New York (2016). https://doi.org/10.1145/2988450.2988452. http://doi.acm.org/10.1145/2988450.2988452
13. Yuan, J., Shalaby, W., Korayem, M., Lin, D., AlJadda, K., Luo, J.: Solving cold-start problem in large-scale recommendation engines: a deep learning approach. CoRR abs/1611.05480 (2016). http://arxiv.org/abs/1611.05480

Imbalanced Data Classification Based on MBCDK-means Undersampling and GA-ANN

Anping Song and Quanhua Xu

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
{apsong,beth0330}@shu.edu.cn

Abstract. The imbalanced classification problem arises in classification tasks where one class contains few samples while the other contains a great many. When traditional machine learning classification methods are applied to an imbalanced data set, the classification performance is poor and the time cost is high. A mini batch with cluster distribution K-means (MBCDK-means) undersampling method and a GA-ANN model are therefore proposed in this paper to solve these two problems. MBCDK-means chooses the samples according to the cluster distribution and the distance from the majority class clusters to the minority class cluster center. This technique keeps the original distribution of the clusters and increases the sampling rate of boundary samples, which helps to improve the final classification performance. At the same time, compared with the classic K-means clustering undersampling method, the presented MBCDK-means undersampling method has lower time complexity. The artificial neural network (ANN) is widely used in data classification, but it is easily trapped in a local minimum. For this reason we propose the genetic algorithm artificial neural network (GA-ANN), which uses a genetic algorithm to optimize the weights and biases of the neural network. GA-ANN achieves better performance than ANN. Experimental results on 8 data sets show the effectiveness of the proposed algorithm.

Keywords: Imbalanced classification · Clustering sampling · Artificial neural network · Genetic algorithm

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 349–358, 2018. https://doi.org/10.1007/978-3-030-01421-6_34

1 Introduction

The imbalanced classification problem refers to the pattern classification problem in which the number of training samples is distributed unevenly among classes [1]. When traditional classification methods are applied to imbalanced data, the classifier pays less attention to the minority class in order to improve the overall accuracy and thus tends to favor the majority class. As a result, minority class samples are difficult to identify, which leads to poor classification performance. The literature [2] shows that in some applications it is difficult to build a correct classifier when the class distribution imbalance ratio exceeds 1:35. Furthermore, in some applications it is already difficult to establish a correct classifier when the imbalance ratio reaches 1:10.

Data resampling is an effective way to solve the data imbalance problem. Resampling mainly comprises two methods: oversampling [3–5] and undersampling [6–8]. Oversampling artificially increases the number of minority class samples; however, this introduces redundant information. Undersampling balances the data set by reducing the number of majority class samples. Random undersampling (RUS) removes samples at random, which will probably lose important information. Recently, some resampling methods based on clustering have been discussed. After clustering, the data within a cluster are similar while the data in different clusters are dissimilar, which makes clustering appropriate for resampling. Lin et al. [8] applied K-means to undersampling. However, the time complexity of the K-means undersampling algorithm is high, especially on big data. Besides, the distribution of the clusters is not considered in this approach. MBCDK-means is proposed to solve these two problems.

At the same time, artificial neural networks are prevalent in classification tasks. Unfortunately, they are easily trapped in a local minimum. That is why we propose GA-ANN, in which a genetic algorithm is used to optimize the weights and biases of the neural network.

The rest of this paper is organized as follows. Section 2 presents the proposed method, including the construction of the model and the algorithm flow. Results, discussion and comparative analysis are given in Sect. 3. The final conclusion is drawn in Sect. 4.

2 Methodology

2.1 MBCDK-means Undersampling

MBCDK-means undersampling divides the majority class samples into k clusters, while the minority class samples form a separate cluster. Assuming that M is the number of majority class samples and $m_i$ is the number of samples in the ith cluster, then $M = \sum_{i=1}^{k} m_i$. Let $d_i$ denote the distance between the ith majority class cluster center and the minority class cluster center:

$$d_i = \sqrt{(X_i - X_N)^2} \qquad (1)$$

where $X_i$ represents the ith majority class cluster center and $X_N$ is the minority class cluster center. Then the average distance $d_{avg}$ from the majority class clusters to the minority class cluster can be defined as follows:

$$d_{avg} = \frac{1}{k} \sum_{i=1}^{k} d_i \qquad (2)$$

The number of samples $n_i$ that needs to be extracted from the ith cluster is calculated as follows:

$$n_i = \frac{N}{M} \cdot m_i + \frac{d_{avg}}{d_i} \qquad (3)$$
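The allocation rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Eq. 3 is read as n_i = (N/M)·m_i + d_avg/d_i, N is assumed to be the number of minority class samples (the text does not define it), and rounding the result and capping it at the cluster size are additions made here. All function and variable names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def samples_per_cluster(cluster_sizes, cluster_centers, minority_center, n_minority):
    """Number of samples to draw from each majority cluster (Eqs. 1 to 3)."""
    total_majority = sum(cluster_sizes)                          # M
    distances = [euclidean(center, minority_center)              # d_i
                 for center in cluster_centers]
    d_avg = sum(distances) / len(distances)                      # Eq. 2
    raw = [n_minority / total_majority * m + d_avg / d           # Eq. 3 (assumed reading)
           for m, d in zip(cluster_sizes, distances)]
    # Assumption: round to integers and never ask for more than a cluster holds.
    counts = [min(m, round(value)) for m, value in zip(cluster_sizes, raw)]
    return raw, counts


# Two majority clusters with 60 and 40 samples, 10 minority samples.
raw, counts = samples_per_cluster(
    cluster_sizes=[60, 40],
    cluster_centers=[(0.0, 0.0), (4.0, 0.0)],
    minority_center=(1.0, 0.0),
    n_minority=10,
)
```

In this toy case the first term keeps the draws proportional to the cluster sizes, while the second term adds more for the cluster that lies closer to the minority class center.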

The number of samples that needs to be extracted from each cluster is determined by the number of samples in the cluster and the distance between the majority class cluster center and the minority class cluster center. In this way the original distribution of the majority class samples is retained. At the same time, the number of sampled boundary samples of the majority class increases, which contributes to the identification of boundary samples.

Moreover, mini batch K-means uses mini batches to calculate the distances. The advantage of a mini batch is that it is not necessary to use all data samples in the calculation process. Instead, a small batch of samples is drawn in each iteration. The data in the batch are assigned to their nearest centroids, and each centroid is updated by averaging the data assigned to it. Over the iterations, the changes in the centroids gradually decrease; the calculation stops when the cluster centers are stable or the specified number of iterations is reached.

Let b be the batch size, k the number of clusters, m the number of samples, n the number of features and t the number of iterations. Since the number of clusters in the K-means undersampling algorithm is set to N, its time complexity is O(tNmn), whereas the time complexity of the MBCDK-means algorithm is O(tbkn). Whenever bk is much smaller than Nm, which holds for small batches and few clusters, the undersampling algorithm proposed in this paper is therefore faster than the K-means undersampling algorithm.
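The mini batch clustering step can be sketched as below. This is a minimal pure-Python illustration that assumes the widely used per-center running-mean update for mini batch K-means; the paper does not spell out its exact update rule, and all names are hypothetical. Each iteration touches only b samples and compares them with k centers in n dimensions, which is where the O(tbkn) cost comes from.

```python
import random


def mini_batch_kmeans(points, k, batch_size, iterations, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k centers using mini batches."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # initial centers drawn from the data
    counts = [0] * k                                    # samples seen so far per center
    for _ in range(iterations):
        batch = rng.sample(points, min(batch_size, len(points)))
        for p in batch:
            # Assign the sample to its nearest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            # Running mean: the center stays the average of all samples assigned to it.
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * a for c, a in zip(centers[j], p)]
    return centers


# Two well separated blobs around (0, 0) and (10, 10).
rng = random.Random(1)
blob_a = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(200)]
blob_b = [(rng.uniform(9.5, 10.5), rng.uniform(9.5, 10.5)) for _ in range(200)]
centers = sorted(mini_batch_kmeans(blob_a + blob_b, k=2, batch_size=20, iterations=100))
```

A fixed iteration count is used here for brevity; stopping once the centers no longer move, as described in the text, would be a straightforward extension.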

2.2 GA-ANN

ANN is widely used in data classification, but it is easily trapped in a local minimum. A genetic algorithm can mitigate this problem. GA-ANN is therefore put forward, which uses a genetic algorithm to optimize the weights and biases of the neural network. GA-ANN mainly deals with two key problems, namely the encoding mapping from weights to chromosome bit strings and the fitness function of the genetic algorithm.

1. The encoding mapping from weights to chromosome bit strings. Consider a simple artificial neural network that has a input layer nodes, b hidden layer nodes and c output layer nodes. The neural network generates 4 matrices.

The weight matrix of input layer and hidden layer:

$$W = \begin{bmatrix} W_{11} & \cdots & W_{1b} \\ \vdots & \ddots & \vdots \\ W_{a1} & \cdots & W_{ab} \end{bmatrix}$$

The threshold value matrix of hidden layer:

$$a = \begin{bmatrix} a_1 \\ \vdots \\ a_b \end{bmatrix}$$

The weight matrix of hidden layer and output layer:

$$V = \begin{bmatrix} V_{11} & \cdots & V_{1c} \\ \vdots & \ddots & \vdots \\ V_{b1} & \cdots & V_{bc} \end{bmatrix}$$

The threshold value matrix of output layer:

$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_c \end{bmatrix}$$

GA is used to optimize these four matrices. For the GA operations, the four matrices are converted to chromosome strings. A binary string is used as the chromosome encoding: x chromosome bits represent one coefficient value, and x is chosen according to the range and accuracy required of the weights. The mapping relationship between chromosome bit strings and weight values is shown in Fig. 1.

Fig. 1. Mapping relationship between chromosome bit strings and weight values

2. The fitness function. The fitness function f of GA-ANN used to evaluate a chromosome is the area under the ROC curve (AUC). AUC is based on the concept of the confusion matrix, a matrix used to represent the outcome of sample identification in the binary classification case. Here the minority class is positive and the majority class is negative. TP indicates that a positive sample is predicted as positive; FN indicates that a positive sample is predicted as negative; FP indicates that a negative sample is predicted as positive, and TN indicates that a negative sample is predicted as negative. Each sample in the classification has a corresponding probability of belonging to each category. The final category prediction changes according to the threshold set on these probabilities. Each threshold corresponds to a pair of metrics (FPrate, TPrate), where FPrate is the false positive rate and TPrate is the true positive rate. FPrate and TPrate are defined as follows:

$$FPrate = \frac{FP}{FP + TN} \qquad (4)$$

$$TPrate = \frac{TP}{TP + FN} \qquad (5)$$

Then the fitness function f of GA-ANN is defined as follows:

$$f = AUC = \int_{0}^{1} TPrate \; d\,FPrate \qquad (6)$$
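As an illustration of Eqs. 4 to 6, the sketch below sweeps the decision threshold over the predicted scores, collects the (FPrate, TPrate) pairs and integrates them with the trapezoidal rule. The trapezoidal rule is one common way to evaluate the integral and is an assumption here, as are the function name and the toy data; labels use 1 for the minority (positive) class and 0 for the majority (negative) class.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]          # (FPrate, TPrate) for a threshold above all scores
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past all samples that share this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))   # Eqs. 4 and 5

    # Eq. 6: integrate TPrate over FPrate with the trapezoidal rule.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fitness = roc_auc(scores, labels)   # 8 of the 9 positive/negative pairs are ordered correctly
```

In a GA loop, this value would be computed from the network outputs obtained with the decoded weights of each chromosome.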

2.3 Imbalanced Classification Based on MBCDK-means Undersampling and GA-ANN

The flow of imbalanced classification based on MBCDK-means undersampling and GA-ANN is given in Fig. 2. The imbalanced dataset is split into training data and testing data. MBCDK-means is applied to the training data to obtain balanced training data. GA-ANN is then trained on the balanced data and evaluated on the testing data.
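Within the training step of this flow, each chromosome has to be decoded into the four matrices of Sect. 2.2 before its fitness can be evaluated. The sketch below shows one possible decoding. It assumes a linear mapping from an x-bit integer to a fixed weight range, which may differ from the mapping in Fig. 1; the range [-1, 1], the matrix order (W, a, V, b) and all names are hypothetical.

```python
def decode_gene(bits, low, high):
    """Map a bit string linearly onto the interval [low, high] (assumed mapping)."""
    value = int(bits, 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)


def decode_chromosome(chromosome, shapes, bits_per_weight, low=-1.0, high=1.0):
    """Split a bit string into matrices with the given (rows, cols) shapes."""
    matrices = []
    position = 0
    for rows, cols in shapes:
        matrix = []
        for _ in range(rows):
            row = []
            for _ in range(cols):
                gene = chromosome[position:position + bits_per_weight]
                row.append(decode_gene(gene, low, high))
                position += bits_per_weight
            matrix.append(row)
        matrices.append(matrix)
    return matrices


# Network with a = 2 input, b = 2 hidden and c = 1 output nodes:
# W is 2 x 2, hidden thresholds 2 x 1, V is 2 x 1, output thresholds 1 x 1.
shapes = [(2, 2), (2, 1), (2, 1), (1, 1)]
bits_per_weight = 4
chromosome = "1111" * 4 + "0000" * 2 + "1111" * 2 + "0000"
W, hidden_thresholds, V, output_thresholds = decode_chromosome(
    chromosome, shapes, bits_per_weight)
```

With x bits per coefficient the resolution is (high - low) / (2^x - 1), which is why the text ties x to the range and accuracy required of the weights.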

Fig. 2. Flow of Imbalanced classiﬁcation based on MBCDK-means undersampling and GA-ANN

3 Computer Experiment Results

3.1 Datasets

This article discusses three experimental studies. Eight datasets are used in these experiments. Six small-scale datasets are from the UCI machine learning repository. The imbalance ratio of these datasets is between 3.19 and 39.15, and the number of samples is between 214 and 1484 (Table 1). A European credit card transaction record dataset is also used in our experiments. There are 284,807 records in this dataset, of which only 492 are fraud records. The feature number is 30, and the imbalance ratio reaches 578.
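The imbalance ratio used here is the number of majority class samples divided by the number of minority class samples. As a brief check with the credit card figures quoted above (the function name is illustrative):

```python
def imbalance_ratio(n_majority, n_minority):
    """Majority class size divided by minority class size."""
    return n_majority / n_minority


total_records = 284807
fraud_records = 492                      # minority class
ratio = imbalance_ratio(total_records - fraud_records, fraud_records)
print(round(ratio, 2))
```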


The last dataset is KKBox's Churn Prediction Challenge dataset from 2017. It includes more than 400 million records and the feature number is 30, of which 12 features were extracted on our own. The imbalance ratio of the training dataset is 14.58. The datasets are described in Table 1.

Table 1. Description of datasets

Dataset       No. of samples   No. of minority class   No. of majority class   No. of features   Imbalance ratio
Glass0        214              51                      163                     9                 3.19
Glass2        214              13                      201                     9                 15.47
Glass4        214              9                       205                     9                 22.81
Vehicle0      846              200                     646                     18                3.23
Yeast5        1484             44                      1440                    8                 32.78
Yeast6        1484             37                      1447                    8                 39.15
Credit card   284807           492                     284315                  30                577.88
KKBOXtrain    992931           63741                   929490                  30                14.58
KKBOXtest     970960           87330                   883630                  30                10.12

3.2 Results

Experiment 1: Since the six UCI data sets are small, there is no need to compare the running time of the two undersampling algorithms on them. We only compare the running time of the two undersampling algorithms on the credit card and KKBox user churn prediction data sets. In the European credit card transaction experiment, K-means undersampling took 801 s, while MBCDK-means took only 1 s to complete undersampling. In the KKBox experiment, a memory overflow occurred in the K-means undersampling process after running for 7 h. In contrast, MBCDK-means took only 3 h to complete the sampling process. The running time of MBCDK-means is thus much lower than that of K-means undersampling. At the same time, we used a C4.5 decision tree as the classifier on the 8 data sets to compare the classification performance after MBCDK-means undersampling and K-means undersampling respectively. Figure 3 shows the results of this comparative experiment. On these 8 data sets, the undersampling algorithm proposed in this paper achieves better classification performance than the K-means undersampling algorithm. It can be seen that the proposed algorithm deals with imbalanced datasets more effectively than the K-means undersampling algorithm.

Fig. 3. Comparison of classification performance of MBCDK-means and K-means undersampling (y-axis: AUC; classifier: C4.5)

Experiment 2: After MBCDK-means undersampling on the 8 datasets, the GA-ANN model and the ANN model were applied in turn to compare the classification performance of the two models. The experimental results are shown in Fig. 4. It can be seen that GA-ANN achieves better classification performance than ANN, which shows that the genetic algorithm is effective in improving ANN.

Experiment 3: The classification performance is compared between traditional machine learning methods and the classifier based on MBCDK-means undersampling and the GA-ANN model. The traditional machine learning models used in this experiment are the C4.5 classification tree, bagging, random forest, ANN and gradient boosting. The experimental results are shown in Fig. 5. On small datasets, the classification performances of the proposed algorithm, bagging, boosting and random forest are similar, while ANN and the C4.5 decision tree achieve poor results. On big datasets such as European credit card and KKBox user churn prediction, the proposed algorithm achieves better performance than the traditional machine learning algorithms.

Fig. 4. Comparison of classification performance of GA-ANN and ANN (y-axis: AUC; x-axis: datasets; series: GA-ANN, ANN)

Fig. 5. Comparison of classification performance between traditional classifiers and the classifier based on MBCDK-means undersampling and GA-ANN (y-axis: AUC; series: proposed, C4.5, ANN, RandomForest, Bagging, GradientBoosting)

4 Conclusion

This article proposes a new sampling method called MBCDK-means undersampling and a classification model named GA-ANN. MBCDK-means undersampling integrates mini batches into K-means resampling. In addition, this resampling method chooses samples according to the distribution of the cluster samples and the distance between the cluster centers of the majority class samples and that of the minority class samples. It retains the information of the original data distribution and increases the sampling rate of boundary samples, which improves the final classification performance. After balanced data have been acquired, they have to be classified. ANN is a common classifier, but it is easily trapped in a local minimum. That is why GA-ANN is presented. The genetic algorithm is a method to search for the optimal solution; introducing GA into ANN helps ANN to find better weights and biases. On 8 datasets, MBCDK-means achieves a better AUC than the K-means undersampling algorithm. In addition, the time and space requirements of MBCDK-means are much lower than those of K-means undersampling in the credit card and KKBox churn prediction experiments. Moreover, GA-ANN achieves better classification performance than ANN. Finally, the classification performance based on MBCDK-means and GA-ANN is competitive on the 6 small UCI datasets and better than traditional machine learning methods on the big datasets. The experimental results show that our method is efficient in imbalanced data classification.

References

1. Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 1–12 (2016)
2. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
4. Dong, Y., Wang, X.: A new over-sampling approach: random-SMOTE for learning from imbalanced data sets. In: Xiong, H., Lee, W.B. (eds.) KSEM 2011. LNCS (LNAI), vol. 7091, pp. 343–352. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25975-3_30
5. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005, Part I. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91
6. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. SMC-6(11), 769–772 (1976)
7. Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME 2001. LNCS (LNAI), vol. 2101, pp. 63–66. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48229-6_9
8. Lin, W.C., Tsai, C.F., Hu, Y.H., Jhang, J.S.: Clustering-based undersampling in class-imbalanced data. Inf. Sci. 5(8), 17–26 (2017)
9. Idris, A., Iftikhar, A., Rehman, Z.U.: Intelligent churn prediction for telecom using GP-AdaBoost learning and PSO undersampling. Cluster Comput. 1–15 (2017)

Evolutionary Tuning of a Pulse Mormyrid Electromotor Model to Generate Stereotyped Sequences of Electrical Pulse Intervals

Angel Lareo(✉), Pablo Varona, and F. B. Rodriguez

Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
{angel.lareo,pablo.varona,f.rodriguez}@uam.es

Abstract. Adjusting the parameters of a neural network model to reproduce complete sets of biologically plausible behaviors is a complex task, even in a well-described neural system. We show here a method for evolving a model of the mormyrid electromotor command chain to reproduce highly realistic temporal firing patterns as described by neuroethological studies in this system. Our method uses genetic algorithms for tuning unknown parameters in the synapses of the network. The developed fitting function simulates each evolved model under different network inputs and compares its output with the target patterns from the living animal. The obtained synaptic configuration can reveal new information about the functioning of electromotor systems.

Keywords: Genetic algorithms · Complex firing patterns · Neural models · Network parameter optimization · Information sequences · Pulse intervals · Electroreception

1 Introduction

Achieving the robustness and flexibility that shape characteristic temporal patterns of neural activation is a complex task that networks in the nervous system nevertheless seem to perform reliably. Mimicking these temporal patterns in models is not easy, particularly because the same network has to generate different patterns without changes in its structure. This holds even for simplified models with a reduced number of parameters. The main objective of this paper is to present an evolutionary method for adjusting the parameters of a model so that it reflects the different temporal structures of neural activation that occur in its biological counterpart.

Genetic algorithms (GAs), which are inspired by biological evolution, are a convenient tool for global optimization, including temporal matching [14]. GAs have been extensively applied to parameter adjustment in neuron models [8,20] and in modeled neural networks [19]. They have enabled improvements in robot locomotion [15,23] and in the development of biomimetic neuroprostheses [11]. Regarding temporal patterns of electrical activity, GAs have also been applied, for instance, to design sequences of neural stimulation that improve on clinically standard patterns in biophysically based models [6]. It is worth noting that multi-objective optimization using GAs to constrain a model by experimental data allows more flexible and realistic models [10]. In this paper, a method is presented for adjusting the parameters of a neural network model so that it reproduces a set of different temporal patterns of activity using the same topology.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 359–368, 2018. https://doi.org/10.1007/978-3-030-01421-6_35

The electromotor command network in pulse mormyrids, a family of weakly electric fish, is a well-known system [1] commonly used for studying information processing in the nervous system [12,16,17]. The rapid voltage transients (pulses) produced by the electric organ of these fish (known as electric organ discharges, or EODs) can be detected in the fish's surroundings. These EODs are 1:1 correlated with pulses of a neural ensemble known as the command nucleus (CN, Fig. 1). As a result, pulse mormyrids constitute a well-suited system for non-invasively monitoring a living nervous system over long periods of time. This system has other advantages. First, ethological studies have described stereotyped sequences of pulse intervals (SPIs) in these animals (see Sect. 1.2). Furthermore, temporal patterns produced in the EOD are related to overall fish behavior, for example aggression or courtship [5]. Finally, physiological studies have described the network topology of the electromotor system [3]. The neural ensembles responsible for the generation of different SPI patterns have also been described [2,4]. Information from these studies has been used to develop an initial model of this system (see Sect. 1.1). The topology of the network is composed of four neurons and five synapses [4,18].
The preliminary hand-tuned model was able to show some of the characteristics of the real system, but it was not able to reproduce the temporal structure of all the target SPIs with the described connectivity. Due to the intrinsic complexity of the network, hand-fitting to experimental data is a hard and time-consuming process [24], which in most cases fails to achieve the expected results. Even though several studies have successfully hand-tuned neural models [21], an automatic approach to model optimization has many advantages. In particular, it allows for searches that meet several requirements in shaping specific temporal patterns using modern high-performance computing systems. Two different GAs were used to improve the initial electromotor model and reproduce the temporal structure of all the target SPI patterns (see Sect. 2). These GAs are described in Sect. 2.1, and both use the same evaluation function, which is described in Sect. 2.2. The development of an adequate evaluation function is crucial, because this function guides the evolutionary process by scoring the individuals. Here, a function based on the mean squared error between the model output sequences and the target SPIs is presented. The convergence results of both GAs are shown in Sect. 3, along with a simulation of the best individual. These results are analyzed and discussed in Sect. 4.

1.1 Electromotor Command Network

The electromotor system commands the activity of the electric organ [2]. Each EOD is initiated by a pulse of the medullary command nucleus (CN) [4]. CN integrates influences mainly from two sources: the mesencephalic precommand nucleus (PCN) and the adjacent thalamic dorsal posterior nucleus (DP). After an EOD, motor outputs return to the command network through a corollary discharge pathway that activates the dorsal region of the ventroposterior nucleus (VPd). Finally, VPd provides inhibitory feedback to DP and PCN, regulating the resting electromotor rhythm. Figure 1 shows a simplified representation of this network.

Fig. 1. Stereotyped sequences of pulse intervals (scallop, acceleration, rasp and cessation) and simplified representation of the electromotor command network (center), based on [2,4]. Each SPI chart represents inter-pulse intervals (IPIs, Y axis) over time (X axis). In the schematic, the neurons (VPd, DP, PCN, CN) are connected by five synapses, three of which are excitatory (those ending in arrows) and two inhibitory (those ending in circles). The dashed line represents the corollary discharge pathway. CN is the output of the network. Colors relate each SPI pattern to the neuron ensemble involved: DP activation is related to accelerations, PCN activation to scallops, VPd activation to cessations, and activation of both DP and PCN to rasps.
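The connectivity described in Sect. 1.1 and Fig. 1 can be written down as a small data structure. The sketch below is only one reading of the text: it assumes that the three excitatory synapses are DP→CN, PCN→CN and the corollary discharge pathway CN→VPd, and that the two inhibitory ones are VPd→DP and VPd→PCN. The names `Synapse`, `SYNAPSES` and `inputs_of` are hypothetical and do not come from the authors' model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synapse:
    pre: str   # presynaptic nucleus
    post: str  # postsynaptic nucleus
    kind: str  # "excitatory" or "inhibitory"


NEURONS = ("VPd", "DP", "PCN", "CN")

# Assumed reading of Fig. 1: five synapses, three excitatory and two inhibitory.
SYNAPSES = (
    Synapse("DP", "CN", "excitatory"),
    Synapse("PCN", "CN", "excitatory"),
    Synapse("CN", "VPd", "excitatory"),   # corollary discharge pathway
    Synapse("VPd", "DP", "inhibitory"),
    Synapse("VPd", "PCN", "inhibitory"),
)


def inputs_of(neuron):
    """Return the synapses that target the given nucleus."""
    return [s for s in SYNAPSES if s.post == neuron]
```

For instance, `inputs_of("CN")` lists the two excitatory inputs that CN integrates, from DP and PCN.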

1.2 Stereotyped Sequences of Pulse Intervals (SPIs)

Sequences of EODs are not random. They are grouped into stereotyped patterns of pulse intervals [2,5]. Four SPI patterns have been well described: accelerations, scallops, rasps and cessations. Accelerations are prolonged decreases of the electrical inter-pulse intervals (IPIs) to a series of nearly regular shorter intervals, as a result of activation in the DP nucleus. This kind of pattern is variable both in its final duration and in the minimum IPI reached. Accelerations are related to aggressive behaviors. Scallops are sudden drops to very short IPIs followed by an immediate recovery, in which IPIs rapidly increase to regular values. PCN activation is related to this electrical signalling, which may function as an advertisement signal. Rasps have an initial sudden decrease to very short IPIs, similar to the ones observed in scallops, followed by a sustained slow increase like that in accelerations. Activation of both the DP and PCN nuclei leads to this EOD pattern, which is used by male fish for courtship. Cessations are stops in EOD generation that last for long periods of around one second. They have been related to both aggressive and submissive behavior. This firing modality is triggered by activation of VPd. The model was built to reproduce the temporal structure of these patterns as a function of the network inputs, without changing the network topology. Inputs were different stimuli that corresponded to the neuron ensemble activations described by the experimental studies. Due to the complexity of the task, an automatic method for adjusting the synaptic parameters was developed.
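Since all four SPIs are defined in terms of inter-pulse intervals, it may help to see how an IPI sequence is obtained from pulse times. The following is a minimal sketch, not part of the original model: the function names and the one-second cessation threshold are illustrative assumptions (the text only says that cessations last "around one second").

```python
import numpy as np


def inter_pulse_intervals(pulse_times):
    """Convert an ordered list of pulse (EOD) times into IPIs."""
    t = np.asarray(pulse_times, dtype=float)
    return np.diff(t)


def has_cessation(ipis, min_gap=1.0):
    """Flag a cessation-like gap; min_gap (seconds) is a hypothetical threshold."""
    return bool(np.any(np.asarray(ipis) >= min_gap))


# Toy pulse train (seconds): regular pulses, a short interval, then recovery.
times = [0.0, 0.1, 0.15, 0.35]
ipis = inter_pulse_intervals(times)
```

Here `ipis` contains one value fewer than `times`, and a target pattern such as a scallop would be expressed as a sequence of such values.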

2 Evolving the Network

We started from a previously developed electromotor model ([18] and unpublished work). Both the neuron and synapse models were initially hand-tuned to mimic the main characteristics of the real system. We improved the model through a trial-and-error process: manual fitting of individual parameters was followed by an analysis of the results obtained, which iteratively led to new changes in the parameters. Nevertheless, hand-fitting the synaptic parameters became almost impossible, as little is known about synaptic conductances in the system. As a result, a GA method was developed to refine the synaptic parameters. The synapse model selected describes the dynamics of synaptic conductances in terms of receptor binding [9]. The synaptic current received by the post-synaptic neuron is calculated as follows:

\[ I(t) = g \cdot r(t) \cdot (V_{post}(t) - E_{syn}) \]

where $g$ is the synaptic conductance, $V_{post}(t)$ the postsynaptic potential, $E_{syn}$ the synaptic reversal potential and $r(t)$ the ratio of bound chemical neurotransmitter receptors, which is given by:

\[ \dot{r} = \begin{cases} \alpha [T] (1 - r) - \beta r, & \text{if } t \le t_{max} \\ -\beta r, & \text{otherwise} \end{cases} \]


Fig. 2. Schematic representation of the GA ﬁtting process. Iterations continue until 100 generations are reached.

where $\alpha$ and $\beta$ are the forward and backward rate constants for transmitter binding and $[T]$ is the neurotransmitter concentration. As $[T]$ is described as a pulse in this model, it is at its maximum while $t \le t_{max}$, and $[T] = 0$ when $t > t_{max}$. Based on the insight gained from the manual fitting process, a set of synaptic parameters controlling the time evolution of the conductance was selected to be evolved. Four parameters of each synapse were modified: (i) $\alpha$, the forward rate constant (chemical neurotransmitter binding); (ii) $\beta$, the backward rate constant (chemical neurotransmitter unbinding); (iii) $g$, the synaptic conductance; (iv) $t_{max}$, the maximum release time. The validity range of each parameter was limited by different percentages (5%, 20% and 50%) relative to its initial value, which was taken from the hand-tuned model.

2.1 Genetic Algorithms
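To make concrete what the GAs in this section evolve, the receptor-binding synapse of Sect. 2 can be sketched numerically. The code below is only an illustration under stated assumptions: it uses forward-Euler integration, arbitrary units, and a rectangular transmitter pulse of height `transmitter`; the function names and all numerical values are hypothetical and are not taken from the authors' implementation. The four evolved quantities per synapse are `alpha`, `beta`, `g` and `t_max`.

```python
import numpy as np


def simulate_receptor_fraction(alpha, beta, t_max, transmitter=1.0,
                               dt=1e-4, t_end=0.05):
    """Forward-Euler integration of the bound-receptor ratio r(t).

    r' = alpha*[T]*(1 - r) - beta*r  while t <= t_max, and r' = -beta*r afterwards,
    with [T] modeled as a pulse that is zero once t > t_max.
    """
    n_steps = int(round(t_end / dt))
    r = np.zeros(n_steps + 1)
    for k in range(n_steps):
        t = k * dt
        conc = transmitter if t <= t_max else 0.0
        r[k + 1] = r[k] + dt * (alpha * conc * (1.0 - r[k]) - beta * r[k])
    return r


def synaptic_current(g, r, v_post, e_syn):
    """I(t) = g * r(t) * (V_post(t) - E_syn)."""
    return g * r * (v_post - e_syn)


# Illustrative values only: r rises during the pulse and decays afterwards.
r = simulate_receptor_fraction(alpha=500.0, beta=100.0, t_max=0.01)
```

In a GA individual, these four values per synapse would simply be concatenated into one real-valued vector, which is the representation described next.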

Different kinds of GAs and operators were tested: a simple GA (SGA) [13] and a steady-state GA (SSGA) [7]. Individuals in both GAs were different sets of parameters, and each parameter was represented by a real value. In both SGA and SSGA, the initial population was formed by clones of the initial hand-tuned model, provided as an input.

In SGA, each generation created an entirely new population of individuals. First, it selected individuals from the previous population by elitism (the best-fitting individual remained unchanged between generations) and roulette wheel selection (the fitness value of each individual determined its probability of being selected). Selected individuals were crossed to produce individuals for the new population. This process continued for 100 generations.

In SSGA, the initial population was created in the same way. In each generation, however, a temporary population was created and added to the previous population. Then, individuals were ranked and the worst of them were removed to return the population to its original size, with a 10% overlap between generations.

2.2 Evaluation Function

Each individual I in the population contained different synaptic parameter values for the model. I was formed by a set of 20 parameter values, the ones indicated above ($\alpha$, $\beta$, $g$, $t_{max}$) for each synapse (Fig. 1). In each generation, the parameters of every individual were used to build a model, which was then evaluated with a set of simulations (Fig. 2). Four different simulations (S) were defined, each one corresponding to a target SPI: acceleration ($S_{acc}$), scallop ($S_{sca}$), rasp ($S_{rasp}$) and cessation ($S_{cess}$). Each simulation S established the inputs to the network, and each I was simulated under all four simulation cases. The fitness function $F(I)$ of the overall individual was defined as the sum of the evaluation results under each case:

\[ F(I) = f_{acc}(I) + f_{sca}(I) + f_{rasp}(I) + f_{cess}(I) \]

The four target patterns P (acceleration, scallop, rasp and cessation) were defined in terms of an ordered sequence of IPIs ($P = p_0, \ldots, p_n$), where $p_i$ is each interval. To evaluate a pattern ($f_S(I)$, where S was one of the four simulations $S_{acc}$, $S_{sca}$, $S_{rasp}$, $S_{cess}$), the individual I was simulated and the output of CN was obtained in terms of IPIs: $P^S(I) = p^S(I)_0, \ldots, p^S(I)_m$. The evaluation searched the output for the sequence that best fit the target pattern, using the mean squared error (MSE). If $m < n$ (i.e. the number of IPIs in the simulation was smaller than that in the target pattern), the fit value was 0. Otherwise, the fitting value was calculated as follows:

\[ MSE = \min_l \left( \frac{\sum_{i=0}^{n} \left( p_i - p^S(I)_{l+i} \right)^2}{n} \right) \]

where 0

700, numerical issues arise due to the exponentials involved. Fortunately, there is a simple rule-of-thumb solution for both problems, which consists of applying a softmax function with "best guess" parameters several times in Eq. (4). This complicates the gradient, but as long as the final softmax function gives a sufficiently hard winner assignment, the learning rule (18) remains valid.
An Energy-Based Convolutional SOM Model

Software frameworks like TensorFlow can compute the gradient symbolically, so even the exact gradient can be used regardless of how often softmax was applied. We found that a three-fold application was always sufficient to guarantee a unique winner selection.

The parameter $S_0$ is usually made to depend on the map size. A rule of thumb that always worked well is to choose it proportional to the diagonal of the square K × K map, i.e., $S_0 = K/4$. In contrast, classification experiments always give better results the smaller $S_\infty$ is, so it is always fixed at a small value such as $S_\infty = 0.01$. The values of $t_0$, $t_A$ and $t_\infty$ can be determined empirically by requiring that (i) self-adaptation has occurred before $t_A$, (ii) the energy function has converged to a stable value before $t_0$, and (iii) the energy function is as low as possible while still satisfying all constraints at $t_\infty$. Here we see the value of an energy function: because it can be used to determine convergence, these parameters, which for SOMs have to be obtained by visual inspection, can be determined by cross-validation. By similar reasoning, a good value for the learning rate can be obtained; smaller values are always acceptable but lead to increased training time. The mini-batch size is generally assumed to be N = 1 in this article. The self-adaptation rate, αd N, should be chosen such that the constraints are approximately upheld during prototype adaptation, meaning that it depends on the choice of $\alpha$ and is thus not a free parameter but can be obtained indirectly by cross-validation.
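The repeated softmax mentioned above can be illustrated in isolation. Eq. (4) is not reproduced in this section, so the following is only a generic sketch of the idea, not the ReST implementation: the function names and the sharpness parameter `beta` (standing in for the "best guess" parameters) are assumptions. Subtracting the maximum before exponentiating avoids the overflow that occurs when arguments to the exponential exceed roughly 700 in double precision.

```python
import numpy as np


def stable_softmax(x, beta=1.0):
    """Softmax with the maximum subtracted so exp() never sees large positive values."""
    z = beta * (np.asarray(x, dtype=float) - np.max(x))
    e = np.exp(z)
    return e / e.sum()


def repeated_softmax(x, beta=20.0, times=3):
    """Apply the softmax several times; each pass sharpens the winner assignment."""
    p = np.asarray(x, dtype=float)
    for _ in range(times):
        p = stable_softmax(p, beta)
    return p


# Two nearly tied inputs: one pass is still soft, three passes are almost one-hot.
scores = np.array([1.0, 0.99])
once = stable_softmax(scores, beta=20.0)
thrice = repeated_softmax(scores, beta=20.0, times=3)
```

With these illustrative numbers, a single pass leaves the two entries close to each other, whereas three passes concentrate almost all of the mass on the first entry, which is the kind of hard winner assignment the learning rule relies on.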

3 Experiments

The ReST model used in all experiments is implemented in Python using TensorFlow 1.5 [1]. The gradients (18, 15) are computed automatically by the software. Energy minimization is done by plain stochastic gradient descent, although more advanced optimizers minimize the ReST energy function equally well.

3.1 Self-organization and Self-adaptation in the ReST Model

In this section we will demonstrate that the ReST model, while differing from both the original SOM model [7] and the energy-based "Heskes model" [6], achieves the same basic type of prototype self-organization. At the same time, we will demonstrate the effectiveness of ReST's self-adaptation process as described in Sect. 2.1 and comment on its beneficial effects. To this end, we conduct simulations with the dataset described in Sect. 2. ReST parameters are chosen as follows (in the terms of Sect. 2.1): $K = 10$, $T = 40000$, $t_A = 5000$, $t_0 = 10000$, $t_\infty = 30000$, $S_0 = K/4$, $S_\infty = 0.1$, $\alpha_d = 0.01$, $\alpha = 0.05$, $e^\sigma = 3$ and $e^\mu = 0.1$. After ReST convergence at $t_\infty$, statistics are collected for 5000 iterations and subsequently evaluated. Histograms of all neural activities during these 5000 iterations are computed and compared to the theoretical log-normal distribution determined by $\mu$ and $\sigma$. From Fig. 1, it can be observed that self-organization proceeds in exactly the same manner as in a SOM, starting with a coarse "global ordering" of prototypes followed by refinement as S(ν) is decreased. This shows that ReST performs essentially the same function as a SOM, only with convergence in 2D guaranteed and with a self-adaptation process that gives a probabilistic interpretation to the computed activities. As can be seen in Fig. 2, the fit between the theoretical and measured distributions is generally acceptable for all datasets, although a perfect fit is of course not to be expected. This is because we only fit the first two moments of the log activities to defined values. For a better fit, at least the third moment of the log activities would have to be controlled, which would however result in a more complex constrained optimization scheme. Figure 1 shows that this homogeneity is achieved by quite heterogeneous settings of the per-neuron parameters $o_i$ and $s_i$, see Eq. (1).
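The comparison between measured activities and the theoretical log-normal density can be sketched as follows. This is a minimal illustration with synthetic data rather than ReST activities, and it assumes that $e^\mu$ and $e^\sigma$ denote the geometric mean and geometric standard deviation of the activities (the quantities shown in Fig. 1); the function names are hypothetical.

```python
import numpy as np


def geometric_stats(activities):
    """Geometric mean and geometric standard deviation (first two moments of the logs)."""
    logs = np.log(np.asarray(activities, dtype=float))
    return np.exp(logs.mean()), np.exp(logs.std())


def lognormal_pdf(a, mu, sigma):
    """Density of a log-normal distribution with log-mean mu and log-std sigma."""
    a = np.asarray(a, dtype=float)
    return np.exp(-(np.log(a) - mu) ** 2 / (2.0 * sigma ** 2)) / (
        a * sigma * np.sqrt(2.0 * np.pi))


# Synthetic stand-in for recorded activities, using the target values of this section.
mu, sigma = np.log(0.1), np.log(3.0)
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=mu, sigma=sigma, size=200_000)
geo_mean, geo_std = geometric_stats(samples)
```

A histogram of `samples` (for example via `np.histogram(samples, density=True)`) could then be overlaid with `lognormal_pdf` evaluated at the bin centers, in the spirit of Fig. 2.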


A. Gepperth et al.

Fig. 1. Upper two rows: Diﬀerent stages of ReST training on the MNIST dataset. Upper row, from left to right: ReST prototypes with long-term geometric activity averages superimposed on them for times t = 7000, 12000, 24000. Middle row, from left to right: ReST prototypes with long-term geometric standard deviation averages superimposed on them for times t = 7000, 12000, 24000. We observe that activity averages and deviations are strictly adhered to, as well as the SOM-like topological organization of prototypes. Lower row: distribution of per-neuron parameters oi and si after convergence of the ReST layer at iteration 24000.

Fig. 2. Activity histograms for neuron (4, 4) in a ReST layer trained on MNIST both for the case of enabled (left) and disabled (right) self-adaptation. The theoretical lognormal density is superimposed onto the histograms as a solid green line, showing a very good match.

3.2 Convolutional ReST Experiments

As with CNNs, convolutional ReST layers have a great number of possible configurations for the filter sizes $(f_x^H, f_y^H)$ and step sizes $(\Delta_x^H, \Delta_y^H)$, of which we can test only a few. Experimental outcomes are the learned filters for each configuration as shown in Fig. 3, where we see that ReST performs both topological organization (as in a SOM) and feature extraction (as in a CNN layer).

Fig. 3. Prototypes for convolutional/independent ReST architectures (left to right), defined by $f_x^H$, $\Delta_x^H$, $y$, $x$: ind-14-7-0-0, ind-14-7-1-1, ind-14-7-2-2, conv-14-7, ind-7-3-3-3, ind-7-3-6-6.

3.3 Intuitive Interpretation of the Self-adaptation Process

To better understand what the self-adaptation mechanism in ReST actually does, we create a set of 10,000 two-dimensional data points $x_i \in \mathbb{R}^2$ drawn from a normal distribution with mean $\mu = (0.5, 0.5)^T$ and standard deviation $\Sigma = 0.15$. We subsequently train a non-convolutional ReST layer of size K = 10 using the parameters of Sect. 3.1. The final prototype positions and the values of the per-neuron parameters $s_i$ and $o_i$ are shown in Fig. 4 and reveal the following:
– Where data points are more dense (sparse), overall offsets $o_i$ are lower (higher). This is intuitive, since prototypes that react to less frequently occurring samples need a higher offset to maintain a constant average activity.
– Where data points are more dense (sparse), selectivities $s_i$ are higher (lower), meaning that a neuron will react less (more) strongly to nearby samples.

Fig. 4. Prototype positions overlaid in color with per-neuron parameters oi (left) and si (right) when training a ReST layer on a 2D normal distribution.


Fig. 5. ReST execution speed depending on batch size and map size, measured on: CPU without updating(left), GPU without updating (middle), GPU with updating (right).

This is intuitive as well, since a higher number of nearby samples would mean a near-constant activity with low variance if neurons could not become more selective in their reactions. These results show that ReST neurons can adapt to the sample density in their Voronoi cell, a behavior that closely mimics self-adaptation mechanisms in biological neurons.

3.4 Implementation and GPU Speed-Up

Unless otherwise stated, benchmark experiments always use the following parameters: map size $H^H = W^H = 10$, $\Sigma_0 = 2$, 0 = 0.1, ∞ = 0.0001, $\Sigma_\infty = 0.01$, Tconv = 3000, Tconv = 10000. We compare the execution time per sample by feeding the ReST model 2000 randomly selected samples, running it either on CPU or on GPU (NVIDIA GeForce 1080), and vary the map size $W^H = H^H \in \{10, 15, 20, 30, 50\}$ and the input batch size $N^I \in \{1, 5, 10, 20, 50, 100\}$ independently. The results in Fig. 5 show, first of all, that GPU acceleration is most effective at high batch sizes and amounts to a factor of roughly 10–20 w.r.t. CPU speed. Secondly, as expected, the GPU is saturated for high batch and map sizes, so that parallelization yields no further speed improvements. Lastly, updating the ReST layer incurs a heavy speed penalty even on GPU, probably because of the convolution in Eq. (4).

4 Discussion and Conclusion

The experiments of the last section have shown that the energy-based ReST model is efficient and can profit from GPU acceleration, that it behaves as one would expect a SOM to behave, and that the self-adaptation process is both feasible and leads to a clear probabilistic interpretation of ReST activities. We believe that the new ReST model (in its non-convolutional form) can be used as a drop-in replacement anywhere SOMs are used, albeit in a much more intuitive way, because both the ReST energy function and the ReST activities have a clear interpretation. In its convolutional form, ReST layers can be stacked in deep hierarchies, which we believe can be a very interesting approach when creating "deep" versions of incremental learning methods as proposed, e.g., in [3].


References

1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, vol. 16, pp. 265–283 (2016)
2. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: ordering, convergence properties and energy functions. Biol. Cybern. 67(1), 47–55 (1992)
3. Gepperth, A., Karaoguz, C.: A bio-inspired incremental learning architecture for applied perceptual problems. Cogn. Comput. 8(5), 924–934 (2016)
4. Graepel, T., Burger, M., Obermayer, K.: Self-organizing maps: generalizations and new optimization techniques. Neurocomputing 21(1–3), 173–190 (1998)
5. Heskes, T.: Energy functions for self-organizing maps. In: Kohonen Maps, pp. 303–315. Elsevier (1999)
6. Heskes, T.M., Kappen, B.: Error potentials for self-organization. In: 1993 IEEE International Conference on Neural Networks, pp. 1219–1223. IEEE (1993)
7. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Haykin, S., Kosko, B. (eds.) Intelligent Signal Processing, pp. 306–351. IEEE Press
9. Lefort, M., Hecht, T., Gepperth, A.: Using self-organizing maps for regression: the importance of the output function. In: European Symposium on Artificial Neural Networks (ESANN) (2015)
10. Tolat, V.: An analysis of Kohonen's self-organizing maps using a system of energy functions. Biol. Cybern. 64(2), 155–164 (1990)

A Hierarchy Based Influence Maximization Algorithm in Social Networks

Lingling Li, Kan Li(✉), and Chao Xiang

Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing, China
{2120161008,likan}@bit.edu.cn, [emailprotected]

Abstract. Influence maximization refers to mining the top-K most influential nodes from a social network so as to maximize the final propagation of influence in the network, and it is one of the key issues in social network analysis. It is a discrete optimization problem and is NP-hard under both the independent cascade and linear threshold models. Existing research shows that although the greedy algorithm can achieve an approximation ratio of $(1 - 1/e)$, its time cost is high. Heuristic algorithms can improve the efficiency, but they sacrifice a certain degree of accuracy. In order to improve efficiency without sacrificing much accuracy, in this paper we propose a new approach called the Hierarchy based Influence Maximization algorithm (HBIM for short) to mine the top-K influential nodes. It is a two-phase method: (1) an algorithm for detecting information diffusion levels based on the first-order and second-order proximity between social nodes; (2) a dynamic programming algorithm for selecting levels in which to find influential nodes. Experiments show that our algorithm outperforms the benchmarks.

Keywords: Social networks · Influence maximization · Hierarchy

1 Introduction

Influence maximization is to find K nodes (called the seeds) in a social network such that the expected spread of influence is maximized by activating these nodes. Kempe et al. [9] first formulated influence maximization as a discrete optimization problem. They also proposed a greedy algorithm that approximates the optimal solution within a factor of $(1 - 1/e)$, which is the best approximation guarantee one can hope for according to Feige's approximation threshold for max k-cover [5]. However, because of the large scale of social network data, greedy algorithms are inefficient even though their accuracy is high. Although many more efficient algorithms [2, 3, 8, 10, 11, 13] have been proposed, most methods still spend a lot of time calculating the expected spread of influence.

In this paper, we propose a new approach called the Hierarchy based Influence Maximization algorithm. The basic idea is that nodes rely on social groups to spread information and that nodes with similar social attributes are more likely to influence each other. We describe a group of nodes by a hierarchical structure that is segmented into levels according to the belonging coefficients of the nodes. The belonging coefficients [1] reflect the strength of the relation between a node and its social group. Nodes with similar belonging coefficients are in the same level. Intuitively, a level is a densely connected subset of nodes that are only sparsely linked to the remaining network [1, 6]. If we search for influential nodes in these levels instead of in the whole network, the prohibitive cost of mining K seeds is greatly reduced.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 434–443, 2018. https://doi.org/10.1007/978-3-030-01421-6_42

The method we propose has two phases. First, we assume that the information publisher is known and calculate the belonging coefficients by a random walk according to the first-order proximity and the second-order proximity between nodes; the nodes are then segmented into levels by linear regression and dynamic programming. The first-order proximity [14] describes whether two nodes share an edge, that is, their pairwise proximity. The second-order proximity [14] describes whether two nodes have common neighbors, that is, the degree of similarity between the neighborhood structures of a pair of nodes. Second, we use dynamic programming to find the level in which the influential nodes lie. To find the influential nodes within that level, we use the expected spread value of nodes instead of the traditional Monte-Carlo simulation to calculate the optimization function, which significantly improves the efficiency of our algorithm. Our method achieves good performance compared with benchmark algorithms on three real-world datasets.

In summary, the contributions of the paper are as follows.

1. We propose a new approach called Hierarchy based Influence Maximization. The method exploits the first-order and second-order proximity between nodes to detect information diffusion levels, and then mines the influential nodes from these levels by dynamic programming. To the best of our knowledge, we are among the first to use both the first-order and second-order proximity to divide the levels, which is robust to sparse networks.
2. We conduct experiments on three real-world datasets. Compared with the benchmark algorithms, our algorithm performs better at mining influential nodes in social networks.
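The two proximity notions introduced above can be illustrated on a toy graph. The paper's own definitions of the proximities are given later in Sect. 3.2, so the following is only a hedged sketch: the first-order proximity is taken as a 0/1 edge indicator, and the second-order proximity is approximated by the Jaccard overlap of the two neighbor sets, which is an assumption made here for illustration. The function names and the toy graph are hypothetical.

```python
def first_order_proximity(adj, u, v):
    """1.0 if u and v share an edge, otherwise 0.0."""
    return 1.0 if v in adj[u] else 0.0


def second_order_proximity(adj, u, v):
    """Similarity of neighborhoods, here the Jaccard overlap (an illustrative choice)."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Toy undirected graph stored as a dict of neighbor sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

In this toy graph, nodes "a" and "d" share no edge but have the common neighbor "c", so their first-order proximity is zero while their second-order proximity is positive; this is the situation in which using both measures helps on sparse networks.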

2 Related Work

The influence maximization problem was first proposed by Domingos and Richardson [4]. Later, Kempe et al. [9] formulated it as a discrete optimization problem and proposed a greedy hill-climbing algorithm to approximate the optimal solution. Leskovec et al. [11] proposed the CELF (Cost-Effective Lazy Forward) scheme, an optimized greedy algorithm that reduces the running time. However, since greedy algorithms use Monte-Carlo simulation to accurately calculate the influence of candidate nodes, the running time increases sharply as the network size grows. As a result, some researchers began to consider heuristics. Chen et al. [3] presented the DegreeDiscount heuristic, which assumes that influence propagation is related to the degree of nodes and reduces the running time without sacrificing too much accuracy. In recent years, other heuristics have been proposed, such as SIMPATH [7], IRIE (Influence Rank Influence Estimation) [8] and TIM [13]. Although these heuristics improve the efficiency to some extent, their accuracy is more or less affected. Given that nodes disseminate information and influence each other in the form of social groups, some researchers began to study the influence maximization problem from the perspective of community structure. Wang et al. [15] proposed the CGA (Community-based Greedy Algorithm) for mining the top-K influential nodes. Unlike other community detection methods, CGA takes into account information diffusion between nodes when detecting communities. Zhu et al. [16] put forward the hierarchical community structure based algorithm (HCSA) for influence maximization, which achieves a wider influence spread and a shorter running time than heuristic algorithms. The aforementioned approaches handle the efficiency or accuracy issues by improving greedy or heuristic algorithms or by leveraging the community structure of social networks, but none of them takes into consideration the hierarchical structure of nodes according to the first-order and second-order proximity.

3 Method

3.1 Problem Definition

In order to clarify the main idea, we list the major notations used in the paper in Table 1.

Table 1. Major notations used in the paper.

Notation      Description
G(V, E)       A social network graph
V             The set of nodes, |V| = N
E             The set of edges
pp            The propagation probability
K             The seed set size
I_k           The set of influential nodes obtained in the previous k steps
M             The number of levels
level_m       The mth level
R(I_k)        The influence degree in G of set I_k
R_m(I_k)      The influence degree in level_m of set I_k

In this paper, we use the Independent Cascade (IC) model as the information diffusion model. The principle of the IC model can be stated as follows. Given a social network graph $G(V, E)$, $V$ represents the set of nodes ($|V| = N$), $E$ represents the set of edges ($E = \{(u, v) \mid u, v \in V\}$), and $pp$ represents the propagation probability: for each edge $(u, v) \in E$, $pp_{uv} \in [0, 1]$. In the IC model, if a node changes from an inactive state to an active state at time $t$, it has only one opportunity to try to activate each of its inactive neighbor nodes. Moreover, once a node is activated, it remains active for the whole process. The process terminates when no more new nodes are activated.
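The diffusion process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `simulate_ic` and `estimate_spread`, the adjacency-dictionary graph format and the callable `pp` are assumptions made only for this illustration.

```python
import random


def simulate_ic(graph, seeds, pp, rng):
    """Run one Independent Cascade simulation and return the set of active nodes.

    graph: dict mapping each node to a list of its out-neighbours (assumed format).
    seeds: iterable of initially active nodes.
    pp:    function (u, v) -> propagation probability of edge (u, v).
    rng:   random.Random instance, so that runs are reproducible.
    """
    active = set(seeds)
    frontier = list(active)          # nodes activated in the previous step
    while frontier:                  # stop when no new node was activated
        newly_active = []
        for u in frontier:
            for v in graph.get(u, []):
                # u gets exactly one attempt on each still-inactive neighbour v
                if v not in active and rng.random() < pp(u, v):
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


def estimate_spread(graph, seeds, pp, runs=1000, seed=0):
    """Average number of active nodes over `runs` independent simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, pp, rng)) for _ in range(runs))
    return total / runs
```

Averaging over many runs, as `estimate_spread` does, is the Monte-Carlo estimate of the influence spread; its cost grows with both the number of runs and the size of the cascades.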

3.2 The Hierarchy Based Influence Maximization

The Hierarchy Based Influence Maximization algorithm proposed in this paper is divided into two stages. The first stage divides the information diffusion into levels, and the second stage finds influential nodes in each level.

Detection of Information Diffusion Levels. To better measure the closeness of other nodes with respect to the information publisher, we use a random walk to calculate the belonging coefficient of other nodes relative to the source node (publisher), which ensures the comparability and continuity of results when detecting the information diffusion levels. The main idea of a random walk is that, at each step, a walker standing at a node selects one of its neighbors to move to according to a transition probability. As the walk proceeds, the probability of reaching a node gradually decreases. This process ensures that every node has a path to the source node, thus ensuring the continuity of information diffusion. The measurement of the belonging coefficient consists of the first-order proximity and the second-order proximity between the nodes. $S^1_{ij}$ and $S^2_{ij}$ denote the first-order proximity and the second-order proximity between nodes $v_i$ and $v_j$, respectively, and are defined as follows:

$$S^1_{ij} = \frac{A_{ij}}{d_i} \tag{1}$$

$$S^2_{ij} = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|} \tag{2}$$

Here, $A_{ij}$ indicates whether there is an edge between node $v_i$ and node $v_j$: $A_{ij} = 1$ if the edge exists and $A_{ij} = 0$ otherwise. $d_i$ denotes the degree of node $v_i$. $N(i)$ and $N(j)$ represent the neighbor sets of nodes $v_i$ and $v_j$, respectively. We then define the transition probability $p_{ij}$ between two nodes $v_i$ and $v_j$:

$$p_{ij} = \alpha S^1_{ij} + (1 - \alpha) S^2_{ij} \tag{3}$$
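As an illustration of Eqs. (1)-(3), the following Python sketch builds the transition probabilities for a small undirected graph. The function name `transition_matrix`, the neighbour-set input format and the choice to skip self-transitions are assumptions of this sketch, not details given in the paper.

```python
def transition_matrix(adj, alpha):
    """Build P with p_ij = alpha * S1_ij + (1 - alpha) * S2_ij (Eqs. 1-3).

    adj:   dict mapping each node to the set of its neighbours (undirected, assumed format).
    alpha: the adjustment factor in [0, 1].
    Returns a dict of dicts: P[i][j] for i != j (self-transitions are skipped by assumption).
    """
    nodes = sorted(adj)
    P = {i: {} for i in nodes}
    for i in nodes:
        d_i = len(adj[i])
        for j in nodes:
            if i == j:
                continue
            # first-order proximity: A_ij / d_i (A_ij = 1 only if j is a neighbour of i)
            s1 = (1.0 / d_i) if j in adj[i] else 0.0
            # second-order proximity: overlap of the two neighbour sets
            union = adj[i] | adj[j]
            s2 = len(adj[i] & adj[j]) / len(union) if union else 0.0
            P[i][j] = alpha * s1 + (1 - alpha) * s2
    return P
```

For a triangle graph with `alpha = 0.5`, each pair of nodes gets 0.5 * 1/2 + 0.5 * 1/3. The sketch does not normalise the rows of P; whether the original method does so is not stated in the text.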

Here, the corresponding matrix $P$ is called the transition matrix, and the adjustment factor $\alpha \in [0, 1]$. Given an information publisher $s$, the probability of walking from node $v_i$ to node $s$ within $T$ steps is the belonging coefficient:

$$C_s^T = \sum_{t=1}^{T} q_s^t(i) \tag{4}$$

where $q_s^t(i) = \sum_{j=1}^{N} q_s^{t-1}(j) Q_{ij}$, and $Q$ equals the transition matrix except that $Q_{si} = 0$ for $i = 1, \ldots, N$. The belonging coefficient measures the closeness of node $v_i$ relative to the source node $s$. Nodes with more paths and fewer steps from the source node have a higher belonging coefficient. We sort the nodes according to the belonging coefficient and use $L$ to represent the resulting sequence. Nodes at the same level tend to form a line segment, and the combination of multiple line segments constitutes the entire sequence. We use linear regression to fit the line segment of each level. Because there are more


connections within the information hierarchy than outside it, the gaps in the ranked belonging coefficients can represent the boundaries of the information diffusion levels. For a given $M$ (the number of levels to be detected), we choose $(M - 1)$ breakpoints to minimize the inconsistency. To solve this problem, we design an algorithm based on dynamic programming. We define the state $f_{ij}$ as the minimum inconsistency of the first $j$ nodes divided into $i$ levels. Assuming that $f_{i-1,j'}$ $(j' = 1, 2, \ldots, j)$ is known, we can calculate $f_{ij}$ by enumerating the last breakpoint. The transition function for $f_{ij}$ is defined as follows:

$$f_{ij} = \min_{x} \{ f_{i-1,x} + cost(x + 1, j) \}, \quad x = i - 1, \ldots, j - 1 \tag{5}$$

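Here, $cost(p, q)$ represents the minimum residual error of fitting the part of the sequence from the $p$-th node to the $q$-th node, which is defined as follows:

$$cost(p, q) = \sum_{i=p}^{q} \left( y(i) - L(i) \right)^2 \tag{6}$$

$$y(i) = x i + b \tag{7}$$

$$x = \frac{n \sum_i i L(i) - \sum_i i \sum_i L(i)}{n \sum_i i^2 - \left( \sum_i i \right)^2} \tag{8}$$

$$b = \frac{\sum_i i^2 \sum_i L(i) - \sum_i i \sum_i i L(i)}{n \sum_i i^2 - \left( \sum_i i \right)^2} \tag{9}$$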

Here, $i = p, \ldots, q$ and $n = q - p + 1$. We then use $g_{ij}$ to record the breakpoint selected by $f_{ij}$:

$$g_{ij} = \arg\min_{x} \{ f_{i-1,x} + cost(x + 1, j) \}, \quad x = i - 1, \ldots, j - 1 \tag{10}$$
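The segmentation step of Eqs. (5)-(10) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `fit_cost` and `segment` are hypothetical, the sequence is indexed from 1 as in the text, the fit uses the standard least-squares slope and intercept, and the recorded breakpoint is the one that attains the minimum in Eq. (5).

```python
def fit_cost(L, p, q):
    """Residual sum of squares of a least-squares line fitted to L[p..q] (1-based, inclusive)."""
    idx = range(p, q + 1)
    n = q - p + 1
    s_i = sum(idx)
    s_ii = sum(i * i for i in idx)
    s_l = sum(L[i - 1] for i in idx)
    s_il = sum(i * L[i - 1] for i in idx)
    denom = n * s_ii - s_i * s_i
    if denom == 0:                         # a single point is fitted exactly
        return 0.0
    x = (n * s_il - s_i * s_l) / denom     # slope, Eq. (8)
    b = (s_ii * s_l - s_i * s_il) / denom  # intercept, Eq. (9)
    return sum((x * i + b - L[i - 1]) ** 2 for i in idx)


def segment(L, M):
    """Split the sorted sequence L into M consecutive levels with minimum total fit cost.

    Returns (total_cost, breakpoints); each breakpoint is the 1-based index of the
    last node of a level (the final level always ends at len(L)). Assumes M <= len(L).
    """
    n = len(L)
    inf = float("inf")
    f = [[inf] * (n + 1) for _ in range(M + 1)]
    g = [[0] * (n + 1) for _ in range(M + 1)]
    f[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(i, n + 1):
            for x in range(i - 1, j):             # enumerate the last breakpoint, Eq. (5)
                c = f[i - 1][x] + fit_cost(L, x + 1, j)
                if c < f[i][j]:
                    f[i][j], g[i][j] = c, x       # remember the minimiser, Eq. (10)
    breaks, j = [], n
    for i in range(M, 1, -1):                     # walk back through g
        j = g[i][j]
        breaks.append(j)
    return f[M][n], sorted(breaks)
```

For example, `segment([10, 9, 8, 3, 2, 1], 2)` places the single breakpoint after the third node, because both halves then lie exactly on a line. The sketch recomputes every cost from scratch, so it is cubic in the sequence length for each level; prefix sums would reduce this, at the price of a longer example.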

By iterating this computation, we obtain the division of the information diffusion levels from the breakpoints stored in $g$. We set the final number of levels according to the stability of the hierarchical structure, which is determined by the fitness function

$$FL_m = Fitness(C_m) = \frac{d_{in}}{d_{in} + d_{out}}, \qquad C_m = \bigcup_{m'=1}^{m} level_{m'}$$

where $C_m$ consists of the top-$m$ levels, and $d_{in}$ and $d_{out}$ represent the internal and external degrees of the nodes in these levels, respectively. A hierarchical structure with a locally maximal quality is considered stable. As $M$ increases, more segments can better match the order of the nodes' belonging coefficients, but the structure's inconsistency also increases. Therefore, we use as few line segments as possible to obtain a good hierarchical structure.

Finding the Influential Nodes. In stage one, we divided the social network into levels. The remaining challenge is to choose the level in which to find each of the top-K influential nodes. We use a dynamic programming algorithm to select the level in which the $k$-th $(k \in [1, K])$ influential node lies. Let $I_{k-1}$ denote the set of influential nodes obtained in the


previous $(k - 1)$ steps. If the $k$-th node is mined in $level_m$, the maximal increase $\Delta R_m$ of the influence degree of $level_m$ is calculated as follows:

$$\Delta R_m = \max \{ R_m(I_{k-1} \cup \{j\}) - R_m(I_{k-1}) \mid j \in level_m \} \tag{11}$$

$$R(I_k) = \frac{\sigma(I_k)}{N} \tag{12}$$

Here, $\sigma(I_k)$ represents the number of nodes influenced by the set $I_k$ in the process of information dissemination, and $N$ is the number of network nodes. To find the $k$-th influential node, we need to select the level that produces the largest increment of influence among all levels. Let $R[m, k]$ ($m \in [1, M]$ and $k \in [1, K]$) denote the influence degree of mining the $k$-th influential node in the first $m$ levels. We have:

$$R[m, k] = \max \{ R[m - 1, k],\; R[M, k - 1] + \Delta R_m \}, \qquad R[m, 0] = 0,\; R[0, k] = 0 \tag{13}$$

We select one of the first $m$ levels to mine the $k$-th influential node. We use a sign function $s[m, k]$ to record the selected level, which is defined as follows:

$$s[m, k] = \begin{cases} s[m - 1, k], & R[m - 1, k] \ge R[M, k - 1] + \Delta R_m \\ m, & R[m - 1, k] < R[M, k - 1] + \Delta R_m \end{cases} \qquad s[0, k] = 0 \tag{14}$$
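For a fixed $k$, the recurrence for R[m, k] and the sign function s[m, k] given above amount to a single pass over the levels. The Python sketch below illustrates this; the function name `select_level` and the way the gains are passed in as a list are assumptions made for the example, and the gains themselves are taken as given.

```python
def select_level(delta_r, r_prev):
    """Pick the level for the k-th seed following the recurrence for R[m, k] and s[m, k].

    delta_r: list where delta_r[m - 1] is the maximal gain Delta R_m of level m.
    r_prev:  R[M, k - 1], the influence degree reached with the first k - 1 seeds.
    Returns (R[M, k], s[M, k]) with levels numbered from 1.
    """
    r, s = 0.0, 0                      # boundary values R[0, k] = 0 and s[0, k] = 0
    for m, gain in enumerate(delta_r, start=1):
        candidate = r_prev + gain      # R[M, k - 1] + Delta R_m
        if r < candidate:              # strict inequality: switch to level m
            r, s = candidate, m
        # otherwise keep R[m - 1, k] and s[m - 1, k]
    return r, s
```

Because the switch requires a strict inequality, ties between levels are resolved in favour of the earlier level, for example `select_level([0.2, 0.2], 0.5)` selects level 1.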

After finding the level $s[m, k]$ in which the $k$-th influential node is located, we need to find this influential node in the level and add it to the seed set. When calculating $\sigma(I_k)$, existing influence maximization algorithms mostly use Monte Carlo simulation to compute the average influence of a solution set, which is time-consuming. In this paper, we use the expected spread value of $I_k$ instead of Monte Carlo simulation when calculating $\sigma(I_k)$ to reduce the computational cost. Let $NB(I_k)$ denote the one-hop area of the set $I_k$, and let $E$ denote the set of edges of the social network. Then $NB(I_k)$ is defined as follows:

$$NB(I_k) = \{ u \mid u \in I_k \} \cup \{ v \mid \exists u \in I_k,\; \overrightarrow{uv} \in E \} \tag{15}$$

$$\sigma(v) = \left| \{ u \mid u \in I_k,\; \overrightarrow{uv} \in E \} \right| \tag{16}$$

We extract the edges among the nodes in $NB(I_k)$ to form a subgraph of graph $G$. Then, given a small propagation probability $pp$ in the IC model, we use the expected number of nodes activated by $I_k$ in the one-hop area as the fitting function $\sigma(I_k)$, which is given as follows:

$$\sigma(I_k) = |I_k| + \sum_{v \in NB(I_k) \setminus I_k} \left( 1 - (1 - pp)^{\sigma(v)} \right) \tag{17}$$
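The expected one-hop spread of Eq. (17) can be computed without any simulation, as the following Python sketch shows. The function name `one_hop_spread`, the out-edge dictionary and the use of a single probability `pp` for all edges are assumptions of this illustration.

```python
def one_hop_spread(out_edges, seeds, pp):
    """Expected number of active nodes within one hop of `seeds` (Eqs. 15-17).

    out_edges: dict mapping each node u to the set of nodes v with an edge u -> v (assumed format).
    seeds:     the seed set I_k.
    pp:        a single small propagation probability shared by all edges (assumption).
    """
    seeds = set(seeds)
    in_seed_count = {}                       # number of seeds pointing to each non-seed v, Eq. (16)
    for u in seeds:
        for v in out_edges.get(u, ()):
            if v not in seeds:
                in_seed_count[v] = in_seed_count.get(v, 0) + 1
    # a non-seed neighbour stays inactive only if every incoming attempt fails
    return len(seeds) + sum(1 - (1 - pp) ** c for c in in_seed_count.values())
```

With seeds {1, 2}, edges 1 -> 3, 1 -> 4, 2 -> 3 and `pp = 0.5`, node 3 is reached with probability 0.75 and node 4 with probability 0.5, so the expected spread is 3.25. The estimate ignores activations beyond one hop, which is why the text restricts it to a small propagation probability.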

To sum up, we first use dynamic programming to find the level in which the influential node lies, and then we find the influential node in that level and add it to the seed set. This process is repeated until the size of the seed set reaches the target $K$.

4 Experiments

4.1 Datasets

We adopt three datasets, Facebook, Twitter and Epinions, downloaded from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/index.html), as experimental datasets. Table 2 lists the statistical properties of these three datasets.

Table 2. Statistical properties of datasets.

Dataset    Nodes    Edges      Average degree  Directed
Facebook   4,039    88,234     43.7            False
Twitter    81,306   1,768,149  43.5            True
Epinions   75,879   508,837    6.71            True

4.2 Baseline Algorithms

We compare the performance of the proposed HBIM algorithm with several existing algorithms as follows:

Greedy: a greedy algorithm [11] that uses 20,000 Monte-Carlo simulations to evaluate the influence spread.

Degree-Discount: a single degree discount heuristic algorithm [3] based on nodes' out-degree. A node's out-degree decreases by 1 if one of its neighbors is selected as a seed node.

CGA: a community-based greedy algorithm [15], which first detects communities by taking information diffusion into account, and then exploits a dynamic programming algorithm to find influential nodes in these communities.

TIM: an influence maximization algorithm [13] based on state-of-the-art random sampling with theoretical support.

4.3 Parameter Settings

In order to obtain the influence spread of the seed set, we run 20,000 Monte Carlo simulations on the network and take the average of the results as the final influence spread. For the benchmark algorithms, we use the parameter settings given in their papers [3, 11, 13, 15]. For the HBIM algorithm proposed in this paper, we implement the algorithm under the IC model and assign the propagation probability $pp_{uv}$ of the link $(u, v)$ in the following way:

$$pp_{uv} = \min \{ p_{uv}, 0.01 \} \tag{18}$$

Here, $p_{uv}$ denotes the transition probability from node $u$ to node $v$, which is calculated from the first-order proximity and the second-order proximity between $u$ and $v$ according to Eq. (3).

4.4 Experiment Results

Influence Spread. To estimate the influence spread of the HBIM algorithm and the benchmark algorithms, we run 20,000 Monte-Carlo simulations and take the average of all the simulation results as the final influence spread of the seed sets returned by each algorithm. We run the five algorithms on the three datasets to obtain influence spread results with respect to the seed set size K, which increases from 5 to 50 in steps of 5. The results are shown in Fig. 1(a)–(c).

Fig. 1. (a)–(c) Influence spread results for varying seed set size K on the three datasets.

According to Fig. 1(a)–(c), the influence spread increases with K, and HBIM performs well on all three datasets. An important observation is that in the three figures the influence spread of HBIM is comparable to that of Greedy, which indicates that the accuracy of the HBIM algorithm is preserved. Segmenting nodes places nodes with similar social attributes into the same level, which avoids the problem of overlapping influence when finding the seed set and results in a higher accuracy of the influence spread. Although the influence spread of the TIM algorithm is comparable to that of HBIM, TIM has a technical flaw: it must be run again to obtain a smaller seed set than the one it returned the first time, that is, it does not guarantee that the sequence of the seed set follows the order of the influential nodes. CGA detects communities based on label propagation, whose main principle is that the community to which a node belongs is the one that contains the maximum number of its influenced neighbors. The algorithm neglects the influence of common neighbors on the detection of communities; therefore, the accuracy of CGA is not as high as that of HBIM. The influence spread of DegreeDiscount is relatively small on the three datasets because it reduces the time cost at the expense of accuracy. Furthermore, as the scale of the datasets increases, the gap between DegreeDiscount and the other algorithms gradually widens, that is, the scalability of DegreeDiscount is not as good as that of the other algorithms.


Running Time. Figure 2 shows the time cost of four of the algorithms on the three datasets. The running time of Greedy is orders of magnitude larger than that of the other algorithms, especially on large-scale datasets, so we do not show it in the figure. From Fig. 2, we can see that the running time of HBIM is of the same order of magnitude as that of DegreeDiscount and CGA, which shows that HBIM preserves the accuracy while improving the efficiency.

Fig. 2. Running time of different algorithms with seed size K = 50.

5 Conclusions and Future Work

In this paper, we propose HBIM, an influence maximization method based on the hierarchical structure of social network nodes, to mine the top-K influential nodes. HBIM has two main components. One detects information diffusion levels by taking nodes' first-order and second-order proximity into account, and the other uses dynamic programming to select the levels in which to discover seed nodes. Empirical studies on three real-world social network datasets show that our algorithm performs well in both accuracy and efficiency. In addition, it scales well to big networks. In the future, we can take into account semantic mechanisms [12] when detecting information diffusion levels in the first phase of the HBIM algorithm.

Acknowledgments. The research was supported in part by National Basic Research Program of China (973 Program, No. 2013CB329605).

References

1. Chen, F., Li, K.: Detecting hierarchical structure of community members in social networks. Knowl.-Based Syst. 87, 3–15 (2015)
2. Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1029–1038. ACM (2010)
3. Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 199–208. ACM (2009)
4. Domingos, P., Richardson, M.: Mining the network value of customers. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 57–66. ACM (2001)
5. Feige, U.: A threshold of ln n for approximating set cover. J. ACM (JACM) 45(4), 634–652 (1998)
6. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
7. Goyal, A., Lu, W., Lakshmanan, L.V.: Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: 2011 IEEE 11th International Conference on Data Mining (ICDM), pp. 211–220. IEEE (2011)
8. Jung, K., Heo, W., Chen, W.: IRIE: scalable and robust influence maximization in social networks. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 918–923. IEEE (2012)
9. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)
10. Kim, J., Kim, S.K., Yu, H.: Scalable and parallelizable processing of influence maximization for large-scale social networks? In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 266–277. IEEE (2013)
11. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.: Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429. ACM (2007)
12. Razis, G., Anagnostopoulos, I.: Semantifying twitter: the influence tracker ontology. In: International Workshop on Semantic and Social Media Adaptation and Personalization, pp. 98–103 (2014)
13. Tang, Y., Xiao, X., Shi, Y.: Influence maximization: near-optimal time complexity meets practical efficiency. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 75–86. ACM (2014)
14. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM (2016)
15. Wang, Y., Cong, G., Song, G., Xie, K.: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1039–1048. ACM (2010)
16. Zhu, C.S., Zhu, F.X., Yang, X.L., School, C., University, W.: Hierarchical community structure based algorithm for influence maximization. Computer Engineering & Design (2017)

Convolutional Neural Networks in Combination with Support Vector Machines for Complex Sequential Data Classification

Antreas Dionysiou1, Michalis Agathocleous1, Chris Christodoulou1(B), and Vasilis Promponas2

1 Department of Computer Science, University of Cyprus, P.O. Box 20537, 1678 Nicosia, Cyprus
{adiony01,magath06,cchrist}@cs.ucy.ac.cy
2 Department of Biological Sciences, University of Cyprus, P.O. Box 20537, 1678 Nicosia, Cyprus
[emailprotected]

Abstract. Trying to extract features from complex sequential data for classification and prediction problems is an extremely difficult task. Deep Machine Learning techniques, such as Convolutional Neural Networks (CNNs), have been exclusively designed to face this class of problems. Support Vector Machines (SVMs) are a powerful technique for general classification problems, regression, and outlier detection. In this paper we present the development and implementation of an innovative by design combination of CNNs with SVMs as a solution to the Protein Secondary Structure Prediction problem, with a novel two dimensional (2D) input representation method, where Multiple Sequence Alignment profile vectors are placed one under another. This 2D input is used to train the CNNs achieving preliminary results of 80.40% per residue accuracy (Q3), which are expected to increase with the use of larger training datasets and more sophisticated ensemble methods.

Keywords: Convolutional Neural Networks · Support Vector Machines · Deep learning · Machine learning · Bioinformatics · Protein Secondary Structure Prediction

1 Introduction

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 444–455, 2018. https://doi.org/10.1007/978-3-030-01421-6_43

Learning is a many-faceted phenomenon. The learning process includes the acquisition of new declarative knowledge, the development of cognitive skills through instruction and practice, the organization of new knowledge into general, effective representations of data and, finally, the discovery of new theories and facts through practice and experimentation. Analysis of sequential data, feature extraction and prediction through Machine Learning (ML) algorithms/techniques have been extensively studied. Nevertheless, the complexity

and divergence of the big data that exist nowadays keep this field of research open. When designing ML techniques for complex sequential data prediction, one must take into account (a) how to capture both short- and long-range sequence correlations [1], and (b) how to focus on the most relevant information in large quantities of data [2].

A Convolutional Neural Network (CNN) is a class of deep, feedforward artificial neural networks (NNs) that has been successfully applied to analyzing visual imagery [3,4]. CNNs were inspired by the human visual system, where individual cortical neurons respond to stimuli only in a restricted region of the visual field, known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. CNNs have enjoyed great success in large-scale image and video recognition [5]. This has become possible due to large public image repositories, such as ImageNet [3], and high-performance computing systems, such as GPUs or large-scale distributed clusters [6]. Overall, CNNs are in general a good option for feature extraction and for highly complex sequence and pattern recognition problems [3–10].

Support Vector Machines (SVMs) were introduced by Cortes and Vapnik [11], initially for binary classification problems. SVMs are a powerful technique for linearly and non-linearly separable classification problems, regression, and outlier detection, with an intuitive model representation [11].

A challenging task for ML techniques is to make predictions on sequential data that encode high complexity of interdependencies and correlations.
Application examples include problems from Bioinformatics such as Protein Secondary Structure Prediction (PSSP) [12–15]. Even though the three-dimensional (3D) structure of a protein molecule is determined largely by its amino acid sequence, understanding the complex sequence-structure relationship is one of the greatest challenges in computational biology. An ML model designed for such data has to be able to extract relevant features and, at the same time, reveal any long- or short-range interdependencies in the given sequence. The key point to consider when trying to solve the PSSP problem is the complex sequence correlations and interactions between the amino acid residues of a protein molecule. In order to maximize the prediction accuracy of a proposed NN technique for a specific amino acid in a protein molecule, the adjacent amino acids have to be considered by the proposed NN architecture.

In this paper we present a hybrid machine learning method based on the application of CNNs in combination with SVMs, for complex sequential data classification and prediction. The implemented model is then tested on the PSSP problem for 3-state secondary structure (SS) prediction.

2 Methodology

2.1 The CNN Architecture

CNNs are biologically-inspired variants of Multi-Layer Perceptrons (MLPs). The CNN architecture consists of an input layer (inactive), multiple hidden layers and an output layer. Generally speaking, CNNs combine three architectural ideas to


ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and spatial subsampling/pooling [7]. The hidden layers of a CNN typically consist of convolutional layers, pooling layers and fully connected layers. There are four main operations performed by a CNN: (a) convolution, (b) non-linearity (Rectified Linear Unit, ReLU), (c) pooling or subsampling, and (d) classification. One of the major characteristics of CNNs is that they take advantage of the fact that the input is like an “image”, so they constrain the architecture in a more sensible way. Every layer of a CNN transforms one volume of activations to another through a differentiable function. The neurons of a CNN, unlike those of a regular NN, are arranged in 3 dimensions: width, height and depth.

The Convolutional Layer (CL) is the core building block of a CNN and performs the feature extraction process. The key hyperparameter of a CL is the kernel. The kernel is a 2D array initialized with random values, and it is used to compute dot products between the entries of the kernel and the input volume at any position. The stride is another important hyperparameter that defines the amount of sliding of the kernel across the width and height of the input volume. The result of sliding the kernel over the width and height of the input volume is the feature map, a 2D array holding the responses/activations of the kernel at every spatial position. Moreover, the CNNs' ability to handle complex sequential data relies in part on the sparse connections of neurons. More specifically, each neuron is connected to only a local region of the input volume (i.e., its receptive field), and as a result CNNs are capable of encoding complex sequential data correlations in their structure.

The Pooling Layer (PL) is another critical block for building a CNN. Generally speaking, a common technique for constructing a CNN is to insert a pooling layer in-between successive CLs.
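To make the kernel, stride and feature-map terminology concrete, the following plain-Python sketch slides a kernel over a 2D input, applies ReLU, and then pools with the commonly used MAX operation. It is a didactic sketch only: the function names `conv2d`, `relu` and `max_pool` are chosen for this example, no padding is applied, and it does not reflect how Deeplearning4j implements these layers.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            # dot product between the kernel and the receptive field at (top, left)
            row.append(sum(kernel[i][j] * image[top + i][left + j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Element-wise non-linearity: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of every size x size block, moving by `stride`."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]
```

A 4 × 4 input convolved with a 2 × 2 kernel at stride 1 gives a 3 × 3 feature map, and pooling it with a 2 × 2 window at stride 1 gives a 2 × 2 output, which shows how each stage shrinks the representation.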
The main purpose of a pooling layer is to (a) reduce the representation size, (b) reduce the amount of computation in the NN, and (c) control overfitting. The PL uses a filter of a certain dimension and spatially resizes the given input by striding the filter across the input volume and usually performing the MAX operation. The last layer of a CNN is usually a fully-connected Softmax output layer. Nevertheless, this final step can be practically realized with any suitable classifier. In particular, a small advantage was reported when the softmax output layer of a CNN was replaced by a linear SVM [16]. In this work, the libraries used for the CNN and SVM implementations are Deeplearning4j (https://deeplearning4j.org) and LibSVM [17] with a Scikit-learn front-end (http://scikit-learn.org), respectively.

2.2 Data Representation

As mentioned above, CNNs are capable of analyzing image-like inputs. The major obstacle in trying to solve a complex sequential data classification problem with CNNs is the representation of the data in such a way that the network is able not only to understand the shape of the input volume, but also to track the complex sequence correlations within the input volume. Transforming the shape of the sequential data so as to make it look like an “image” allows CNNs to capture the complex sequence-structure relationship, including modeling the SS


interactions among adjacent or distant amino acid residues in the PSSP problem. Along these lines, we reorganised the input data shape so that the vectors of each sample in the sequential data are placed one under another, creating an “image-like” input that can be read correctly and understood by the CNN. In particular, for PSSP we have created a new input volume by placing Multiple Sequence Alignment (MSA) [18] profile vectors of each amino acid one under another to construct a 2D representation of the MSA profiles of a certain number of neighbouring amino acid residues (Fig. 1). By sliding the kernel over the newly constructed input volume, CNNs are able to perform feature extraction for each data record, but also to consider neighboring correlations and interactions, if any exist. Note that, unlike other techniques, the attention given to any neighboring record correlations is equally weighted across the whole input volume, for each sample given. This lets the CNN discover and capture any short-, mid- and long-range correlations among the input records and consider them all equally in terms of the output volume created. One of the major contributions of this paper is this innovative input data representation, especially designed for the complex sequential data of the PSSP problem.

Fig. 1. Example of the data representation method: an input sample using a window size of 15 amino acids. Each line represents the MSA profile vector of a specific amino acid. The SS label of the example input sample shown in this figure is the SS label of the middle amino acid.

2.3 Application Domain and Data

High quality datasets for training and validation purposes are a prerequisite when trying to construct useful prediction models [2]. Therefore, we have chosen PSSP, a well-known bioinformatics problem, which is characterized by the complexity of the correlations between the data records due to the existence of combinations of short-, mid- and long-range interactions. PSSP, which is based on the Primary Structure (PS) of a protein molecule, is considered to be an important problem, since the SS can be seen as a low-resolution snapshot of a protein's 3D structure, and can thus shed light


on its functional properties and assist in many other applications like drug and enzyme design. As mentioned above, understanding the complex sequence-structure relationship is one of the greatest challenges for the PSSP problem. Since the currently known experimental methods for determining the 3D structure of a protein molecule are expensive, time consuming and frequently inefficient [12], different methods and algorithms for predicting the secondary structure of a protein molecule have been developed [8,12,14,15,19,20]. In particular, Recurrent Neural Networks (RNNs) were successful in the PSSP problem [20], as their architecture may capture both short- and long-range interactions needed for PSSP. CNNs, though, can detect and extract high complexity features from an input sequence and at the same time track any short-, mid- or long-range interactions depending on the window size. Thus we decided to use CNNs in combination with our novel data representation method for the PSSP problem.

A protein is typically composed of 20 different amino acid types which are chemically connected to form a polypeptide chain, folding into a 3D structure by forming any-range interactions. There are eight main SS states that each amino acid can be assigned to when a protein 3D structure is available, which are typically grouped in three classes, namely Helix (H), Extended (E) and Coil/Loop (C/L), with different geometrical and hydrogen-bonding properties. In this work, we use CB513 [19], a non-redundant dataset containing 513 proteins that has been heavily used as a benchmark for the PSSP problem, excluding eight proteins with names 1coiA 1-29, 1mctI 1-28, 1tiiC 195-230, 2erlA 1-40, 1ceoA 202-254, 1mrtA 31-61, 1wfbB 1-37 and 6rlxC -2-20, due to corrupted MSA profiles.
The use of MSA profiles enhanced the performance of PSSP ML algorithms, since they incorporate information of homologous sequences, which may facilitate the detection of subtle, yet important, patterns along the sequences [14]. In particular, for representing each protein sequence position, we use a 20-dimensional vector, which corresponds to the frequencies of 20 different amino acid types as calculated from a PSI-BLAST [21] search against the NCBI-NR (NCBI: https://www.ncbi.nlm.nih.gov/) database. Note that we have also performed an experiment on a much larger dataset, namely PISCES [22], which shows promising results.

2.4 Support Vector Machines (SVMs)

The main idea behind SVMs is that the input vectors are non-linearly mapped to a higher dimensional feature space using an appropriate kernel function, with the hope that a linearly inseparable problem in the input space becomes linearly separable in the new feature space, i.e., a linear decision surface can be constructed [23]. An important advantage of SVMs is that the search for the decision surface that maximizes the margin among the target class instances ensures high generalization ability of the learning machine [24]. Their robust performance with respect to sparse and noisy data makes them a good choice in a number of applications from text categorization to protein function prediction [25]. Moreover, SVMs were shown to be the best technique for filtering on the PSSP problem [13]. Given this, we decided to test the filtering capabilities of SVMs

CNNs in Combination with SVMs, for Sequential Data Classiﬁcation


on the CNNs’ SS prediction results, to see whether the accuracy is improved, and correct the predicted SS of a protein molecule gathered from an ensemble of CNNs.

3 Results and Discussion

3.1 Optimising the Parameters

The CNN implementation using the innovative input data representation described in Sect. 2.2 has been used and tested on the PSSP problem. To train the CNN, we have used the already mentioned CB513 dataset. More speciﬁcally, the model’s input was a combination of a certain number of neighboring amino acids MSA proﬁle record vectors, one under another, forming a 2D array. The target output label was the SS class for the middle point amino acid that had been examined. A single CNN has been trained each time. We have decided to track the optimal hyperparameter values using a speciﬁc fold after dividing CB513 dataset into ten (10) folds. The main reason for optimizing the hyperparameters on a speciﬁc fold is the small size of CB513 dataset. Accuracy results using diﬀerent hyperparameter values on the other folds are not expected to vary considerably. During this phase, multiple experiments were performed in order to tune up our model and ﬁnally achieve the highest results using the CNN. These were Q3 of 75.155% and Segment OVerlap (SOV [30]) of 0.713. CNNs with diﬀerent numbers of CLs, PLs, kernel sizes, strides, number of parallel ﬁlters in each CL, and Gradient Descent (GD) optimization algorithms (Fig. 2) have been tested for optimising the parameter values. The optimization algorithms used are: Gradient Descent (GD), Gradient Descent with momentum (GD with momentum), Adaptive Gradient Algorithm (AdaGrad) [26], RMSprop [27], AdaDelta [28], Adaptive Moment Estimation (Adam) [29]. The two most critical hyperparameters that showed a big impact on the results are: (a) the optimization method used and (b) the number of neighboring amino acids to be considered in each sample (window size). More speciﬁcally, the parameter W is the number of total amino acids to be considered by the CNN when trying to predict the SS of the ﬂoor(W/2) + 1 amino acid. Then, according to the W parameter we reconstruct the input sample so as to become a 2D array with shape W × 20. 
The results are shown in Fig. 3. Unlike the method of Wang et al. [8], which uses 42 input features for each residue in a one-dimensional input vector format, we use 20 × W input features (20 input features for each amino acid × window size) for each residue in a two-dimensional input format, where each row represents the MSA profile of the amino acid at a specific position. Generally speaking, the 42 input features used by Wang et al. [8] include our 20 input features (the MSA profile of each amino acid) plus 22 extra input features for each amino acid. In this way, our method reduces the dimensionality of the problem without losing too much important information. Moving forward, we had to tune up the parameters that determine the network's architecture.
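The W × 20 input construction described above can be illustrated with a minimal sketch. It assumes that each residue is given as a 20-dimensional MSA profile vector and that sequence ends are padded with zero vectors; the padding scheme and the name `make_windows` are illustrative assumptions, not details stated in the paper.

```python
def make_windows(profile, window=15):
    """Build one W x 20 sample per residue from a per-residue MSA profile.

    profile: list of 20-dimensional frequency vectors, one per residue.
    The label of each sample would be the SS class of the residue in the
    centre row, i.e. the (floor(W/2) + 1)-th amino acid of the window.
    Zero padding at the sequence ends is an assumption of this sketch.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so that a centre residue exists")
    half = window // 2
    pad_row = [0.0] * len(profile[0])
    padded = [pad_row] * half + list(profile) + [pad_row] * half
    # Window i is centred on residue i of the original sequence.
    return [padded[i:i + window] for i in range(len(profile))]


# Toy profile with 30 residues; residue i has all 20 frequencies set to i.
toy_profile = [[float(i)] * 20 for i in range(30)]
samples = make_windows(toy_profile, window=15)
print(len(samples), len(samples[0]), len(samples[0][0]))
```

Each returned sample is a list of W rows with 20 entries, matching the W × 20 shape used as CNN input.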


A. Dionysiou et al.

Fig. 2. Optimizers: CNNs Q3 accuracy results using diﬀerent Gradient Descent (GD) optimization algorithms.

Fig. 3. Window Size: CNNs Q3 accuracy results with diﬀerent window (W) sizes.

To get a general idea about the CNN performance we have trained it using the CB513 dataset. After tuning up the network architecture, the following optimal CNN parameter values resulted: (a) Number of convolutional layers: 3, (b) Number of Pooling Layers: 0, (c) Kernel/Filter size: 2 × 2, (d) Stride: 1, (e) Number of Parallel Filters per Layer: 5, (f) Neurons Activation Function: Leaky ReLU, and (g) Optimization method: Gradient Descent with momentum = 0.85.
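As a rough check of what these settings imply, the following sketch computes the feature-map size after three convolutional layers with 2 × 2 kernels and stride 1 on a 15 × 20 input (W = 15, the window size used in this work). It assumes no padding ("valid" convolution), which the text does not specify; the helper name `conv_output_shape` is hypothetical.

```python
def conv_output_shape(height, width, kernel=2, stride=1, layers=3):
    """Spatial size after stacking `layers` unpadded convolutions.

    Assumes 'valid' convolutions (no padding), an assumption of this sketch.
    """
    for _ in range(layers):
        height = (height - kernel) // stride + 1
        width = (width - kernel) // stride + 1
    return height, width


h, w = conv_output_shape(15, 20)   # W = 15 rows, 20 profile columns
filters = 5                        # parallel filters per layer
print(h, w, h * w * filters)       # spatial size and number of extracted features
```

Under these assumptions each layer trims one row and one column, so the last layer yields a 12 × 17 map per filter, or 1020 values over the 5 parallel filters.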


The number of neighboring amino acids (W) that led to some of the highest Q3 results while limiting the complexity of the information being used (i.e., minimizing the window) was 15. Moreover, no significant change in Q3 accuracy was noticed using larger window (W) sizes (Fig. 3). Based on the results, we concluded that (i) smaller W values do not provide enough information to the network regarding the interactions between adjacent amino acids, and (ii) larger W values contain too much (partly unnecessary) information for the network to handle and decode properly. We did not use pooling layers in our CNN architecture because subsampling the features gathered from the CNN is not relevant in the PSSP problem. Taking only the maximum value of a spatial domain does not work in PSSP, as every value extracted from the CLs may represent interactions of amino acids in a certain region. These are the most important factors that lead to low Q3 and SOV results when using PLs.

3.2 10-Fold Cross-Validation on CB513

In order to validate the robustness of the model as well as to prove its efficiency to the exposure of various training and testing data, we had to complete the evaluation of the PSSP problem on the CB513 dataset, using a 10-fold cross-validation test. All the experiments made are with the optimal parameters of the model as described in Sect. 3.1. As shown in Table 1, the Q3 and SOV accuracy results of CNN with 10-fold cross-validation are 75.15% and 0.713 respectively.

Table 1. Summary of the results for all methods.

Method                 Q3 (%)   QH (%)   QE (%)   QL (%)   SOV     SOVH    SOVE    SOVL
CNN                    75.155   69.474   67.339   84.566   0.713   0.696   0.669   0.734
CNN Ensembles          78.914   72.748   68.854   85.385   0.744   0.738   0.722   0.737
CNN Ens. + ER Filt.    78.692   70.147   66.921   87.053   0.756   0.669   0.713   0.731
CNN Ens. + SVM Filt.   80.40    80.911   70.578   85.165   0.736   0.724   0.716   0.743

3.3 Ensembles and External Rules Filtering

After tracking the optimal parameters for the CNN, we have performed six (6) experiments for each fold. Then, in an attempt to maximize the quality of the results gathered as well as to increase the Q3 and SOV accuracy, we proceeded with using the winner-take-all ensembles technique [31,32] on every single fold separately. This technique obtains the predictions of a number of same ML model experiments, and applies the winner takes all method on each amino acid residue SS class predicted. The dramatically improved results are shown in Table 1. Filtering the SS prediction using external empirical rules is usually the last step made, as a ﬁnal attempt to improve the quality of the results. This is accomplished by removing conformations that are physicochemically unlikely to


happen [15]. Applying the external rules filtering on the CNN's SS prediction, interestingly, does not improve the Q3 score, but it improves the SOV. The results are shown in Table 1.

3.4 Filtering Using Support Vector Machines (SVMs)

CNNs showed very good results on the PSSP (Figs. 2, 3 and Table 1). Nevertheless, as mentioned above, we tried to use SVMs to perform the filtering task. More specifically, after gathering the predictions from the CNN we have trained an SVM using a window of SS states predicted by the CNN. After performing several experiments using different kernels, misclassification penalty parameters (C) [11], Gamma values (G) [11] and window sizes (WIN), we selected the SVM parameters that lead to the highest Q3 and SOV accuracy on the PSSP problem: (a) Kernel: Radial Basis Function, (b) C = 1, (c) G = 0.001 and (d) WIN = 7. The results are shown in Tables 2 and 3.

3.5 Summary of the Results

The results shown in Table 1 summarize the Q3 accuracy and SOV results gathered with all the methods discussed in this paper, using 10-fold cross-validation. It is shown that the CNN can achieve relatively high Q3 and SOV results (75.155% and 0.713 respectively) on its own. Nevertheless, the CNN using ensembles improved the Q3 accuracy by approximately 3.8 percentage points (to 78.914%) and the SOV score by 0.031 (to 0.744). Moving on, filtering the results using the External Rules mentioned above decreases the overall Q3 accuracy to 78.692%, but increases the SOV score from 0.744 to 0.756. This was expected, as filtering with External Rules has previously been reported to improve SOV scores while decreasing the overall Q3 accuracy [12]. Finally, using the combination of CNN ensembles and SVM as a filtering technique achieves the highest Q3 accuracy (80.40%). The Q3 values for different folds vary from 78.96% to 83.91% and the SOV from 0.71 to 0.78 (Table 2). This indicates that the results for different folds are of comparable quality. Moreover, the accuracies for the three classes, H, E, L, are calculated separately (see QH, QE, QL and SOVH, SOVE, SOVL in Table 2) for getting deeper insight into the quality of the classifier, and mispredictions are quantified in a confusion matrix, graphically represented in Fig. 4. As we can see from Table 2, the Q3 accuracy obtained using CNN Ensembles and SVM filtering is just over 80%, which is considered to be a high enough percentage when it comes to PSSP, and which also makes this combination of ML techniques a good option when it comes to complex sequential data classification and prediction problems. The method of Heffernan et al. [20] achieves 84.16% Q3 accuracy using Bidirectional Recurrent Neural Networks without using a window, but these results are not directly comparable with ours, as they make use of a much larger dataset that contains 5789 proteins, compared to CB513 which contains 513 proteins.
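The two post-processing steps summarised here (winner-take-all ensembling, Sect. 3.3, and a window of predicted SS states as SVM input, Sect. 3.4) can be sketched as follows. This is only an illustration: the tie-breaking rule, the one-hot encoding of the window and the zero padding at sequence ends are assumptions, and the function names are hypothetical. Training the RBF-kernel SVM itself (C = 1, G = 0.001) is not shown.

```python
from collections import Counter

STATES = "HEC"  # Helix, Extended, Coil/Loop


def winner_take_all(predictions):
    """Majority vote per residue over the predictions of several CNN runs.

    predictions: equal-length strings over STATES, one per trained CNN.
    Ties go to the state seen first in that column (an assumption).
    """
    voted = [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]
    return "".join(voted)


def filter_features(predicted, win=7):
    """One feature row per residue: one-hot encoded window of predicted states.

    Positions outside the sequence are encoded as all zeros (an assumption).
    """
    half = win // 2
    padded = "-" * half + predicted + "-" * half
    rows = []
    for i in range(len(predicted)):
        row = []
        for state in padded[i:i + win]:
            row.extend(1.0 if state == s else 0.0 for s in STATES)
        rows.append(row)
    return rows


ensemble = winner_take_all(["HHEC", "HHCC", "HECC"])
features = filter_features(ensemble, win=7)
print(ensemble, len(features), len(features[0]))
```

With WIN = 7 and three classes, each residue is described by 21 values; these rows, paired with the true SS class of the centre residue, would form the training set of the filtering SVM.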
As a conclusion to all the results presented in this paper, we can see that the CNNs can eﬀectively detect and extract features from complex sequential data,


Table 2. CNN Ensembles and SVM Filtering: Q3 and SOV Results for each Fold.

Fold   Q3 (%)   QH (%)   QE (%)   QL (%)   SOV     SOVH    SOVE    SOVL
1      79.69    79.77    70.05    84.75    0.74    0.73    0.71    0.75
2      79.74    78.69    68.06    86.77    0.73    0.73    0.71    0.74
3      78.96    78.64    68.27    84.94    0.72    0.71    0.71    0.73
4      79.55    79.09    67.89    86.12    0.71    0.72    0.70    0.73
5      79.26    78.55    70.00    84.79    0.73    0.72    0.73    0.72
6      79.70    80.27    70.18    84.31    0.73    0.71    0.72    0.73
7      79.64    79.85    68.87    85.26    0.73    0.73    0.71    0.74
8      83.70    87.68    76.86    83.91    0.76    0.73    0.71    0.77
9      83.91    87.53    76.33    84.62    0.78    0.75    0.74    0.79
10     79.85    79.04    69.27    86.18    0.73    0.71    0.72    0.73
Avg.   80.40    80.91    70.57    85.16    0.736   0.724   0.716   0.743

Table 3. CNN Ensembles and SVM Filtering: Statistical Analysis

                                     Q3       SOV
Sample standard deviation (s)        1.8140   0.0141
Sample variance (s²)                 3.2906   0.0002
Mean (Average)                       80.4     0.736
Standard error of the mean (SEx̄)     0.5736   0.0044
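As a worked check, the Q3 column of Table 3 can be recomputed from the ten per-fold Q3 values listed in Table 2 using the sample (n − 1) formulas. The sketch below uses only the Python standard library; the variable names are illustrative, and only the Q3 column is recomputed here.

```python
import math
import statistics

# Per-fold Q3 values (%) in the order they appear in Table 2.
q3_per_fold = [79.69, 79.74, 78.96, 79.55, 79.26,
               79.70, 79.64, 83.70, 83.91, 79.85]

mean_q3 = statistics.mean(q3_per_fold)          # arithmetic mean
variance_q3 = statistics.variance(q3_per_fold)  # sample variance (n - 1)
std_q3 = statistics.stdev(q3_per_fold)          # sample standard deviation
sem_q3 = std_q3 / math.sqrt(len(q3_per_fold))   # standard error of the mean

print(f"mean={mean_q3:.2f} s={std_q3:.4f} s^2={variance_q3:.4f} SE={sem_q3:.4f}")
```

Rounded to four decimals, these values agree with the Q3 entries reported in Table 3.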

Fig. 4. Confusion Matrix: Predictions and mispredictions of the secondary structure classes H, E and C/L after applying ensembles on each fold using CB513 dataset. Q3 accuracy scores are shown for each class.


by utilizing our proposed “image”-like data representation method used to train the CNNs for the PSSP problem. This is due to the fact that our CNN architecture was designed specifically for such problems. In addition, SVMs seem to be a good technique for filtering the CNN output. The combination of these two ML algorithms seems to be a good option for complex feature extraction and prediction on sequential data, as we take advantage of the benefits of both techniques. Finally, by observing the confusion matrix of Fig. 4, we can conclude that the combination of CNNs with SVM filtering is a robust and high quality methodology and architecture, as it maximizes the correct predictions for each SS class. Results are expected to improve by collecting more experiments for each fold, using larger datasets (e.g., PISCES) and deploying more sophisticated ensemble techniques.

References

1. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)
2. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1–2), 245–271 (1997)
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 1097–1105. Curran Associates, Lake Tahoe, Nevada, Red Hook, NY (2012)
4. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29(9), 2352–2449 (2017)
5. Srinivas, S., Sarvadevabhatla, R.K., Mopuri, K.R., Prabhu, N., Kruthiventi, S.S., Babu, R.V.: A taxonomy of deep convolutional neural nets for computer vision. Front. Robot. AI 2, 36 (2016)
6. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
7. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks, pp. 255–258. MIT Press, Cambridge (1998)
8. Wang, S., Peng, J., Ma, J., Xu, J.: Protein secondary structure prediction using deep convolutional neural fields. Sci. Rep. 6, 18962 (2016)
9. Bluche, T., Ney, H., Kermorvant, C.: Feature extraction with convolutional neural networks for handwritten word recognition. In: Proceedings of the 12th IEEE International Conference on Document Analysis and Recognition, pp. 285–289 (2013)
10. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, pp. 6645–6649 (2013)
11. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
12. Baldi, P., Brunak, S., Frasconi, P., Soda, G., Pollastri, G.: Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15(11), 937–946 (1999)
13. Kountouris, P., Agathocleous, M., Promponas, V.J., Christodoulou, G., Hadjicostas, S., Vassiliades, V., Christodoulou, C.: A comparative study on filtering protein secondary structure prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 9(3), 731–739 (2012)
14. Rost, B., Sander, C.: Combining evolutionary information and neural networks to predict protein secondary structure. Proteins: Struct. Funct. Bioinform. 19(1), 55–72 (1994)
15. Salamov, A.A., Solovyev, V.V.: Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247(1), 11–15 (1995)
16. Tang, Y.: Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013)
17. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)
18. Wallace, I.M., Blackshields, G., Higgins, D.: Multiple sequence alignment. Curr. Opin. Struct. Biol. 15(3), 261–266 (2005)
19. Cuff, J.A., Barton, G.J.: Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Struct. Funct. Bioinform. 34(4), 508–519 (1999)
20. Heffernan, R., Yang, Y., Paliwal, K., Zhou, Y.: Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33(18), 2842–2849 (2017)
21. Schaffer, A.A., et al.: Nucl. Acids Res. 25, 3389–3402 (1997)
22. Wang, G., Dunbrack Jr., R.L.: PISCES: a protein sequence culling server. Bioinformatics 19(12), 1589–1591 (2003)
23. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
24. Meyer, D., Wien, F.T.: Support vector machines. R News 1(3), 23–26 (2001)
25. Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D.: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10), 906–914 (2000)
26. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
27. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp, Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. 4(2), 26–31 (2012)
28. Zeiler, M.D.: ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701 (2012)
29. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: Suthers, D., Verbert, K., Duval, E., Ochoa, X. (Eds.) Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), Leuven, Belgium, pp. 1–13. ACM, New York, NY, USA (2015)
30. Rost, B., Sander, C., Schneider, R.: Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235(1), 13–26 (1994)
31. Granitto, P.M., Verdes, P.F., Ceccatto, H.A.: Neural network ensembles: evaluation of aggregation algorithms. Artif. Intell. 163(2), 139–162 (2005)
32. Fukai, T., Tanaka, S.: A simple neural network exhibiting selective activation of neuronal ensembles: from winner-take-all to winners-share-all. Neural Comput. 9(1), 77–97 (1997)

Classification of SIP Attack Variants with a Hybrid Self-enforcing Network

Waldemar Hartwig1(✉), Christina Klüver1, Adnan Aziz2, and Dirk Hoffstadt2

1 Computer Based Analysis of Social Complexity, University of Duisburg-Essen, 45117 Essen, Germany
2 Computer Networking Technology Group, University of Duisburg-Essen, 45141 Essen, Germany
{Adnan.Aziz,Dirk.Hoffstadt}@uni-due.de

Abstract. The Self-Enforcing Network (SEN), a self-organized learning neural network, is used to analyze SIP attack traffic to obtain classifications for attack variants that use one of four widely used User Agents. These classifications can be used to categorize SIP messages regardless of User-Agent field. For this, we combined SEN with clustering methods to increase the amount of traffic that can be handled and analyzed; the attack traffic was observed at a honeynet system over a month. The results were multiple categories for each User Agent with a low rate of overlap between the User Agents.

Keywords: Self-Enforcing Network · SEN · VoIP · Session initiation protocol · SIP · Misuse · Fraud · Reference type · Clustering

1 Introduction

Voice over IP (VoIP) systems enable advanced communication (such as voice or video) over the Internet and other data networks and are therefore replacing the traditional phone infrastructures. Nowadays, VoIP is widely used in organizations, companies, and private environments, as it has the advantages of flexibility and low costs. Many existing devices and applications use standardized VoIP protocols (e.g. SIP for signaling [1] or Real-Time Transport Protocol (RTP) for media transmission [2]). SIP is a text-based application layer protocol similar to File Transfer Protocol (FTP) used to establish, maintain and terminate multimedia sessions between User Agents (UA). The SIP communication uses a request-response protocol, i.e., the source sends a SIP request message and receives a SIP response message. SIP is an inherently stateful protocol and uses the HyperText Transfer Protocol (HTTP) Digest Authentication for user authentication [3]. In its simplest form SIP uses the transport protocol User Datagram Protocol (UDP), but others can also be used, e.g., Transmission Control Protocol (TCP) or Stream Control Transmission Protocol (SCTP).

© Springer Nature Switzerland AG 2018 V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 456–466, 2018. https://doi.org/10.1007/978-3-030-01421-6_44

This high availability of SIP-based VoIP systems has lured attackers to misuse the VoIP systems. The SIP servers, particularly if they are accessible from external

Classiﬁcation of SIP Attack Variants with a Hybrid SEN


networks, are subject to fraudulent registration attempts as a prerequisite for calls via compromised SIP accounts. This is extremely attractive for attackers because they can gain immediate financial benefit by making toll calls (international, cellular, premium services) via third-party accounts. This attack is called Toll Fraud and can cause the account owner substantial financial damage in a very short time.

Accordingly, several anti-fraud and anti-phishing techniques have been applied, consisting of rule-based approaches, supervised and unsupervised methods, as well as hybrid techniques (for an overview see [4]). It has also become very prevalent to perform denial-of-service (DoS) attacks at the application level due to the increased code complexity and modular nature of the Internet. Elsabagh et al. [5] have proposed a practical system, Cogo, for early detection and mitigation of software DoS attacks. Cogo recognizes the future exhaustion of resources by employing Probabilistic Finite Automata (PFA) on the network I/O events, modeled in linear time fashion. Manunza et al. [6] have presented a rule-based real-time fraud detection system for VoIP networks, Kerberos, that is highly dependent on an Online Charging System, which generates events associated with setup, evolution, and termination of calls in the VoIP network. Kerberos uses these events to identify patterns associated with the malicious use of the resources. Vennila et al. [7] have proposed a 2-tier model to protect users from spam over Internet telephony (SPIT) calls. This 2-tier model is based on stochastic models, Markov Chain (MC) and incremental support vector machine (ISVM). Aziz et al. [8, 9] have used a Honeynet System to capture the SIP attack traffic to analyze the attacker behavior. This approach is useful in scenarios where it is not possible to access the user's data due to security policies of the country.

In this paper, the goal is to identify an unknown number of attack patterns for four User Agents. For this purpose, a hybridization of the self-organized learning neural network, namely the Self-Enforcing Network, and a modified Single Linkage (MSL) clustering algorithm is used to analyze attack traffic at honeynet systems. The remainder of this paper is organized as follows: Sect. 2 gives a brief overview of SIP, the Toll Fraud attack and four dominant attack tools recorded at the honeynet systems. An overview of the artificial neural network, Self-Enforcing Network (SEN), is given in Sect. 3, followed by the presentation of the sequential clustering in Sect. 4, where the organization of the data with SEN is augmented with other clustering algorithms. In Sect. 5 the analysis of the SIP attack data is covered. Finally, Sect. 6 concludes the paper.

2 SIP and Attack-Tools

SIP is a signaling protocol used to establish, modify and terminate multimedia sessions in IP-based networks. It supports a number of messages for different purposes. For this paper, the following SIP messages are relevant: The User Agent (UA) (i.e., SIP device) uses the REGISTER method to register its location with the SIP server. During this process, the UA sends credentials (username and password) to the SIP server. After successful registration, the UA can initiate calls using INVITE messages. The OPTIONS


W. Hartwig et al.

messages allow a UA to query a server's capabilities and to discover information about the supported SIP methods, extensions, codecs, etc. without establishing a session.

The Toll Fraud attack comprises the following four stages:

1. SIP Server & Device Scan. An attacker can use OPTIONS packets to “ping” any single IP address or whole subnets in order to identify SIP devices, because the SIP protocol requires every SIP device to answer OPTIONS packets. Even if a UA's SIP stack implementation is not standard compliant, the attacker can instead use REGISTER requests to identify SIP devices.
2. Extension Scan. To identify active extensions (user accounts) of known SIP servers, the attacker tries to register at several extensions, typically without using a password. An extension identifier consists of digit sequences and/or strings. If the extension exists, the server normally answers with a 401 UNAUTHORIZED, because no password is given. If it does not exist, a 404 NOT FOUND is returned. The result of this attack stage is a complete list of existing extensions (provider accounts).
3. Registration Hijacking. To register for a given extension, the attacker tries to guess the password by sending – possibly many – REGISTER messages with different passwords to a specific extension. If a valid password is found, the information is stored by the attacker and used later on as the credentials to register at this extension.
4. Toll Fraud. The term multi-stage “toll fraud” is used if a person generates costs (toll) by misusing a hijacked extension using the VoIP functionality to make calls, specifically international calls or calls to premium numbers. Another motivation to use a hijacked account is to obfuscate the caller identity. In terms of SIP messages, the attacker first sends a REGISTER message with the correct password. After the “200 OK” message from the server, the attacker can initiate calls by using INVITE messages.
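From the point of view of someone analysing recorded traffic, the response pattern described for the Extension Scan stage can be interpreted with a few lines of code. The sketch below is a minimal illustration that assumes the responses are already available as a mapping from extension identifier to SIP status code; the function names and the "unknown" category are hypothetical.

```python
def classify_extension(status_code):
    """Interpret the response to a password-less REGISTER request.

    401 UNAUTHORIZED -> the extension exists (a password is required).
    404 NOT FOUND    -> the extension does not exist.
    Any other code is treated as inconclusive (an assumption of this sketch).
    """
    if status_code == 401:
        return "exists"
    if status_code == 404:
        return "missing"
    return "unknown"


def revealed_extensions(responses):
    """Extensions that a scan with these recorded responses would reveal."""
    return sorted(ext for ext, code in responses.items()
                  if classify_extension(code) == "exists")


recorded = {"100": 401, "101": 404, "102": 401, "103": 403}
print(revealed_extensions(recorded))
```

Such a view is useful when grouping recorded requests by attack stage, since a long run of REGISTER requests answered with 401 or 404 for different extensions is characteristic of this stage.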
The first three stages (1–3) of multi-stage Toll Fraud can be executed, either completely or partially, by using paid or freely available tool suites. Some commonly used tools are SIPVicious, SIPCli, VAXSIPUserAgent, and Random user agent (RUA). SIPVicious contains several small programs: The first one is a SIP scanner called “svmap”. It scans an IP address range for SIP devices, either sequentially or in random order, typically with OPTIONS packets. SIPVicious also provides tools to find active SIP accounts with REGISTER messages (“svwar”) and to crack passwords (“svcrack”). If not modified, SIPVicious identifies itself as UA “friendly-scanner”. SIPCli is a Windows-based command line tool, which usually sends only INVITE packets and is capable of performing all four stages of the multi-stage Toll Fraud attack. VaxSIPUserAgent [9] is another tool used to perform Toll Fraud attacks. It sends REGISTER and INVITE packets. The RUA tool [10] sends OPTIONS packets only; therefore, it performs Server Scans only. To analyze the behavior of different attackers and attack tools used to perform the Toll Fraud attacks, it is necessary to inspect the attack traffic. Due to data protection laws in Germany, it is not possible to access the user data from VoIP service providers. The Computer Network and Technology Group (TdR) of the University of Duisburg-Essen has implemented a honeynet system [11] to capture the SIP traffic. The traffic destined to this honeynet system is by default attack traffic, as it does not contain any


legitimate user in it. The attack traffic is stored in a MySQL database using a tool, SIP Trace recorder (STR) [12]. The STR then performs some statistical analysis on the captured attack traffic, e.g., number of requests per day, number of requests per attack tool, clustering the requests with respect to different stages of the multi-stage Toll Fraud attack, etc. The clustering of SIP requests performed by STR is based on MySQL queries. No machine learning techniques or clustering algorithms were used to group the SIP requests into the different stages of multi-stage Toll Fraud attacks.

3 The Self-enforcing Network

The Self-Enforcing Network (SEN) is a self-organized learning neural network, developed by the Research Group “Computer-Based Analysis of Social Complexity” (CoBASC). In this section, the functionalities that are relevant to the study are briefly presented. More in-depth descriptions of the SEN are found in e.g. [13–15]. The data (objects and attributes) are represented in a “semantical matrix” where the rows represent the objects o and the columns represent the attributes a. A value of the matrix $w_{ao}$ represents the degree of affiliation of an attribute to an object. In this case, the values of the semantical matrix are the encoded fields of the SIP request messages monitored by the honeynet system. The encoding is explained in Sect. 5 below. The training of the network is done by transforming the min-max normalized values of the semantical matrix (interval [−1.0, 1.0] or [0.0, 1.0] depending on the attribute) into the weight matrix of the network with the following learning rule:

\[ w(t+1) = w(t) + \Delta w, \quad \text{and} \quad \Delta w = c \cdot w_{ao} \tag{1} \]

where c is a constant usually defined as $0 \le c \le 1$ with the same purpose as the learning rate in standard neural networks. For the analysis of real data, the “cue validity factor” (cvf) is introduced, which is a measure of how important an attribute is for the membership in each category [16]. The cvf allows to exclude (cvf = 0.0) or to dampen (0.0 < cvf < 1.0) certain attributes to steer the formation of clusters. Equation (1) then becomes

\[ \Delta w = c \cdot w_{ao} \cdot cvf_a \tag{2} \]

For the activation function the Enforcing Activation Function (EAF) was used:

\[ a_j = \sum_{i=1}^{n} \frac{w_{ij}\, a_i}{1 + w_{ij}\, a_i} \tag{3} \]

$a_j$ is the activation value of the receiving neuron j, $a_i$ is the activation value of the sending neuron i and $w_{ij}$ is the respective weight value (= $w_{ao}$). In this study, the topology of SEN can be seen as a two-layered network with a feed-forward topology with the attributes as input neurons and the objects as output neurons.
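The learning rule and the activation function can be put together in a short sketch. It follows Eqs. (1) to (3) as printed here and makes several assumptions that the text does not state: weights start at zero, the semantical matrix is stored with one row per object, and the example is restricted to non-negative values so that the denominator of the EAF cannot vanish. The names `train_weights` and `eaf` are illustrative.

```python
def train_weights(semantic_matrix, c=0.1, cvf=None, steps=1):
    """Apply w(t+1) = w(t) + c * w_ao * cvf_a for a number of learning steps.

    semantic_matrix[o][a]: normalized affiliation of attribute a to object o.
    Initial weights of 0.0 are an assumption of this sketch.
    """
    n_attributes = len(semantic_matrix[0])
    if cvf is None:
        cvf = [1.0] * n_attributes  # every attribute fully considered
    weights = [[0.0] * n_attributes for _ in semantic_matrix]
    for _ in range(steps):
        for o, row in enumerate(semantic_matrix):
            for a, value in enumerate(row):
                weights[o][a] += c * value * cvf[a]
    return weights


def eaf(object_weights, inputs):
    """Enforcing Activation Function, Eq. (3), for one output neuron."""
    return sum(w * x / (1.0 + w * x) for w, x in zip(object_weights, inputs))


# Two learned objects described by two attributes, then one new input vector.
semantic_matrix = [[1.0, 0.0],
                   [0.0, 1.0]]
weights = train_weights(semantic_matrix, c=0.1, steps=1)
new_input = [1.0, 0.0]
activations = [eaf(row, new_input) for row in weights]
print(activations)
```

The new input activates the first object most strongly, which is how a new SIP request would be ranked against the learned objects.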


After the learning process is finished, new input vectors (new SIP request messages) with the same attribute names can be inserted and classified. These input vectors are compared with the learned objects in two different ways: by the highest activated neuron (ranking), and by the smallest difference (distance) between the input vector and the learned vectors from the semantical matrix. SEN offers several methods to visualize the results, allowing for a fast interpretation. For the analysis, primarily the “map visualization” was used. It maps the objects to a 2D plane according to the Euclidean distance between two objects. Similar objects are moved closer to one another while dissimilar objects are moved further apart.

4 Sequential Clustering

Self-organized learning neural networks along with clustering methods excel in situations where unlabeled data have to be organized into a possibly unknown number of groups of objects [17]. The conditions are that objects in a group should be as similar to one another as possible, and objects in different groups as dissimilar as possible. For an overview, [18–20] describe different frameworks and methods. Note that the goal of this study is to identify an unknown number of attack variants from the data, with a focus on attack variants related to the four known ATs (see Sect. 2). Therefore, an unsupervised method is considered more suitable than a supervised or semi-supervised one. For that reason, the self-organized learning Self-Enforcing Network (SEN) was chosen for the analysis of the attack traffic. In addition, the visualization components of SEN allow a fast interpretation of the results.

To increase the amount of data that can be processed with SEN, the data is split into fragments, which are read sequentially into the SEN as objects. The SEN then organizes these fragments, and clusters are calculated based on the highest activation values and the smallest distance between the inserted objects (as input vectors) and the learned objects. The centroids, or geometric centers, of the clusters are extracted with a clustering algorithm and used as “reference types” [14] of their clusters, and all centroids from all fragments are read into a single SEN.

The clustering algorithm used is called MSL. It is a modification of a Single-Linkage algorithm [21] with two distance metrics and thresholds instead of one, namely the previously mentioned calculations. Both thresholds have to be met for two clusters to be merged into a single one. This means that

• the activation of at least one object o1 in the first cluster for an object o2 in the other cluster has to be higher than the activation threshold, and
• the distance between these objects o1 and o2 has to be smaller than the distance threshold.

Instead of fixed thresholds, MSL introduces a dynamic definition of the thresholds through a single parameter r. The parameter r dictates that the activation between two objects o1 and o2 must be at least the r-th highest and the distance at most the r-th lowest compared with all other activation and distance pairs involving o1.
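The merge rule can be illustrated with a minimal sketch. It is only one possible reading of the description above, not the authors' implementation: the function name `msl_clusters`, the precomputed `activation` and `distance` matrices, and the use of a union-find structure for the single-linkage merging are all assumptions made for illustration.

```python
def msl_clusters(activation, distance, r=1):
    """Group objects by a rank-based, two-criterion single-linkage rule.

    activation[i][j]: activation of object i for object j (higher = more similar).
    distance[i][j]:   distance between objects i and j (lower = more similar).
    Both matrices are assumed to be precomputed; the names are illustrative.
    """
    n = len(activation)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for o1 in range(n):
        others = [k for k in range(n) if k != o1]
        if not others:
            continue
        rank = min(r, len(others)) - 1
        # Dynamic thresholds for o1: r-th highest activation, r-th lowest distance.
        act_thr = sorted((activation[o1][k] for k in others), reverse=True)[rank]
        dist_thr = sorted(distance[o1][k] for k in others)[rank]
        for o2 in others:
            # Both criteria must hold before the two clusters are merged.
            if activation[o1][o2] >= act_thr and distance[o1][o2] <= dist_thr:
                parent[find(o1)] = find(o2)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy data: four points on a line; negative distance stands in for SEN activations.
points = [0.0, 1.0, 10.0, 11.0]
distance = [[abs(a - b) for b in points] for a in points]
activation = [[-d for d in row] for row in distance]
print(msl_clusters(activation, distance, r=1))  # [0, 0, 1, 1]
```

With r = 1, an object is linked only to a partner that is both its most strongly activated and its closest neighbor, so larger values of r loosen the merge condition.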


Previously, other clustering algorithms were tested with smaller data sets from the honeynet system: a variation of Lloyd’s k-means algorithm [22], a Single-Linkage algorithm using only one distance metric, and a combined algorithm using evidence accumulation. The variation of k-means uses multiple k-means runs and selects the run with the least mean square error as the result of the algorithm [23]. The combined algorithm using evidence accumulation uses multiple k-means runs to generate multiple clusterings. The co-occurrences of a pair of patterns in a cluster are mapped to a co-association matrix. An MST-based clustering algorithm (minimum spanning tree [21]) is then applied to this matrix to calculate the final clustering [23, 24]. While most algorithms were able to detect all known attack variants (see [8] for the variants) in the preliminary test data, the MSL algorithm stood out with its low time complexity and its ease of use. It performed significantly faster than Multiple k-Means and Evidence Accumulation, and having only the single parameter r meant less parameter exploration to discover successful parameter combinations.
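The co-association step of evidence accumulation can be sketched as follows. This is a minimal illustration under the assumption that each clustering is given as a list of cluster labels, one per pattern; the function name `co_association` is hypothetical, and the k-means runs and the MST-based final clustering are not shown.

```python
def co_association(clusterings):
    """Fraction of clusterings in which each pair of patterns shares a cluster.

    clusterings: list of label lists, all of the same length (one label per pattern).
    """
    n = len(clusterings[0])
    m = len(clusterings)
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Normalize the co-occurrence counts by the number of clusterings.
    return [[c / m for c in row] for row in counts]


# Two clusterings of three patterns: patterns 0 and 1 co-occur in one of two runs.
print(co_association([[0, 0, 1], [0, 1, 1]]))
# [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
```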

5 SIP-Data Analysis and Results

The data used for the analysis contains the SIP requests recorded at the honeynet system (cf. Sect. 2) during January 2016, around 4.2 million SIP packets in total. For the analysis, the following SIP header fields were used: SourceIP, SourcePort, DestinationIP, DestinationPort, Method, CallID, UserAgent, ContactUser, ContactHost, ToUser, ToHost, FromUser, FromHost, Via and Time. Aside from numerical fields like SourcePort, most fields had to be encoded into real values before they could be inserted into a SEN. For example, the IP addresses were split into three attributes: the first 8 bits and the last 24 bits as decimals, and the IP address class as an integer (1 to 5, where 1 represents A and 5 represents E). Other nonnumerical fields were encoded as the arithmetic mean of the ASCII decimal values of all characters, rounded to an integer value; values above 130 were changed to –130. Additionally, 12 comparison attributes were added, which contain +1.0 if the fields of the compared attributes are equal and –1.0 otherwise. In this analysis, 17 attributes were considered, each with a cvf equal to 1. Figure 1 shows these attributes along with their lower and upper limits, which are needed for the min-max normalization. For the analysis and the sequential clustering, the Enforcing Activation Function with c = 0.1 and one learning step was used. The dataset was split into fragments of 2000 objects each and the clustering algorithm parameter r was set to 1 with a precision of two decimal places. The 4.2 million SIP requests were compressed to 3400 elements through the sequential clustering, which SEN organized into 15 groups (see Fig. 2).

For visibility, only the four attack tools (ATs) SIPVicious, SIPCli, VAXSIPUserAgent and RUA (with the UserAgent names “friendly-scanner” (FS), “SIPCli” (CLI), “VaxSIPUserAgent” (VAX) and a random string consisting of 8 alphabetical characters (RND)) are shown, in red, blue, green and purple respectively.
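The field encoding described above can be sketched in a few lines. All function names are hypothetical, and several details are assumptions not stated in the text: only well-formed dotted-quad IPv4 addresses are handled, an empty string is encoded as 0, Python's built-in rounding is used, and min-max normalization maps to the interval [0, 1].

```python
def encode_ip(ip):
    """Split a dotted-quad IPv4 address into (first 8 bits, last 24 bits, class 1-5)."""
    o = [int(part) for part in ip.split(".")]
    first8 = o[0]
    last24 = (o[1] << 16) | (o[2] << 8) | o[3]
    # Classful ranges of the first octet: A < 128, B < 192, C < 224, D < 240, else E.
    ip_class = 1 + sum(first8 >= bound for bound in (128, 192, 224, 240))
    return first8, last24, ip_class


def encode_text(value):
    """Mean character code, rounded; values above 130 are changed to -130."""
    if not value:
        return 0  # assumption: the source does not say how empty fields are encoded
    # Note: round() rounds halves to the nearest even integer; the source
    # does not state which rounding rule was used.
    mean = round(sum(ord(ch) for ch in value) / len(value))
    return -130 if mean > 130 else mean


def compare_fields(a, b):
    """Comparison attribute: +1.0 if both fields are equal, -1.0 otherwise."""
    return 1.0 if a == b else -1.0


def min_max_normalize(value, lower, upper):
    """Min-max normalization to [0, 1] (the target interval is an assumption)."""
    return (value - lower) / (upper - lower)


print(encode_ip("192.168.1.10"))    # (192, 11010314, 3)
print(encode_text("ABC"))           # 66
print(compare_fields("100", "123")) # -1.0
```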


Fig. 1. Selected attributes with the cvf = 1 for all attributes

Fig. 2. Visualization of AT categories (15 identiﬁed groups) after sequential clustering (Color ﬁgure online)

13 out of 15 groups contain elements of a single Attack Tool only, while group 3 contains FS- and VAX-elements and group 7 contains FS-, CLI- and VAX-elements. Messages from RND can be distinguished in 100% of all cases; CLI in 4 out of 5, FS in 5 out of 7 and VAX in 3 out of 5. Considering that groups 3 and 7 contain elements of more than a single Attack Tool (AT), 18 reference types were derived from the groups to identify all elements. The correlation of all reference types for Friendly Scanner (FS) with the attack variants is shown in Table 1. The first seven FS reference types correlate to an attack variant presented by [8]. For example, reference type 1 correlates with variant SS-b or SS-f just from the shown characteristics.


Table 1. Correlation of reference types (1–7) to attack variants from [8], with new additional characteristics of the reference types

Ref. type   Attack variant   Additional characteristics
1           SS-b             CU/TU/FU = 100; TH/FH = 1.1.1.1
2           SS-d             CU/TU/FU = 100; TH/FH = 1.1.1.1
3           SS-f             CU/TU/FU = 100; TH/FH = invalid
4           RH-a             CU/TU/FU = 123; CH = 1.1.1.1
5           SS-d             CU/TU/FU = “” and others mixed; CH/TH/FH = invalid; Method: 50% REG
6           ES-a             CH = 1.1.1.1
7           ES-b1            Method: 75% REG, 20% INV
8           –                Method: 99.9% INV
9           –                CU/FU = 100; TU = “”; TH = invalid
10          –                CU/TU/FU = 123
11          –                CU/FU = “”; CH/FH = invalid
12          –                CU/FU = “”; CH/FH = invalid
13          –                –
14          –                TH/FH = 1.1.1.1
15          –                –
16          –                CU = 1 or multiple 1s; TU/FU = 100; CH = 127.0.0.1; TH/FH = 1.1.1.1
17          –                Method: 40% INV
18          –                CU/TU/FU = alphabetical strings

The characteristics of the other 11 reference types are, to the best knowledge of the authors, not presented in any other publication. Reference types 11 and 12, for example, have invalid IP addresses and empty ContactUser and FromUser fields, while reference type 16 contains “1.1.1.1” for ToHost and FromHost and “127.0.0.1” for ContactHost. Moreover, the reference types were tested with random data samples, i.e. new input vectors, from August 2016 and December 2016, in which multiple new reference types were discovered. These reference types are shown in Fig. 3 together with the reference types from January 2016. The 18 previously found reference types are confirmed by the new data, and 7 additional attack variations for the four dominant attack tools are detected and defined as reference types. The resulting 25 reference types are organized in 17 groups, 12 of which contain only reference types of a single Attack Tool. Out of six attack tool pairs, VAX/RND can be distinguished for all reference types.


Fig. 3. 18 reference types from January 2016 with additional 7 reference types from test data.

6 Conclusion and Further Work

The self-organized learning neural network Self-Enforcing Network (SEN) was used to analyze SIP request messages received by a honeynet system at the University of Duisburg-Essen. The goal was to identify categories for four dominant User Agents and the attack tools behind them. For that purpose, the SEN was extended with clustering capabilities to increase the amount of data that can be analyzed. The SIP attack traffic for January 2016 was analyzed and 18 different categories were identified for four User Agents: “friendly-scanner” (FS), “SIPCli” (CLI), “VaxSIPUserAgent” (VAX) and a string of eight random alphabetical characters (RND). Seven of the 18 categories were correlated to attack variants discovered by [8], while SEN discovered the other 11, which were previously unknown. The 18 categories are organized into 15 groups. Most groups contain categories of a single Attack Tool, meaning that input messages classified into those categories can be mapped to a specific AT without conflict. A test with messages from two other months discovered seven new categories for three out of four User Agents: three FS attack variations, three VAX variations, and an additional RND variation.

As part of future work, the importance and consideration of certain attributes could be adjusted to create more diverse categories for the User Agents. Another part of future work is the analysis of other user agents that were not considered in this study. The final part of future work includes the analysis of traffic from users in regular VoIP systems and the comparison to the identified categories, for the purpose of distinguishing attacks from regular usage in a live system.


References

1. Rosenberg, J., et al.: SIP – Session Initiation Protocol. No. RFC 3261 (2002)
2. Jacobson, V., Frederick, R., Casner, S., Schulzrinne, H.: RTP – A transport protocol for real-time applications. No. RFC 3550 (2003)
3. Franks, J., Hallam-Baker, P., Hostetler, J., Lawrence, S., Leach, P., Luotonen, A., Stewart, L.: HTTP authentication – Basic and Digest Access Authentication. No. RFC 2617 (1999)
4. Aleroud, A., Zhou, L.: Phishing environments, techniques, and countermeasures: a survey. Comput. Secur. 68, 160–196 (2017)
5. Elsabagh, M., Fleck, D., Stavrou, A., Kaplan, M., Bowen, T.: Revisiting difficulty notions for client puzzles and DoS resilience. In: Dacier, M., Bailey, M., Polychronakis, M., Antonakakis, M. (eds.) ISC 2012. LNCS, vol. 10453, pp. 39–54. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-319-66332-6_20
6. Manunza, L., Marseglia, S., Romano, S.P.: Kerberos: a real-time fraud detection system for IMS-enabled VoIP networks. J. Netw. Comput. Appl. 80, 22–34 (2017)
7. Vennila, G., Manikandan, M.S.K., Suresh, M.N.: Detection and prevention of spam over Internet telephony in Voice over Internet Protocol networks using Markov chain with incremental SVM. Int. J. Commun. Syst. 30(11), e3255 (2017)
8. Aziz, A., Hoffstadt, D., Ganz, S., Rathgeb, E.: Development and analysis of generic VoIP attack sequences based on analysis of real attack traffic. In: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2013), pp. 675–682. IEEE, Melbourne (2013)
9. Aziz, A., Hoffstadt, D., Rathgeb, E., Dreibholz, T.: A distributed infrastructure to analyse SIP attacks in the internet. In: IFIP Networking Conference 2014 (IFIP Networking), pp. 1–9 (2014)
10. Gruber, M., Hoffstadt, D., Aziz, A., Fankhauser, F., Schanes, C., Rathgeb, E., Grechenig, T.: Global VoIP security threats – large scale validation based on independent honeynets. In: IFIP Networking Conference (IFIP Networking) 2015, pp. 1–9 (2015)
11. Hoffstadt, D., Marold, A., Rathgeb, E.P.: Analysis of SIP-based threats using a VoIP honeynet system. In: Conference proceedings of the 11th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). Liverpool, UK (2012)
12. Hoffstadt, D., Monhof, S., Rathgeb, E.: SIP trace recorder: monitor and analysis tool for threats in SIP-based networks. In: 2012 8th International on Wireless Communications and Mobile Computing Conference (IWCMC), August 2012
13. Klüver, C.: Steering clustering of medical data in a self-enforcing network (SEN) with a cue validity factor. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8 (2016)
14. Klüver, C.: A self-enforcing network as a tool for clustering and analyzing complex data. Procedia Comput. Sci. 108, 2496–2500 (2017)
15. Klüver, C., Klüver, J., Zinkhan, D.: A self-enforcing neural network as decision support system for air traffic control based on probabilistic weather forecasts. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Anchorage, Alaska, USA, pp. 729–736 (2017)
16. Rosch, E., Mervis, C.B.: Family resemblances: studies in the internal structure of categories. Cognit. Psychol. 7(4), 573–605 (1975)
17. Liu, H., Ban, X.J.: Clustering by growing incremental self-organizing neural network. Expert Syst. Appl. 42(11), 4965–4981 (2015)
18. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
19. Fahad, A., et al.: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2(3), 267–279 (2014)
20. Aggarwal, C.C., Reddy, C.K. (eds.): Data clustering: algorithms and applications. CRC Press, Boca Raton (2013)
21. Gower, J.C., Ross, G.J.: Minimum spanning tree and single linkage cluster analysis. Appl. Stat. 18, 54–64 (1969)
22. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
23. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31(8), 651–666 (2010)
24. Fred, A.L., Jain, A.K.: Data clustering using evidence accumulation. In: 16th International Conference on Pattern Recognition Proceedings, vol. 4, pp. 276–280. IEEE (2002)

Anomaly Detection/Feature Selection/Autonomous Learning

Generalized Multi-view Unsupervised Feature Selection

Yue Liu, Changqing Zhang(B), Pengfei Zhu, and Qinghua Hu

School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
{liuyue76,zhangchangqing,zhupengfei,huqinghua}@tju.edu.cn

Abstract. Although many unsupervised feature selection (UFS) methods have been proposed, most of them still suffer from the following limitations: (1) these methods are usually applicable only to single-view data, and thus cannot exploit the ubiquitous complementarity among multiple views; (2) most existing UFS methods model the correlation between cluster structure and data distribution in linear ways, so more general correlations are difficult to explore. Therefore, we propose a novel unsupervised feature selection method, termed generalized Multi-View Unsupervised Feature Selection (gMUFS), to simultaneously explore the complementarity of multiple views and the complex correlation between cluster structure and selected features. Specifically, a multi-view consensus pseudo label matrix is learned, and the most valuable features are selected by maximizing the dependence between the consensus cluster structure and the selected features in kernel spaces with the Hilbert Schmidt independence criterion (HSIC).

Keywords: Unsupervised · Multi-view · Feature selection

1 Introduction

For many real-world applications, such as image understanding [15], bioinformatics [20] and text mining [19], data are usually represented as high-dimensional feature vectors. However, direct utilization of these high-dimensional data usually suffers from high computation cost, heavy storage burden, and performance degradation. Feature selection can reduce time and space requirements, alleviate the over-fitting problem due to the “curse of dimensionality”, and address the poor performance resulting from irrelevant and redundant features [8]. According to whether labels are available, feature selection approaches are basically categorized into supervised and unsupervised ones. Supervised feature selection methods usually jointly evaluate the importance of different features via the correlation between features and class labels [8,25]. Unfortunately, labeled data are usually scarce and manual labeling is rather expensive, while unlabeled data are much more abundant. Therefore, unsupervised feature selection (UFS) [3,14,17,24] is practically important and has attracted close attention.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 469–478, 2018. https://doi.org/10.1007/978-3-030-01421-6_45


Early methods [10,24] usually evaluate the importance of each feature individually and select features one by one, which cannot explore the correlation among features well. Then, some methods [3,25] address this issue in two steps. Recently, researchers have proposed methods [9,14,17,23,26] to simultaneously exploit discriminative information and feature correlation in a unified framework. This avoids the separation of structure identification and feature selection, and thus better performance can be expected.

Generally, there are two key factors for the success of unsupervised feature selection: identification of the underlying data structure, and exploration of the correlation between the underlying data structure and the selected features. For the first one, due to the lack of label information, the underlying data structure is difficult to identify accurately. Therefore, we attempt to borrow information from multiple views of the data to guide the process of feature selection. These multi-view representations can capture rich information from multiple cues to benefit the underlying structure identification. For the second one, most existing methods [14,17,26] hold the underlying assumption that there exists a linear correlation between the selected features and the data structure. However, correlation in practice is usually much more complex than the linear correlation assumed in most existing approaches.

To address the above limitations, we propose a novel unsupervised feature selection approach for multi-view data, termed generalized Multi-View Unsupervised Feature Selection (gMUFS). Specifically, there are two contributions in gMUFS. First, our method identifies the cluster structure of the data with the help of the complementarity among multiple views. Second, to explore more general correlation between the consensus data structure and the selected features, the Hilbert Schmidt independence criterion (HSIC) is introduced to capture feature-label dependence in kernel spaces. To solve our problem, an efficient alternating optimization algorithm is developed. Experimental results on benchmark datasets validate the effectiveness of the proposed approach over other state-of-the-art methods.

2 gMUFS: Our Feature Selection Model

2.1 Preliminaries

Throughout this paper, we use bolded lower-case letters to denote vectors in column form, bolded upper-case letters to denote matrices and upper-case letters to denote constants. We denote the data collection with $N$ samples and $V$ views as $\mathcal{D} = \{\mathbf{X}^{(v)} \in \mathbb{R}^{D_v \times N}\}_{v=1}^{V}$, where $\mathbf{X}^{(v)}$ is the feature matrix of the $v$-th view. By concatenating these views, the feature matrix corresponding to all views can be denoted as $\mathbf{X} = [\mathbf{X}^{(1)}; \cdots; \mathbf{X}^{(V)}] \in \mathbb{R}^{D \times N}$, where $D = \sum_{v=1}^{V} D_v$.

Considering the effectiveness of the spectral clustering technique [2,18], it is utilized to learn pseudo cluster labels to guide the process of feature selection in our approach. Specifically, for the affinity matrix used in spectral clustering, a k-nearest-neighbor graph is introduced, since the local structure of the data generally reflects both important discriminative and cluster information, and its effectiveness has been empirically proved by many feature selection methods [3,14]. Moreover, it usually works better than other graphs constructed according to global geometric structures. The affinity matrix $\mathbf{S}$ is defined as:

\[
S_{ij} =
\begin{cases}
\exp\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma^2}\right), & N_k(\mathbf{x}_i, \mathbf{x}_j) = 1 \\
0, & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $N_k(\mathbf{x}_i, \mathbf{x}_j)$ indicates the k-nearest neighboring relationship. Specifically, $N_k(\mathbf{x}_i, \mathbf{x}_j) = 1$ if $\mathbf{x}_i$ (or $\mathbf{x}_j$) belongs to the set of k-nearest neighbors of $\mathbf{x}_j$ (or $\mathbf{x}_i$); otherwise, $N_k(\mathbf{x}_i, \mathbf{x}_j) = 0$. Accordingly, the objective function of spectral clustering with local geometric structure is defined as follows:

\[
\min_{\mathbf{U}} \sum_{i,j=1}^{N} S_{ij} \left\| \frac{\mathbf{u}_i}{\sqrt{D_{ii}}} - \frac{\mathbf{u}_j}{\sqrt{D_{jj}}} \right\|^2 = \min_{\mathbf{U}} \mathrm{Tr}(\mathbf{U}^T \mathbf{L} \mathbf{U}), \quad \text{s.t. } \mathbf{U}^{(v)T} \mathbf{U}^{(v)} = \mathbf{I},
\tag{2}
\]

where $\mathbf{D}$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} S_{ij}$, and $\mathbf{L}$ is the normalized graph Laplacian matrix constructed with $\mathbf{L} = \mathbf{D}^{-1/2}(\mathbf{D} - \mathbf{S})\mathbf{D}^{-1/2}$.
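The construction of the affinity matrix in Eq. (1) and the normalized Laplacian used in Eq. (2) can be sketched with NumPy. This is an illustrative sketch rather than the authors' code: the function names `knn_affinity` and `normalized_laplacian` are hypothetical, a sample is assumed not to count as its own neighbor, and the bandwidth `sigma` is left as a free parameter because the text does not say how it is chosen.

```python
import numpy as np


def knn_affinity(X, k=5, sigma=1.0):
    """Affinity matrix of Eq. (1) for a D x N data matrix X (columns are samples)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances, clipped against round-off.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    # Exclude each sample from its own neighbor list (assumption).
    no_self = dist2.copy()
    np.fill_diagonal(no_self, np.inf)
    nearest = np.argsort(no_self, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nearest.ravel()] = True
    # N_k(x_i, x_j) = 1 if either sample is among the other's k nearest neighbors.
    mask = mask | mask.T
    return np.where(mask, np.exp(-dist2 / sigma ** 2), 0.0)


def normalized_laplacian(S):
    """L = D^{-1/2} (D - S) D^{-1/2}, with D the diagonal degree matrix of S."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return (np.diag(d) - S) * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))           # 3 features, 8 samples
L = normalized_laplacian(knn_affinity(X, k=3))
print(L.shape)                        # (8, 8)
```

Taking the union of the two neighborhood relations keeps S symmetric, which in turn makes L symmetric and positive semi-definite.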

2.2 Generalized Correlation

For unsupervised feature selection [6,14,17], to ensure the quality of the selected features, many approaches try to maximize the correlation between the selected features and the pseudo label matrix. Typically, the loss function is defined as $\|\mathbf{X}^T\mathbf{W} - \mathbf{U}\|_F^2$, or as $\|\mathbf{X}^T\mathbf{W} - \mathbf{U}\|_{2,1}$ for robustness, where $\mathbf{W} \in \mathbb{R}^{D \times C}$ is the feature selection matrix. Clearly, the underlying assumption is that the pseudo label matrix can be linearly reconstructed from the selected features, which is limiting in cases of complex dependence in practice. Therefore, we propose to measure the dependence in kernel space, which maps variables into a reproducing kernel Hilbert space such that the correlations measured in that space correspond to high-order joint moments between the original distributions [1,7,16]. Supposing that $Z = \{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{N}$ are jointly drawn from two domains $\mathcal{X}$ ($\mathbf{x}_i \in \mathcal{X}$) and $\mathcal{Y}$ ($\mathbf{y}_i \in \mathcal{Y}$), with $\mathcal{F}$ and $\mathcal{G}$ being kernel spaces on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the dependence of the two random variables is measured as:

\[
\mathrm{HSIC}(Z, \mathcal{F}, \mathcal{G}) = (N - 1)^{-2}\, \mathrm{Tr}(\mathbf{K}_1 \mathbf{H} \mathbf{K}_2 \mathbf{H}),
\tag{3}
\]

where $\mathbf{K}_1$ and $\mathbf{K}_2$ are the Gram matrices corresponding to the two variables. For a constant $N$, $\mathbf{1}_N \in \mathbb{R}^N$ is a column vector with all elements being 1, and $\mathbf{H} = \mathbf{I} - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T \in \mathbb{R}^{N \times N}$ centers the matrix to have zero mean. In our approach, we aim to select the features with high correlation with the pseudo labels. Therefore, we should maximize the dependence between the selected features and the pseudo labels. Ignoring the constant scaling factor $(N - 1)^{-2}$, we should select features that maximize the following objective:

\[
\mathrm{HSIC}(\mathbf{X}^T\mathbf{W}, \mathbf{U}) = \mathrm{Tr}(\mathbf{K}_F \mathbf{H} \mathbf{K}_L \mathbf{H}),
\tag{4}
\]

where $\mathbf{K}_F$ and $\mathbf{K}_L$ are the Gram matrices corresponding to the selected features and pseudo labels, respectively.
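The empirical HSIC of Eq. (3) is straightforward to compute once the Gram matrices are fixed. The sketch below assumes the inner product (linear) kernel for both variables, which is one possible choice; the function name `hsic_linear` and the row-per-sample layout of the inputs are assumptions made for illustration.

```python
import numpy as np


def hsic_linear(A, B):
    """Empirical HSIC of Eq. (3) with inner product kernels.

    A: N x p array, B: N x q array; each row is one sample.
    """
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K1 = A @ A.T                          # Gram matrix of the first variable
    K2 = B @ B.T                          # Gram matrix of the second variable
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2


rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
print(hsic_linear(A, A) > hsic_linear(A, np.ones((20, 1))))  # True
```

With linear kernels the trace equals the squared Frobenius norm of the cross-covariance between the centered variables, so a constant second variable yields a value of zero (up to round-off). Other kernels change only how `K1` and `K2` are built.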

2.3 Generalized Multi-view UFS

For our multi-view feature selection model, the goal is to jointly select features from different views; thus, we explore the complementarity across these views. To ensure the consistency of multi-view data, we introduce a consensus pseudo label matrix and push each view-specific pseudo label matrix $\mathbf{U}^{(v)}$ towards the consensus pseudo label matrix $\mathbf{U}^{(*)}$, which will be more reasonable and robust and, accordingly, guides the feature selection well. Specifically, we introduce the disagreement measure [12] between the consensus pseudo cluster label matrix and that of each view as follows:

\[
\mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) = \left\| \frac{\mathbf{K}_{\mathbf{U}^{(v)}}}{\|\mathbf{K}_{\mathbf{U}^{(v)}}\|_F^2} - \frac{\mathbf{K}_{\mathbf{U}^{(*)}}}{\|\mathbf{K}_{\mathbf{U}^{(*)}}\|_F^2} \right\|_F^2,
\tag{5}
\]

where $\mathbf{K}_{\mathbf{U}^{(\cdot)}}$ is the affinity matrix for $\mathbf{U}^{(\cdot)}$ and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Under the condition $\mathbf{U}^{(v)T}\mathbf{U}^{(v)} = \mathbf{I}$ and using the inner product kernel, i.e., $\mathbf{K}_{\mathbf{U}^{(v)}} = \mathbf{U}^{(v)}\mathbf{U}^{(v)T}$, we have $\|\mathbf{K}_{\mathbf{U}^{(v)}}\|_F^2 = C$, where $C$ is the number of clusters. By ignoring the constant additive and scaling terms, Eq. (5) turns out to be:

\[
\mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) = -\mathrm{Tr}(\mathbf{U}^{(v)}\mathbf{U}^{(v)T}\mathbf{U}^{(*)}\mathbf{U}^{(*)T}).
\tag{6}
\]
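The step from Eq. (5) to Eq. (6) can be made explicit. The short derivation below is a sketch that uses only the stated assumptions (orthonormal columns and the inner product kernel) and additionally assumes that the consensus matrix satisfies the same orthogonality condition, so that its Gram matrix also has squared Frobenius norm $C$; the intermediate constants are not given in the text.

```latex
\begin{aligned}
\|\mathbf{K}_{\mathbf{U}^{(v)}}\|_F^2
  &= \mathrm{Tr}\big(\mathbf{U}^{(v)}\mathbf{U}^{(v)T}\mathbf{U}^{(v)}\mathbf{U}^{(v)T}\big)
   = \mathrm{Tr}\big(\mathbf{U}^{(v)T}\mathbf{U}^{(v)}\,\mathbf{U}^{(v)T}\mathbf{U}^{(v)}\big)
   = \mathrm{Tr}(\mathbf{I}_C) = C, \\
\mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)})
  &= \frac{1}{C^2}\,\big\|\mathbf{K}_{\mathbf{U}^{(v)}} - \mathbf{K}_{\mathbf{U}^{(*)}}\big\|_F^2
   = \frac{1}{C^2}\Big(\|\mathbf{K}_{\mathbf{U}^{(v)}}\|_F^2 + \|\mathbf{K}_{\mathbf{U}^{(*)}}\|_F^2
     - 2\,\mathrm{Tr}\big(\mathbf{K}_{\mathbf{U}^{(v)}}\mathbf{K}_{\mathbf{U}^{(*)}}\big)\Big) \\
  &= \frac{2}{C} - \frac{2}{C^2}\,
     \mathrm{Tr}\big(\mathbf{U}^{(v)}\mathbf{U}^{(v)T}\mathbf{U}^{(*)}\mathbf{U}^{(*)T}\big).
\end{aligned}
```

Dropping the additive constant $2/C$ and the positive scaling factor $2/C^2$ leaves the trace term with a negative sign, which is the form used in Eq. (6).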

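Accordingly, the proposed generalized multi-view unsupervised feature selection model is induced as:

\[
\begin{aligned}
\min_{\mathbf{U}^{(1)},\cdots,\mathbf{U}^{(V)},\,\mathbf{U}^{(*)},\,\mathbf{W}} \;
& \sum_{v=1}^{V} \mathrm{Tr}(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}) + \gamma\|\mathbf{W}\|_{2,1} \\
& + \alpha \sum_{v=1}^{V} \mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) + \beta\,\mathrm{IND}(\mathbf{X}^T\mathbf{W}, \mathbf{U}^{(*)}) \\
\text{s.t. } & \mathbf{U}^{(v)T}\mathbf{U}^{(v)} = \mathbf{I}, \; \mathbf{U}^{(*)T}\mathbf{U}^{(*)} = \mathbf{I}, \; \mathbf{U}^{(v)} \geq 0, \; \mathbf{U}^{(*)} \geq 0.
\end{aligned}
\tag{7}
\]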

Note that, under the nonnegative and orthogonal constraints, there is only one element in each row of $\mathbf{U}$ which is greater than zero, and all of the others are zeros. The structured-sparsity regularization on $\mathbf{W}$ is realized by the $\ell_{2,1}$-norm. For the consistency of signs between the different terms, we define the independence measure as $\mathrm{IND}(\cdot, \cdot) = -\mathrm{HSIC}(\cdot, \cdot)$. The nonnegative scalars $\alpha$, $\beta$ and $\gamma$ are tradeoff parameters. The nonnegativity constraints are imposed on the pseudo label matrices to agree with the label definition and for interpretability [14]. The $\ell_{2,1}$-norm imposed on the feature selection matrix $\mathbf{W}$ ensures sparseness in rows, making it particularly suitable for feature selection. According to the objective function, our method simultaneously promotes the quality of the pseudo labels by exploiting the complementarity of different views and explores the complex correlation between the selected features and the multi-view consensus cluster structure.
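The row-sparsity effect of the $\ell_{2,1}$-norm can be illustrated with a short sketch. The norm itself follows its standard definition (the sum of the Euclidean norms of the rows). Ranking features by the row norms of W is a common convention in this line of work and is an assumption here, since the text does not spell out the final selection rule; both function names are hypothetical.

```python
import numpy as np


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return float(np.sum(np.linalg.norm(W, axis=1)))


def top_features(W, num_features):
    """Indices of the rows of W with the largest norms (assumed selection rule)."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores, kind="stable")[:num_features]


W = np.array([[3.0, 4.0],   # row norm 5
              [0.0, 0.0],   # row norm 0: this feature is effectively discarded
              [1.0, 0.0]])  # row norm 1
print(l21_norm(W))           # 6.0
print(top_features(W, 2))    # [0 2]
```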

3 Optimization

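In this section, we propose an iterative updating algorithm to solve the optimization problem of gMUFS. The objective function in Eq. (7) is difficult to solve jointly with respect to $\{\mathbf{U}^{(v)}\}_{v=1}^{V}$, $\mathbf{U}^{(*)}$ and $\mathbf{W}$; therefore, alternating optimization is introduced. Firstly, the objective function is rewritten as:

\[
\begin{aligned}
\min_{\mathbf{U}^{(1)},\cdots,\mathbf{U}^{(V)},\,\mathbf{U}^{(*)},\,\mathbf{W}} \;
& \sum_{v=1}^{V} \mathrm{Tr}(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}) + \gamma\|\mathbf{W}\|_{2,1} + \alpha \sum_{v=1}^{V} \mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) \\
& + \beta\,\mathrm{IND}(\mathbf{X}^T\mathbf{W}, \mathbf{U}^{(*)}) + \frac{\eta}{2}\Big(\big\|\mathbf{U}^{(*)T}\mathbf{U}^{(*)} - \mathbf{I}\big\|_F^2 + \sum_{v=1}^{V} \big\|\mathbf{U}^{(v)T}\mathbf{U}^{(v)} - \mathbf{I}\big\|_F^2\Big),
\end{aligned}
\tag{8}
\]

where $\eta > 0$ is the parameter for the orthogonality condition. In practice, $\eta$ should be large enough to ensure that orthogonality is satisfied. For convenience, we define

\[
\begin{aligned}
\mathcal{L}(\mathbf{U}^{(1)},\cdots,\mathbf{U}^{(V)},\mathbf{U}^{(*)},\mathbf{W}) = \;
& \sum_{v=1}^{V} \mathrm{Tr}(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}) + \alpha \sum_{v=1}^{V} \mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) \\
& + \beta\,\mathrm{IND}(\mathbf{X}^T\mathbf{W}, \mathbf{U}^{(*)}) + \gamma\|\mathbf{W}\|_{2,1} + \frac{\eta}{2}\Big(\big\|\mathbf{U}^{(*)T}\mathbf{U}^{(*)} - \mathbf{I}\big\|_F^2 + \sum_{v=1}^{V} \big\|\mathbf{U}^{(v)T}\mathbf{U}^{(v)} - \mathbf{I}\big\|_F^2\Big).
\end{aligned}
\tag{9}
\]

• Update $\mathbf{W}$ by fixing $\mathbf{U}^{(1)},\cdots,\mathbf{U}^{(V)}$ and $\mathbf{U}^{(*)}$: Similarly to the existing method [4] and for optimization convenience, we employ the inner product kernel for HSIC. Accordingly, the subproblem is to minimize the following function:

\[
\mathcal{L}(\mathbf{W}) = \mathrm{Tr}(\mathbf{W}^T(\gamma\mathbf{G} - \beta\mathbf{M})\mathbf{W}),
\tag{10}
\]

where $\mathbf{G}$ is a diagonal matrix with elements defined as $G_{ii} = \frac{1}{2\|\mathbf{w}_i\|_2}$ and $\mathbf{M} = \mathbf{X}\mathbf{H}\mathbf{U}^{(*)}\mathbf{U}^{(*)T}\mathbf{H}\mathbf{X}^T$. To avoid the trivial solution, we constrain $\mathbf{W}$ with $\mathbf{W}^T\mathbf{W} = \mathbf{I}$. The above problem is then similar to the objective of spectral clustering. Therefore, the solution for $\mathbf{W}$ consists of the first $P$ eigenvectors (corresponding to the $P$ smallest eigenvalues) of the matrix $\gamma\mathbf{G} - \beta\mathbf{M}$. It is noteworthy that, since the dependence is measured in kernel space, the dimensionalities of $\mathbf{U}^{(*)}$ and $\mathbf{X}^T\mathbf{W}$ need not be the same, i.e., $P \neq C$ is allowed. Therefore, our method is more flexible than others [14,17].

• Update $\mathbf{U}^{(v)}$ by fixing $\mathbf{W}$ and $\mathbf{U}^{(*)}$: We introduce multiplicative updating rules [13]. Specifically, since $\mathbf{U}^{(v)} \geq 0$, we can solve the problem by introducing a Lagrange multiplier matrix $\mathbf{\Phi} = [\phi_{ij}]$ with $\phi_{ij}$ corresponding to $U_{ij}^{(v)}$. Then, the Lagrange function is as follows:

\[
\mathrm{Tr}(\mathbf{U}^{(v)T}\mathbf{L}^{(v)}\mathbf{U}^{(v)}) + \alpha\,\mathrm{DA}(\mathbf{U}^{(v)}, \mathbf{U}^{(*)}) + \frac{\eta}{2}\big\|\mathbf{U}^{(v)T}\mathbf{U}^{(v)} - \mathbf{I}\big\|_F^2 + \mathrm{Tr}(\mathbf{\Phi}\mathbf{U}^{(v)T}).
\tag{11}
\]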


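Setting the derivative of the Lagrange function with respect to $\mathbf{U}^{(v)}$ to zero, we get

\[
2\mathbf{N}\mathbf{U}^{(v)} - 2\eta\mathbf{U}^{(v)} + \mathbf{\Phi} = 0,
\tag{12}
\]

where $\mathbf{N} = \mathbf{L}^{(v)} - \alpha\mathbf{U}^{(*)}\mathbf{U}^{(*)T} + \eta\mathbf{U}^{(v)}\mathbf{U}^{(v)T}$. According to the Karush-Kuhn-Tucker (KKT) condition, i.e., $\phi_{ij}U_{ij}^{(v)} = 0$, we get the updating rule as follows:

\[
U_{ij}^{(v)} \leftarrow U_{ij}^{(v)} \frac{(\eta\mathbf{U}^{(v)})_{ij}}{(\mathbf{N}\mathbf{U}^{(v)})_{ij}}.
\tag{13}
\]

• Update $\mathbf{U}^{(*)}$ by fixing $\mathbf{U}^{(1)},\cdots,\mathbf{U}^{(V)}$ and $\mathbf{W}$: Similarly, we obtain the following update rule:

\[
U_{ij}^{(*)} \leftarrow U_{ij}^{(*)} \frac{(\eta\mathbf{U}^{(*)})_{ij}}{(\mathbf{Q}\mathbf{U}^{(*)})_{ij}},
\tag{14}
\]

where $\mathbf{Q} = \eta\mathbf{U}^{(*)}\mathbf{U}^{(*)T} - \alpha\sum_{v=1}^{V}\mathbf{U}^{(v)}\mathbf{U}^{(v)T} - \beta\mathbf{H}\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}\mathbf{H}$. Similarly to the work [14], we normalize $\mathbf{U}^{(v)}$ and $\mathbf{U}^{(*)}$ to ensure $(\mathbf{U}^{(v)T}\mathbf{U}^{(v)})_{ii} = 1$ and $(\mathbf{U}^{(*)T}\mathbf{U}^{(*)})_{ii} = 1$ after the above steps in Eqs. (13) and (14). We initialize each $\mathbf{U}^{(v)}$ with standard spectral clustering corresponding to the $v$-th view, and $\mathbf{U}^{(*)}$ is initialized by averaging these $\mathbf{U}^{(v)}$s.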
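One pass of the W-update (Eq. (10)) and of the view-specific update (Eq. (13)) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, X is taken as a D x N matrix, the small constant `eps` that guards against division by zero or by a non-positive denominator is an addition not present in the equations, and the update of the consensus matrix (Eq. (14)) and the outer loop are omitted.

```python
import numpy as np


def update_W(X, U_star, W, P, beta, gamma, eps=1e-12):
    """W-update: the P eigenvectors of gamma*G - beta*M with smallest eigenvalues."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    # G_ii = 1 / (2 ||w_i||_2), computed from the rows of the current W.
    G = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps)))
    XHU = X @ H @ U_star
    M = XHU @ XHU.T                      # X H U* U*^T H X^T (H is symmetric)
    A = gamma * G - beta * M
    _, vecs = np.linalg.eigh((A + A.T) / 2.0)  # eigenvalues in ascending order
    return vecs[:, :P]


def update_U_view(U, L, U_star, alpha, eta, eps=1e-12):
    """Multiplicative update of Eq. (13) for one view."""
    N_mat = L - alpha * (U_star @ U_star.T) + eta * (U @ U.T)
    denom = np.maximum(N_mat @ U, eps)   # guard: not part of Eq. (13)
    return U * (eta * U) / denom


def normalize_columns(U, eps=1e-12):
    """Rescale so that (U^T U)_ii = 1 for every column i."""
    return U / np.maximum(np.linalg.norm(U, axis=0), eps)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))                     # 6 features, 10 samples
U_star = np.abs(rng.normal(size=(10, 3)))        # toy consensus labels
W = update_W(X, U_star, rng.normal(size=(6, 2)), P=2, beta=1.0, gamma=1.0)
print(np.allclose(W.T @ W, np.eye(2)))           # True
```

Because G depends on the current W, the W-update is typically repeated inside the alternating loop rather than solved once.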

4 Experiments

In this section, we conduct extensive experiments to evaluate the proposed gMUFS. Following previous unsupervised feature selection approaches [3,23], we report the performance of different methods in terms of clustering. The experiments are conducted on 7 real-world datasets. For WIDE¹, we extract 5 types of features, i.e., color histogram (64), color autocorrelogram (144), edge direction histogram (73), wavelet texture (128) and block-wise color moments (225), where the numbers in parentheses indicate the dimensionality of each view. For MSRCv1 [22] and Caltech101-7 [5], 5 types of features are as follows: HOG (100), GIST (521), LBP (256), SIFT (210/441), and CENT (1302). For Flickr² and Oxford³, 4 types of features are extracted, i.e., SIFT (200), GIST (512), LBP (59) and PHOG (680). For the action recognition datasets Still DB [11] and Willow⁴, 3 types of features are used, i.e., Sift Bow (200), Color Sift Bow (200) and Shape Context Bow (200).

¹ http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
² https://www.flickr.com/
³ http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
⁴ http://www.di.ens.fr/willow/research/stillactions/

4.1 Experiment Setup

We compare our method with several state-of-the-art unsupervised feature selection methods on the clustering task. AllFeatures concatenates all types of features for clustering. NDFS [14] performs feature selection within a joint framework of nonnegative spectral analysis and $\ell_{2,1}$-norm regularized regression. UDFS [23] exploits local discriminative information and feature correlation simultaneously. SPEC [24] selects features with spectral regression. MCFS [3] utilizes spectral regression with the $\ell_1$-norm to select features. AUMFS [6] and MSSFL [21] exploit both intra-view and inter-view information to jointly select features. Specifically, beyond comparing with the multi-view feature selection methods AUMFS and MSSFL, to comprehensively compare different algorithms, we perform feature selection for each view by using NDFS, UDFS, SPEC and MCFS, with the performance of the best view reported. Furthermore, we conduct multi-view feature selection by concatenating all types of features for these methods: AllFeatures, NDFS, UDFS, SPEC, and MCFS.

Following previous work, we set k = 5 for all the datasets to specify the size of neighborhoods and construct the affinity graph. We tune the parameters for all methods with the grid search strategy from $\{10^{-6}, 10^{-4}, \ldots, 10^{4}, 10^{6}\}$. For gMUFS and NDFS, we set $\eta = 10^{8}$ to ensure that orthogonality is satisfied [14]. The number of selected features is set to a value from {10, 20, ..., 100}, or from {10, 20, ..., 50} when the dimensionality is smaller than 100, and the best results are reported. Due to the randomness of the K-means clustering employed, we repeat each experiment 20 times with random initialization, and the average results with standard deviations are reported.

4.2 Experiment Results and Analysis

As shown in Tables 1 and 2, we report the quantitative results in terms of Accuracy (ACC) and Normalized Mutual Information (NMI) for different methods. SV and MV indicate single-view and multi-view methods, respectively. First, it is observed that the results obtained by directly concatenating all views are significantly better than those obtained with each single view. This confirms the importance of integrating multiple views. Second, although they use all views, the traditional single-view methods perform clearly worse than AUMFS, MSSFL and ours. This is principally because these approaches cannot explore the complementarity among multiple views by simple feature concatenation. Third, the proposed method, gMUFS, achieves the best performance on 6 out of 7 datasets, which empirically demonstrates the effectiveness of jointly exploiting multi-view representations and exploring the complex correlation between the selected features and the cluster structure.

We provide the parameter sensitivity analysis in Fig. 1. By fixing the value of one parameter (to 1 in our experiments), we tune the other two parameters. The results demonstrate that our method is relatively robust to the three parameters α, β and γ, since promising results can be expected over wide ranges. It is noteworthy that, compared with single-view unsupervised feature selection methods [14,23], although our multi-view method introduces one more parameter α for handling multi-view correlation, it is very robust and easy to tune in practice. We also empirically study the convergence of our optimization algorithm. According to Fig. 2, our algorithm converges within 10 iterations, which validates the effectiveness of the proposed optimization algorithm.

Table 1. Clustering results (ACC% ± std) of different algorithms.

Type  Method    WIDE      MSRCv1    Caltech101  Flickr    Oxford    Still DB  Willow
SV    NDFS      25.2±1.0  59.7±4.8  62.1±3.6    25.7±0.9  24.9±1.1  26.2±1.0  31.5±1.5
SV    UDFS      24.4±1.0  53.5±4.5  60.6±3.1    25.4±1.0  24.3±1.1  26.4±0.7  31.0±1.2
SV    SPEC      24.6±1.3  46.2±4.7  56.5±3.6    24.4±0.9  23.4±1.1  24.5±0.9  30.0±1.9
SV    MCFS      24.3±1.2  51.5±5.0  56.1±3.1    25.3±0.9  24.1±0.9  25.2±1.0  30.4±1.3
MV    AllFeat   26.0±1.1  42.0±1.3  69.3±4.0    27.5±1.1  23.2±0.8  23.4±0.8  30.4±1.6
MV    NDFS      28.2±1.3  57.2±5.1  65.5±5.7    28.0±1.4  27.9±1.9  28.7±1.0  32.4±1.6
MV    UDFS      25.6±1.4  62.6±3.0  58.2±5.0    26.8±1.0  22.5±1.1  29.1±2.0  30.3±1.6
MV    SPEC      23.4±0.8  57.6±4.8  45.9±4.4    23.7±1.3  21.1±1.3  26.8±1.5  29.7±0.8
MV    MCFS      24.2±1.3  67.9±5.0  58.2±4.2    23.2±0.8  24.4±0.6  25.9±1.3  31.0±1.5
MV    MSSFL     28.4±1.7  56.8±5.8  72.0±5.0    29.3±1.5  28.0±1.2  27.4±1.0  31.7±1.3
MV    AUMFS     28.5±1.3  53.7±0.3  58.0±3.5    27.4±0.9  26.8±1.4  28.8±1.1  31.4±2.3
MV    Ours      29.7±1.4  78.9±3.8  75.6±3.5    30.0±1.3  29.1±0.8  28.4±1.0  33.0±1.7

Table 2. Clustering results (NMI% ± std) of different algorithms.

Type  Method    WIDE      MSRCv1    Caltech101  Flickr    Oxford    Still DB  Willow
SV    NDFS      14.6±0.8  51.3±3.6  54.7±2.8    14.9±0.6  12.7±1.0  7.6±0.6   10.7±0.8
SV    UDFS      13.9±0.7  45.8±3.4  52.8±2.7    13.9±0.5  11.5±0.6  6.5±0.6   10.4±0.8
SV    SPEC      12.9±0.7  38.7±4.0  46.9±2.6    12.7±0.6  10.5±0.7  6.0±0.5   9.6±1.1
SV    MCFS      13.5±0.7  43.7±3.4  48.9±2.9    14.5±0.5  11.3±0.6  6.5±0.6   9.7±1.1
MV    AllFeat   18.0±1.1  40.0±1.6  67.4±2.6    15.5±0.6  14.7±0.5  4.0±0.4   10.9±1.5
MV    NDFS      16.7±1.5  50.1±5.5  62.5±4.9    16.5±0.8  15.2±0.6  9.2±0.5   13.1±0.9
MV    UDFS      15.0±0.7  55.3±3.0  55.0±3.3    15.6±0.5  11.6±0.5  8.6±0.8   12.5±0.5
MV    SPEC      11.4±0.7  49.8±3.6  34.6±4.0    12.1±0.7  7.9±0.8   8.0±0.6   11.5±0.7
MV    MCFS      12.8±1.1  62.8±2.4  50.5±4.1    12.9±0.4  14.3±1.1  7.6±0.6   12.7±1.2
MV    MSSFL     18.4±0.8  48.7±5.5  60.5±2.4    18.8±0.6  16.1±1.0  8.3±0.5   13.8±1.0
MV    AUMFS     16.0±1.1  47.8±1.3  52.9±2.8    16.2±0.5  15.6±1.1  9.9±0.6   12.0±1.0
MV    Ours      19.8±1.0  69.1±3.7  67.7±1.8    19.3±0.5  16.2±1.0  8.7±0.4   14.4±1.0

Fig. 1. Parameter sensitivity evaluation on Still DB (ACC for each pair of the parameters α, β and γ).

Generalized Multi-view Unsupervised Feature Selection

Fig. 2. Convergence curves on Oxford, Caltech and Still DB (objective function value versus iteration number).

5 Conclusion

In this work, we have developed a novel multi-view unsupervised feature selection approach, which jointly exploits the complementarity of multiple views and explores the general correlation between the selected features and the underlying cluster structure. Benefiting from the complementarity of different views, the underlying cluster structure can be well identified; subsequently, the Hilbert-Schmidt independence criterion (HSIC) is employed to capture more general dependencies between the selected features and the pseudo cluster labels. Extensive experimental results on real-world datasets demonstrate the effectiveness of our model. For simplicity and efficiency, we adopted the inner product kernel for HSIC; in the future, we will consider other kernels for better performance.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61602337, 61732011, 61702358, 61402323).

References

1. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. JMLR 3, 1–48 (2002)
2. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: NIPS, pp. 585–591 (2002)
3. Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data. In: SIGKDD, pp. 333–342 (2010)
4. Cao, X., Zhang, C., Fu, H., et al.: Diversity-induced multi-view subspace clustering. In: CVPR, pp. 586–594 (2015)
5. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106(1), 59–70 (2007)
6. Feng, Y., Xiao, J., Zhuang, Y., Liu, X.: Adaptive unsupervised multi-view feature selection for visual concept recognition. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012. LNCS, vol. 7724, pp. 343–357. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37331-2_26
7. Gretton, A., Bousquet, O., Smola, A., Schölkopf, B.: Measuring statistical dependence with Hilbert-Schmidt norms. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 63–77. Springer, Heidelberg (2005). https://doi.org/10.1007/11564089_7


Y. Liu et al.

8. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. JMLR 3, 1157–1182 (2003)
9. Han, D., Kim, J.: Unsupervised simultaneous orthogonal basis clustering feature selection. In: CVPR, pp. 5016–5023 (2015)
10. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: NIPS, pp. 507–514 (2006)
11. Ikizler, N., Cinbis, R.G., Pehlivan, S., et al.: Recognizing actions from still images. In: ICPR, pp. 1–4 (2008)
12. Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spectral clustering. In: NIPS, pp. 1413–1421 (2011)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788 (1999)
14. Li, Z., Yang, Y., Liu, J., et al.: Unsupervised feature selection using nonnegative spectral analysis. In: AAAI, vol. 2, pp. 1026–1032 (2012)
15. Naikal, N., Yang, A.Y., Sastry, S.S.: Informative feature selection for object recognition via sparse PCA. In: ICCV, pp. 818–825 (2011)
16. Niu, D., Dy, J.G., Jordan, M.I.: Iterative discovery of multiple alternative clustering views. IEEE T-PAMI 36(7), 1340–1353 (2014)
17. Qian, M., Zhai, C.: Robust unsupervised feature selection. In: IJCAI, pp. 1621–1627 (2013)
18. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE T-PAMI 22(8), 888–905 (2000)
19. Tang, B., Kay, S., He, H.: Toward optimal feature selection in naive Bayes for text categorization. IEEE T-KDE 28(9), 2508–2521 (2016)
20. Wang, H., Nie, F., Huang, H.: Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort. Bioinformatics 28(2), 229–237 (2011)
21. Wang, H., Nie, F., Huang, H.: Multi-view clustering and feature learning via structured sparsity. In: ICML, pp. 352–360 (2013)
22. Winn, J., Jojic, N.: LOCUS: learning object classes with unsupervised segmentation. In: ICCV, vol. 1, pp. 756–763 (2005)
23. Yang, Y., Shen, H.T., Ma, Z.: L2,1-norm regularized discriminative feature selection for unsupervised learning. IJCAI 22(1), 1589 (2011)
24. Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learning. In: ICML, pp. 1151–1157 (2007)
25. Zhao, Z., Wang, L., Liu, H.: Efficient spectral feature selection with minimum redundancy. In: AAAI, pp. 673–678 (2010)
26. Zhu, P., Hu, Q., Zhang, C., et al.: Coupled dictionary learning for unsupervised feature selection. In: AAAI, pp. 2422–2428 (2016)

Performance Anomaly Detection Models of Virtual Machines for Network Function Virtualization Infrastructure with Machine Learning

Juan Qiu, Qingfeng Du, Yu He, YiQun Lin, Jiaye Zhu, and Kanglin Yin

Tongji University, Shanghai 201804, China

Abstract. Network Function Virtualization (NFV) technology has become a new solution for running network applications. It proposes a new paradigm for network function management and opens up considerable room for innovation in network technology. However, the complexity of the NFV Infrastructure (NFVI) imposes hard-to-predict relationships between Virtualized Network Function (VNF) performance metrics (e.g., latency, throughput), the underlying allocated resources (e.g., load of vCPU), and the overall system workload. The evolving NFV scenario therefore calls for adequate performance analysis methodologies, and early detection of performance anomalies plays a significant role in providing high-quality network services. In this paper, we propose a method for detecting performance anomalies in the NFV infrastructure with machine learning. We present a case study on Clearwater, an open source NFV-oriented project that implements an IP Multimedia Subsystem (IMS). Several classical classifiers are applied and compared empirically on an anomaly dataset that we built ourselves. Taking the risk of over-fitting into account, the experimental results show that the neural network is the best anomaly detection model, with an accuracy of 0.94 on the training set and 0.90 on the testing set.

Keywords: NFV · Performance anomaly detection · Machine learning

1 Introduction

Network Function Virtualization (NFV) is an emerging paradigm, a new vision of the network that takes advantage of advances in dynamic cloud architecture, Software Defined Networking (SDN), and modern software provisioning techniques. The analysis of NFV bottlenecks, and the hardware and software features relevant for high and predictable performance, have already been highlighted in the Group Specification (GS) published by the European Telecommunications Standards Institute (ETSI) Industry Specification Group (ISG) for NFV [1].

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 479–488, 2018. https://doi.org/10.1007/978-3-030-01421-6_46


J. Qiu et al.

This paper aims to detect performance anomalies by modeling the performance metrics collected from the virtual machines of the NFV platform. We conduct an experiment with Clearwater, an open source NFV-oriented project that has been designed to support massive horizontal scalability and adopts popular cloud computing design patterns and technologies, to demonstrate how the proposed method can be applied to the detection of performance anomalies. The main contributions of this paper are as follows:

1. We present an approach for building a performance anomaly dataset for NFVI.
2. We put forward an approach for detecting performance anomalies in NFVI with machine learning models.

The paper is organized as follows. The next section discusses related work. The methodology and implementation are presented in Sect. 3, and a case study on Clearwater is conducted in Sect. 4. Section 5 concludes the paper.

2 Related Works

Reliability studies of NFV technology, including performance and security topics, are active research areas in both academia and industry. In order to guarantee high and predictable performance of data plane workloads, [1] presents a list of minimal features that the Virtual Machine (VM) Descriptor and Compute Host Descriptor should contain for the appropriate deployment of VM images over an NFV Infrastructure (NFVI). Cotroneo et al. propose NFV-Bench [2] to analyze faulty scenarios and to provide joint dependability and performance evaluations for NFV systems. Bonafiglia et al. [3] provide a preliminary benchmark of widespread virtualization technologies when used in NFV, that is, when they are exploited to run virtual network functions and to chain them in order to create complex services. Naik et al. present the design and implementation of NFVPerf [4], a tool to monitor performance and identify performance bottlenecks in an NFV system. NFVPerf runs as part of a cloud management system such as OpenStack and sniffs traffic between NFV components in a manner that is transparent to the VNF.

Anomaly detection is an important data analysis task that detects abnormal data in a given dataset. It is an important data mining research problem that has been widely studied in many fields, and it can usually be addressed with statistical and machine learning methods [5–8]. In recent years, literature on anomaly detection in NFV has also begun to emerge. Kourtis et al. [9] use an open-source monitoring system especially tailored for NFV in conjunction with statistical approaches commonly used for anomaly detection, towards the timely detection of anomalies in deployed NFV services. Cotroneo et al. [10] propose an approach on an NFV-oriented IP Multimedia Subsystem (IMS) to detect problems affecting the quality of service, such as overload, component

Performance Anomaly Detection Models for NFVI


crashes, avalanche restarts and physical resource contention. EbAT [11] is an automated online detection framework for anomaly identification and tracking in data center systems. Fu [12] proposed a framework for autonomic anomaly detection on cloud systems that can select the most relevant among a large number of performance metrics.

We have been actively participating in the OPNFV (footnote 1) Yardstick project (footnote 2). In particular, we have been continuously and deeply involved in the evolution of the HA Yardstick framework architecture, and the fault injection techniques used in this paper are based on our previous research [13,14]. Recently we have also been participating in the OPNFV Bottlenecks project (footnote 3), a testing project that aims to find system bottlenecks by testing and verifying the OPNFV infrastructure in a staging environment before committing it to a production environment.

Most cloud operators identify performance bottlenecks by monitoring hardware resource utilization or other application-specific metrics obtained by instrumenting the application itself. In this paper, we detect performance anomalies by modeling the performance metrics collected from the virtual machines of the NFV platform.

3 Methodology and Implementation

3.1 Classification Problem

The performance anomaly detection method studied in this paper is based on classification. The essence of the anomaly detection problem is to train a detection model on the performance metrics collected from the virtual machines in the NFV infrastructure layer. The detection model then assigns the virtual machine state, as characterized by the performance metrics collected in real time, to one of several classes.

Fig. 1. The training and testing processes of anomaly detection model

1 https://www.opnfv.org/
2 https://wiki.opnfv.org/display/yardstick/Yardstick/
3 https://wiki.opnfv.org/display/bottlenecks/Bottlenecks/


As shown in Fig. 1, we are given the training performance metric samples $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\} \in (\mathbb{R}^n \times Y)^k$, where $x_i = [x_i^1, x_i^2, \ldots, x_i^n]^T \in \mathbb{R}^n$ is the input vector whose components are the performance metrics, $y_i$ is the corresponding anomaly label of $x_i$, $(x_i, y_i)$ is a sample of the training set, and $k$ is the size of the training set. For the multi-class classification problem, we need not only to determine whether there is an anomaly, but also to determine which kind of anomaly it is. Hence $y_i \in Y = \{1, 2, \ldots, c\}$, $i = 1, 2, \ldots, k$, where $c$ is the number of classes. By convention, $y_i = 1$ denotes the normal status and every other value of $y_i$ denotes an abnormal status. The solution is thus to find a decision function $y = f(x) : \mathbb{R}^n \to Y$ that can be used to infer the label $y_{new}$ of any new instance $x_{new}$. It performs detection together with localization of an anomalous behaviour by assigning one class label to each anomalous behaviour depending on its localization.

Machine learning is well suited to classification problems. To build the detection classifiers, samples of labeled monitoring data are needed to train the models to discern different system behaviours. A large pool of classification techniques is available; in this paper we consider some of the well-known classifiers: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests, Decision Trees and Neural Networks (NN).

Measures of classification performance can be built from a confusion matrix, which counts the correctly and incorrectly detected instances for each class of events. The confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a classifier. In a binary classification task, the terms 'positive' and 'negative' refer to the classifier's prediction, and the terms 'true' and 'false' refer to whether that prediction corresponds to the external judgment (sometimes known as the 'observation'). Given these definitions, the confusion matrix can be formulated as in Table 1.

Table 1. Confusion matrix

                                 Actual class (observation)
Predicted class (expectation)    Positive                               Negative
Positive                         TP (True Positive): correct result     FP (False Positive): unexpected result
Negative                         FN (False Negative): missing result    TN (True Negative): correct absence of result
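The confusion matrix extends directly from the binary case to the multi-class setting used in this paper. The following is a minimal sketch, not the authors' implementation: it assumes integer labels 1..c with 1 meaning normal (as stated above), and the mapping of labels 2 to 4 to CPU, memory and I/O anomalies is a hypothetical choice made only for the example.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, num_classes):
    """Count (predicted, actual) pairs for labels 1..num_classes.

    Rows are predicted classes and columns are actual classes,
    matching the layout of Table 1.
    """
    counts = Counter(zip(y_pred, y_true))
    return [
        [counts[(pred, actual)] for actual in range(1, num_classes + 1)]
        for pred in range(1, num_classes + 1)
    ]


# Hypothetical labels: 1 = normal, 2 = CPU, 3 = memory, 4 = I/O.
y_true = [1, 1, 1, 2, 2, 3, 4, 4]
y_pred = [1, 1, 2, 2, 2, 3, 4, 1]
matrix = confusion_matrix(y_true, y_pred, num_classes=4)

for row in matrix:
    print(row)
```

The diagonal holds the correctly classified samples; every off-diagonal cell is one kind of confusion between two classes.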

Accuracy, precision, recall and the F-measure are well-known performance measures for machine learning models. Intuitively, accuracy is easy to understand: it is the proportion of correctly classified samples among all samples,

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.$$

Generally speaking, the higher the accuracy, the better the classifier. Precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$

The $F_\beta$ measure can be interpreted as a weighted harmonic mean of the precision and recall:

$$F_\beta = (1 + \beta^2)\,\frac{\text{precision} \times \text{recall}}{\beta^2\,\text{precision} + \text{recall}}.$$

An $F_\beta$ measure reaches its best value at 1 and its worst at 0. With $\beta = 1$, $F_\beta$ equals $F_1$, and recall and precision are equally important.
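As a worked illustration of these measures, the sketch below evaluates them on a set of confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the functions do not guard against empty denominators.

```python
def accuracy(tp, fp, tn, fn):
    """Proportion of correctly classified samples among all samples."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Ability not to label a negative sample as positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Ability to find all the positive samples."""
    return tp / (tp + fn)


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts for one anomaly class treated as "positive".
tp, fp, tn, fn = 80, 20, 890, 10

acc = accuracy(tp, fp, tn, fn)
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_beta(p, r, beta=1.0)
print(round(acc, 4), round(p, 4), round(r, 4), round(f1, 4))
```

With $\beta = 1$ the result coincides with the closed form $2TP / (2TP + FP + FN)$, which is a convenient sanity check.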

3.2 Implementation

Performance Anomaly Detection Framework. We implement an anomaly detection framework that includes a system perturbation module, a cloud platform monitoring module, and a data processing and analysis module. The perturbation module generates workload and faultload to simulate performance issues or bottlenecks. At the same time, the monitoring module collects the relevant performance data according to the Key Performance Indicators (KPI). The goal of monitoring is to gather data samples from the target system via performance counters, the so-called monitoring metrics, from which the anomaly datasets are built. As shown in Table 2, the anomaly dataset is composed of three parts: the performance metrics, the anomalous behavior labels and the miscellaneous features, i.e., Schema = {Metrics ∪ AnomalyLabels ∪ MiscFeatures}. Metrics comprises the specific performance metrics, such as CPU usage and memory usage. AnomalyLabels indicates the type of a performance anomaly: a value of '1' means that the underlying anomaly occurs, and '0' means that it does not. The dataset also contains some miscellaneous features, such as the location of the VNF and the timestamp of the record. Finally, the data processing and analysis module is responsible for creating models that are trained offline for performance anomaly detection based on the anomaly dataset.

Table 2. The schema of the anomaly dataset
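A record following this three-part schema could be represented as in the sketch below. This is only an illustration under stated assumptions: the class name, the field names and the three anomaly label names are hypothetical, since the text only specifies that each anomaly label takes the value 0 or 1 and that host location and timestamp are miscellaneous features.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed label names; the source only states that each label is 0 or 1.
ANOMALY_TYPES = ("cpu_bottleneck", "memory_bottleneck", "io_bottleneck")


@dataclass
class AnomalyRecord:
    host: str                      # MiscFeatures: where the VNF is located
    timestamp: float               # MiscFeatures: time of the record
    metrics: Dict[str, float]      # Metrics: e.g. CPU usage, memory usage
    anomaly_labels: Dict[str, int] = field(
        default_factory=lambda: {name: 0 for name in ANOMALY_TYPES}
    )

    def is_normal(self) -> bool:
        """A record is normal when no anomaly label is set to 1."""
        return not any(self.anomaly_labels.values())


def make_record(host: str, timestamp: float, metrics: Dict[str, float],
                active_anomaly: Optional[str] = None) -> AnomalyRecord:
    """Build a record and set the label of the injected anomaly, if any."""
    record = AnomalyRecord(host, timestamp, dict(metrics))
    if active_anomaly is not None:
        if active_anomaly not in record.anomaly_labels:
            raise ValueError(f"unknown anomaly type: {active_anomaly}")
        record.anomaly_labels[active_anomaly] = 1
    return record


example = make_record("vm-1", 1530403200.0,
                      {"cpu_usage": 0.93, "memory_usage": 0.41},
                      active_anomaly="cpu_bottleneck")
print(example.is_normal(), example.anomaly_labels)
```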

Bottlenecks Simulation. To support research on NFV performance anomaly detection, performance anomalies and bottlenecks are simulated by the perturbation module as described in Algorithm 1, while the performance-related data in the NFVI layer are collected by the data monitoring module. The perturbation module generates both the workload and the faultload.


Algorithm 1. Bottlenecks injection controller
Input: vm_list, bottleneck_type_list, injection_duration, user_count, duration, pause
timer = start_timer()
while timer < duration do
    sip_simulate(user_count, duration)
    bottleneck_type = random(bottleneck_type_list)
    vm = random(vm_list)
    injection = new_injection(bottleneck_type)
    inject(vm, injection, injection_duration)
    sleep(pause)
end while
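The control flow of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' tool: `simulate_sip` and `inject` are hypothetical callbacks standing in for the SIP workload generator and the fault injector, and the injectable `clock`, `sleep` and `rng` arguments are added only so that the loop can be exercised without real waiting. The returned list plays the role of an injection log.

```python
import random
import time


def run_injection_controller(vm_list, bottleneck_type_list, injection_duration,
                             user_count, duration, pause,
                             simulate_sip, inject,
                             clock=time.monotonic, sleep=time.sleep, rng=random):
    """Repeatedly inject a random bottleneck into a random VM until
    `duration` seconds have elapsed; return (elapsed, vm, type) entries."""
    injection_log = []
    start = clock()
    while clock() - start < duration:
        simulate_sip(user_count, duration)          # workload (hypothetical callback)
        bottleneck_type = rng.choice(bottleneck_type_list)
        vm = rng.choice(vm_list)
        inject(vm, bottleneck_type, injection_duration)  # faultload (hypothetical callback)
        injection_log.append((clock() - start, vm, bottleneck_type))
        sleep(pause)
    return injection_log


class FakeClock:
    """Deterministic clock so that the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
injected = []
log = run_injection_controller(
    vm_list=["vm-a", "vm-b"],
    bottleneck_type_list=["cpu", "memory", "io"],
    injection_duration=5, user_count=100, duration=30, pause=10,
    simulate_sip=lambda users, seconds: None,
    inject=lambda vm, kind, seconds: injected.append((vm, kind, seconds)),
    clock=fake, sleep=fake.sleep, rng=random.Random(0),
)
print(len(log), [entry[0] for entry in log])
```

Keeping the timestamps of each injection is what later allows the monitoring samples to be labeled with the injected bottleneck type.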

Performance Metric Model. Classification-based techniques rely heavily on expert domain knowledge of the characteristics of performance issues or bottlenecks. This paper focuses on identifying performance anomalies from the monitoring data of the operating systems of the VMs in the NFVI, such as CPU consumption, disk I/O, and memory consumption. A classic Zabbix (footnote 4) OS monitoring template (footnote 5) is adopted as the performance metric model in this paper.

4 Case Study

4.1 Experimental Environment Setup

The testbed is built on a single DELL R730 physical server equipped with two Intel Xeon E5-2630 v4 CPUs @ 2.10 GHz, 128 GB of RAM and a 5 TB hard disk. The vIMS under test is the Clearwater project, an open-source implementation of an IMS for cloud computing platforms. The Clearwater application is installed on a commercial hypervisor-based virtualization platform (VMware ESXi). The 10 components of Clearwater are individually hosted in Docker containers on virtual machines (VMs), and the containers are managed by Kubernetes. In addition, there is an attack host for injecting bottlenecks into the Clearwater virtual hosts: the fault injection tool runs on this host, and Zabbix agents are installed on the other hosts, so that the performance data of each virtual host are collected by the agent while the faultload and workload are injected. The open source tool SIPp (footnote 6) is used as the workload generator for the IMS. Bottlenecks are simulated with fault injection techniques, following Algorithm 1 presented in the previous section.

4 https://www.zabbix.com/
5 https://github.com/chunchill/nfv-anomaly-detection-ml/blob/master/data/Features-Description-NFVI.xlsx
6 http://sipp.sourceforge.net/


The monitoring agent collects the performance data from each virtual host in each round, and the timestamp is recorded in the log file whenever a bottleneck is injected, so that the performance data can be labeled with the corresponding injection type according to the injection log. Finally, the performance dataset is built for the data analysis in the next section.

4.2 Experimental Results

There were three kinds of bottlenecks in the data: CPU bottlenecks, memory bottlenecks and I/O bottlenecks. In addition, data collected when no bottleneck was injected are labeled as 'normal'. We extracted a total of 3693 records from the experiment: 2462 of the normal class, 373 of the CPU bottleneck class, 266 of the memory bottleneck class and 592 of the I/O bottleneck class. The schema of a record consists of two identification fields (host, timestamp), 45 monitoring metric fields, and 4 labels (normal, CPU bottleneck, memory bottleneck, and I/O bottleneck). We used the following machine learning classifiers in comparative experiments: Neural Networks, Neural Networks combined with SVM, K-Nearest Neighbors, Linear SVM, Radial Basis Function (RBF) SVM, Decision Tree and Random Forests.

Table 3. Accuracy comparison results of machine learning classifiers

Models          Training set  Testing set
NN              0.94          0.90
NN+SVM          0.93          0.89
KNN             0.92          0.87
Linear SVM      0.80          0.83
RBF SVM         0.80          0.83
Decision Tree   0.77          0.80
Random Forest   0.90          0.89
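To put the accuracies of Table 3 in perspective, it helps to look at the class distribution of the 3693 records. The short computation below uses only the counts reported above; the dictionary keys are shorthand names chosen for the example, and the "majority baseline" (always predicting the normal class) is a reference point added here, not a model evaluated in the paper. Since the train/test split is not specified, the baseline is computed on the full dataset.

```python
# Class counts as reported in the text.
class_counts = {"normal": 2462, "cpu": 373, "memory": 266, "io": 592}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(class_counts.values()) / total

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
```

Because two thirds of the records are normal, accuracy alone can look favourable for a weak model, which is why the per-class precision and recall are also worth inspecting.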

As shown by the comparison results in Table 3, the neural network performs best on both the training set and the testing set. Table 4 shows the detailed results of the neural network. The epoch history of neural network training in Fig. 2 shows that accuracy and loss follow almost the same trend on the training set and the validation set, which indicates that no over-fitting occurs during training. These results suggest that neural networks are effective for detecting the performance anomalies. All of the experiment artifacts are available in a GitHub repository (footnote 7), including the fault injection tools, the datasets and the Python code.

7 https://github.com/chunchill/nfv-anomaly-detection-ml

Table 4. The results by neural network

Accuracy on training set: 0.94

Labels     Precision  Recall  F1-Score
Normal     0.97       0.95    0.96
cpu        0.90       0.93    0.92
Memory     0.91       0.85    0.88
I/O        0.87       0.95    0.91
avg/total  0.94       0.94    0.94

Accuracy on testing set: 0.90

Labels     Precision  Recall  F1-Score
Normal     0.96       0.92    0.94
cpu        0.81       0.90    0.86
Memory     0.86       0.78    0.82
I/O        0.75       0.88    0.81
avg/total  0.91       0.90    0.90

Fig. 2. The accuracy and loss trend of Neural Networks for both training set and validation set

5 Conclusion

This paper has proposed a machine learning based performance anomaly detection approach for NFV-oriented cloud system infrastructure. Because it is difficult for researchers to obtain comprehensive and accurate data on abnormal behaviors in a real NFV production environment, a system perturbation technique that simulates faultload and workload is presented. A monitoring module is integrated into the anomaly detection framework to monitor and evaluate the platform; it is responsible for constructing the anomaly dataset, which consists of anomaly labels and multi-dimensional monitoring metrics. Finally, machine learning models are fitted on the anomaly dataset. The experimental results show that machine learning classifiers can be effectively applied to the performance anomaly detection problem, and that the neural network is the best detection model, with an accuracy of 0.94 on the training set and 0.90 on the testing set.

Acknowledgement. This work has been supported by the National Natural Science Foundation of China (Grant No. 61672384); part of the work has also been supported by Huawei Research Center under Grant No. YB2015120069. We also acknowledge the OPNFV project: some of the ideas come from the OPNFV community, and we gained much inspiration from discussions while involved in the OPNFV Yardstick and Bottlenecks projects.

References

1. ETSI GS NFV-PER 001. https://www.etsi.org/deliver/etsi_gs/NFV-PER/. Accessed 1 July 2018
2. Cotroneo, D., De Simone, L., Natella, R.: NFV-bench: a dependability benchmark for network function virtualization systems. IEEE Trans. Netw. Serv. Manag., 934–948 (2017)
3. Bonafiglia, R., et al.: Assessing the performance of virtualization technologies for NFV: a preliminary benchmarking. In: European Workshop on Software Defined Networks (EWSDN), pp. 67–72. IEEE (2015)
4. Naik, P., Shaw, D.K., Vutukuru, M.: NFVPerf: online performance monitoring and bottleneck detection for NFV. In: International Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), pp. 154–160. IEEE (2016)
5. Liu, D., et al.: Opprentice: towards practical and automatic anomaly detection through machine learning. In: Proceedings of the Internet Measurement Conference, pp. 211–224. ACM (2015)
6. Li, K.-L., Huang, H.-K., Tian, S.-F., Wei, X.: Improving one-class SVM for anomaly detection. In: IEEE International Conference on Machine Learning and Cybernetics, vol. 5, pp. 3077–3081 (2003)
7. Shanbhag, S., Gu, Y., Wolf, T.: A taxonomy and comparative evaluation of algorithms for parallel anomaly detection. In: ICCCN, pp. 1–8 (2010)
8. Yairi, T., Kawahara, Y., Fujimaki, R., Sato, Y., Machida, K.: Telemetry-mining: a machine learning approach to anomaly detection and fault diagnosis for space systems. In: Second International Conference on Space Mission Challenges for Information Technology (SMC-IT), p. 8. IEEE (2006)
9. Kourtis, M.A., Xilouris, G., Gardikis, G., Koutras, I.: Statistical-based anomaly detection for NFV services. In: International Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), pp. 161–166. IEEE (2016)
10. Cotroneo, D., Natella, R., Rosiello, S.: A fault correlation approach to detect performance anomalies in virtual network function chains. In: IEEE 28th International Symposium on Software Reliability Engineering (ISSRE), pp. 90–100 (2017)
11. Wang, C., Talwar, V., Schwan, K., Ranganathan, P.: Online detection of utility cloud anomalies using metric distributions. In: Network Operations and Management Symposium (NOMS), pp. 96–103. IEEE (2010)
12. Fu, S.: Performance metric selection for autonomic anomaly detection on cloud computing systems. In: Global Telecommunications Conference (GLOBECOM), pp. 1–5. IEEE (2011)
13. Du, Q., et al.: High availability verification framework for OpenStack based on fault injection. In: International Conference on Reliability, Maintainability and Safety (ICRMS), pp. 1–7. IEEE (2016)
14. Du, Q., et al.: Test case design method targeting environmental fault tolerance for high availability clusters. In: International Conference on Reliability, Maintainability and Safety (ICRMS), pp. 1–7. IEEE (2016)

Emergence of Sensory Representations Using Prediction in Partially Observable Environments

Thibaut Kulak and Michael Garcia Ortiz

SoftBank Robotics Europe - AI Lab, Paris, France

Abstract. In order to explore and act autonomously in an environment, an agent can learn from the sensorimotor information that is captured while acting. By extracting the regularities in this sensorimotor stream, it can build a model of the world, which in turn can be used as a basis for action and exploration. This requires the acquisition of compact representations from possibly high dimensional raw observations. In this paper, we propose a model that integrates sensorimotor information over time and projects it into a sensory representation. It is trained by performing sensorimotor prediction. Using a simple example, we highlight the role of motor information and memory in learning sensory representations.

1 Introduction

Autonomous Learning for Robotics aims to endow agents with the capability to learn from and act in their environment, so that they can adapt to previously unseen situations. An agent can learn from this interaction by building compact representations of what it encounters in its environment, using information captured from high dimensional raw sensory input and motor output. Theories of sensorimotor prediction state that an agent learns the structure of its world by learning how to predict the consequences of its actions [2,12].

The sensorimotor approach proposes to learn sensory and motor representations by identifying the regularities in the sensorimotor stream. However, these regularities are hard to capture: a robotic agent acts and perceives in an environment that is usually partially observable (limited field of view), noisy and ambiguous. The sensory information is not sufficient to know the exact state of the agent in its environment (similar sensory states can originate from different situations in the environment). This is particularly true for navigation tasks, where an agent can observe several occurrences of very similar portions of the scene (walls, corners) at different locations in the environment (e.g. in a maze). For these reasons, we need representations that can help disambiguate the observations and the state of the agent.

In the case of an autonomous agent, without labeled data, unsupervised learning makes it possible to learn compressed representations of different data streams [6,13,16]. These representations, based on the statistics of the data, reduce the dimensionality

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 489–498, 2018. https://doi.org/10.1007/978-3-030-01421-6_47


T. Kulak and M. G. Ortiz

of the sensory stream, but do not inform the agent about the modalities of its potential actions in its environment, which is related to the problem of grounding knowledge in the experience of an agent [5].

In order to build representations, a classic approach is to learn forward internal models [3], that is, to learn to predict the sensory consequences of actions. For instance, a forward model of physics is learned for a real-world robotic platform in [1]. Recently, [4] proposed to build world models by learning forward models, and to use them to train policies in different Reinforcement Learning environments. The authors of [9] present a complete overview of the current methods for learning representations in robotics.

In this paper, we propose to learn sensory representations using principles from sensorimotor prediction (or forward models) and to study the properties of the learned representations. We show, on a navigation scenario, that using motor information as well as a short-term memory leads to sensory representations that correspond to richer classes of sensory stimuli encountered in the environment. Recent work also proposes to learn sensory representations by sensorimotor prediction [4,17], and shows that the learned representations can be successfully used for navigation or control tasks. In this paper, we are interested in studying the nature of the representations that are learned.

2 Sensorimotor Predictive Model

We train a forward model, named Recurrent Sensorimotor Encoder (Recurrent-SM-encoder), shown in Fig. 1, and composed of three subnetworks: (i) A sensory encoding subnetwork takes as input the sensory state $s_t$ and outputs an encoded sensory state $z_t^s$. It is composed of hidden layers followed by a stacked Long Short-Term Memory (LSTM) network, whose role is to provide a form of memory of the previous sensor states. (ii) A motor encoding subnetwork, a classical dense network composed of hidden layers, takes as input the motor command $m_t$ and outputs the encoded motor command $z_t^m$. (iii) $z_t^s$ and $z_t^m$ are concatenated to form the encoded sensorimotor vector $z_t^{sm}$, which is the input of a dense network that outputs a prediction $\hat{s}_{t+1}$ of the next sensory state. We use several baselines (see Fig. 2) to evaluate the role of motor information and memory: the Sensorimotor Encoder (SM-encoder) has no memory, the Recurrent Sensory Encoder (Recurrent-S-encoder) has no motor input, and the Sensory Encoder (S-encoder) has neither memory nor motor input. We train the proposed networks using a loss that minimizes the prediction error:

$$L_2 = \sum_{t=1}^{T-1} \left(\hat{s}_{t+1} - s_{t+1}\right)^2 \qquad (1)$$

where T is the size of the learning batch.
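As an illustration of Eq. (1), the following minimal sketch computes the loss for a batch of predicted and observed next sensory states. It assumes that the square of a vector difference denotes the squared Euclidean norm; the function name `prediction_loss` and the toy values are hypothetical and are not taken from the paper.

```python
def prediction_loss(predicted, observed):
    """Sum over the batch of squared errors between predicted and observed states.

    predicted[i] and observed[i] are equal-length lists of sensor values,
    standing for the predicted and the actual next sensory state at one step.
    Assumption: the squared difference is the squared Euclidean norm.
    """
    total = 0.0
    for s_hat, s_next in zip(predicted, observed):
        total += sum((p - o) ** 2 for p, o in zip(s_hat, s_next))
    return total


# Toy batch with two steps and two sensors (values chosen for illustration).
predicted = [[1.0, 2.0], [0.5, 0.5]]
observed = [[1.0, 1.0], [0.0, 0.5]]
loss = prediction_loss(predicted, observed)
```

In practice this quantity would be minimized by gradient descent over the network weights; the sketch only makes the summation explicit.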

Emergence of Sensory Representations Using Prediction


Fig. 1. Recurrent Sensorimotor Encoder

Fig. 2. Architectures of the baselines (panels: Sensory Encoder, Recurrent Sensory Encoder, Sensorimotor Encoder)

3 Experimental Setup

Our simulated agent (inspired by the Thymio-II robot [14]) is equipped with 5 distance sensors, evenly spaced between −0.6 and 0.6 rad, with their range limited to 10 units of distance. The agent controls its forward translation (in the direction of the middle laser) and its rotation. One motor command (d, r) is the succession of a translation d and a rotation r. It is a planar agent moving without friction, and there is no noise on its distance sensors. We created 3 environments of size 50 units, shown in Fig. 3: Square is a square without walls or obstacles. Room1 additionally contains one vertical wall and one horizontal wall. Room2 contains one horizontal and three vertical walls. The agent moves by random forward translations and random rotations, while avoiding collisions with the walls. At each timestep, if one distance sensor value is smaller than 1 unit, the agent rotates by $r \sim U(\pi - \frac{\pi}{10}, \pi + \frac{\pi}{10})$ radians (U denoting the uniform distribution). If not, the agent moves forward by $d \sim U(0, 1)$ units, and rotates by $r \sim U(-\frac{\pi}{6}, \frac{\pi}{6})$ radians. Figure 4 displays the trajectory of the agent during 10 000 steps in the different proposed environments. We generated a sequence of 1 000 000 timesteps for each environment (each point has 5 distance sensor values and 2 motor commands), split as follows: the first 80% for training, the next 10% for validation, and the last 10% for testing.
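The exploration rule and the chronological split described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the simulator itself (geometry, sensor readings) is not reproduced, and setting the translation to zero during an avoidance turn is an assumption, since the text only specifies the rotation in that case.

```python
import math
import random


def sample_motor_command(sensors, rng):
    """Return one motor command (d, r): a translation d followed by a rotation r."""
    if min(sensors) < 1.0:
        # A wall is closer than 1 unit: turn by roughly half a turn.
        # Assumption: no translation is applied during this avoidance step.
        d = 0.0
        r = rng.uniform(math.pi - math.pi / 10, math.pi + math.pi / 10)
    else:
        d = rng.uniform(0.0, 1.0)
        r = rng.uniform(-math.pi / 6, math.pi / 6)
    return d, r


def chronological_split(sequence):
    """Split a sequence into its first 80%, next 10% and last 10%."""
    n = len(sequence)
    train_end = n * 8 // 10
    val_end = n * 9 // 10
    return sequence[:train_end], sequence[train_end:val_end], sequence[val_end:]


rng = random.Random(0)
free_command = sample_motor_command([10.0] * 5, rng)               # nothing in range
avoid_command = sample_motor_command([0.5, 10.0, 10.0, 10.0, 10.0], rng)
train, val, test = chronological_split(list(range(100)))
```

Splitting by position rather than at random keeps the temporal order of the sequence, which matters for the models that use a memory.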


Fig. 3. The different environments created: (a) Square, (b) Room1, (c) Room2.

Fig. 4. Trajectories in the environments (10 000 points): (a) Square, (b) Room1, (c) Room2.

In Fig. 5 we reconstruct, for different situations, what the agent perceives based on its sensors. Note that the agent does not have access to the position and angles of its distance sensors; it only receives as input a 5-dimensional real vector.

Fig. 5. Examples of different sensory stimuli perceived by the agent: (a) perceiving nothing, (b) in front of a line, (c) in front of a corner, (d) with a wall at its left, (e) in front of a wall's end. The 5 red dots represent the distance perceived by the agent, projected in top-view. (Color figure online)

4 Results

4.1 Numerics

Our models are trained with the Adam optimizer [8] (learning rate of 0.001). The training is stopped if the loss on the validation set does not decrease by 5% for 10 consecutive epochs. We use a mini-batch size of 64, and ReLUs as activation functions. We arbitrarily choose the sensory representation space to be 10-dimensional and the motor representation space to be 5-dimensional. The number and size of layers in the different architectures are as follows. In the SM-encoder, the sensory encoding and motor encoding subnetworks have 3 hidden layers of size 16, 32 and 64, while the prediction subnetwork has one layer of size 128. The S-encoder is identical to the SM-encoder, without the motor encoding subnetwork. In the Recurrent-SM-encoder, the sensory encoding and motor encoding subnetworks have 1 hidden layer of size 16, while the prediction subnetwork has one layer of size 128. The (stacked) LSTM has 3 layers with 32 units at each layer, and a truncation horizon of 20. The Recurrent-S-encoder is identical to the Recurrent-SM-encoder, without the motor encoding.

4.2 Sensorimotor Prediction Results

We report in Table 1 the L2 prediction error of the models trained on the Square environment, and tested on the three environments. First, we verify that models using motor information largely outperform those without, which makes sense because motor commands are necessary to predict the next sensory state. We also see that models using a memory perform better than their memoryless counterparts, confirming that a memory is useful for accurate sensorimotor prediction. Finally, we note that the Recurrent-SM-encoder model performs best. This is to be expected, as it benefits from additional information. We verified that these observations hold when the models are trained on Room1 and Room2.

Table 1. Sensorimotor prediction L2 error of the models trained on Square and tested on the test dataset of the three environments.

Model                  Square   Room1    Room2
S-encoder              0.0374   0.0430   0.0729
SM-encoder             0.0056   0.0145   0.0257
Recurrent-S-encoder    0.0359   0.0407   0.0697
Recurrent-SM-encoder   0.0024   0.0105   0.0181

4.3 Representation Spaces

We plot in Fig. 6 the representation spaces learned by our models, projected on the first two principal components extracted with a Principal Component Analysis (PCA) [7]. We color-code those spaces by the minimum value of the 5 lasers, as this gives information about the distance to the wall the agent perceives. We observe that the models without motors group states where the agent does not see anything with states where the agent sees a wall from a very short distance, because its behavior (avoiding collision, see Sect. 3) makes it experience sensory transitions from seeing a wall very close to seeing nothing. Without access to motor commands, the model brings those states close to each other, while in reality those states are fundamentally different. We see that the portion of the representation space corresponding to the agent perceiving nothing is larger with the Recurrent-SM-encoder than with the Recurrent-S-encoder. We interpret this as memory and the information about motor commands helping to create different states for points where the agent does not see anything.

Fig. 6. Representation spaces learned on the Square environment, colored by the minimum value of the lasers: (a) Sensor, (b) S-encoder, (c) Recurrent-S-encoder, (d) SM-encoder, (e) Recurrent-SM-encoder. (Color figure online)

4.4 Clusters Extraction

We cluster the sensory representation spaces learned for each model, and visualize the activation of the different clusters in the environments, in order to estimate whether the sensory encoding learns spatial features. We sample random sensorimotor transitions and use a k-means algorithm [10] to extract 20 clusters from each sensory representation space. We plot for each cluster the ground truth position and orientation of 500 random data points associated with this cluster. We show in Fig. 7, as a baseline, the 20 clusters extracted from the S-encoder representation space. We see that there are clusters corresponding to different distances/angles to the wall. As there is no memory in this model, all of the configurations in which the agent does not perceive anything are in the same cluster. We see in Fig. 8 that the Recurrent-SM-encoder representation space trained on the Square environment contains clusters corresponding to different distances to a wall, and also a cluster corresponding to corners. We observe different clusters corresponding to an absence of visual stimuli, but at different distances from a wall (when the wall is behind the agent). The LSTM provides the agent with a memory of previous events, and it contains a form of spatial information. However, this memory is short-term, as it is relative to the previous wall that has been seen, and there is no global notion of position in the environment. We show in Fig. 9 the clusters extracted from the Recurrent-SM-encoder model trained on the Room1 environment. We observe that, in addition to clusters similar to those appearing in the Square environment, there is now a cluster corresponding to wall ends. We note, however, that when training on Room2, the cluster corresponding to wall ends is not visible with 20 clusters extracted. We hypothesize that the layout causes the agent to be stuck in the different rooms, reducing the number of appearances of wall ends in the database.

4.5 Robustness to Testing Environment

In this experiment, we evaluate whether the representations learned in one environment transfer to other environments. We train the Recurrent-SM-encoder as well as our clustering algorithm on one environment, then apply the learned representations and clusters in other environments. We show in Fig. 10 the transfer of some clusters of interest learned on Square. We show in Fig. 11 the transfer of a few clusters of interest learned on Room1 to other environments. We observe that the representations learned in one environment can be used in other environments with different spatial layouts. This is to be expected, as the LSTM only captures and retains short-term information, which represents sensorimotor transitions but does not represent the different spatial layouts of the environments.

Fig. 7. S-encoder representation space clusters

Fig. 8. Recurrent-SM-encoder representation space clusters

Fig. 9. Recurrent-SM-encoder representation space clusters, trained on Room1

Fig. 10. Transferring some Square clusters: (a) Clusters in Square, (b) Transferred to Room1, (c) Transferred to Room2

Fig. 11. Transferring some Room1 clusters: (a) Clusters in Room1, (b) Transferred to Square, (c) Transferred to Room2
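The clustering and transfer procedure described in Sects. 4.4 and 4.5 can be illustrated with a small sketch: clusters are fitted with Lloyd's k-means iterations on representations from one environment, and representations from another environment are then assigned to the nearest learned centroid. The toy 2-dimensional points stand in for encoded sensory states and are purely hypothetical, as are the function names; the paper uses 20 clusters in a 10-dimensional space.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centroids)].append(p)
        for c, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Hypothetical representations collected in a first environment.
points_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centroids = kmeans(points_a, k=2)

# Transfer: representations from a second environment reuse the same centroids.
points_b = [[0.2, 0.1], [4.8, 5.2]]
labels_b = [assign(p, centroids) for p in points_b]
```

Because the centroids are frozen after fitting, a cluster index keeps the same meaning in every environment, which is what makes the side-by-side comparison of Figs. 10 and 11 possible.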

5 Conclusion

In this paper we proposed to use an unsupervised learning method based on sensorimotor prediction that allows an agent to acquire sensory representations by integrating sensorimotor information using recurrent neural networks. We observed that our model extracts classes of interaction with the environment that seem qualitatively meaningful, and which contain temporal information through short-term memory of previous experiences. In particular, we verified that the motor commands and memory are very beneficial for learning sensory representations through prediction. We note that the clusters of the sensory representation are similar to particular cells observed in mammals, such as distance, orientation, and border cells [11]. We noticed that the representation learned in one environment can be used in other environments with different spatial layouts. We used a generic approach, inspired by recent proposals about the nature and emergence of autonomy and intelligence through sensorimotor prediction [2]. It uses only raw data, and requires (in our simple experiment) very few engineering biases. In future work we want to investigate whether it scales to more complex environments and sensory streams, and whether it can be applied on a robotic platform in a real human environment. One interesting possible extension would be to use the representations to learn a map of the environment. We plan to investigate how to build a graph where the nodes would correspond to particular activations of the representation, and the edges would correspond to the motor commands necessary to transition from one representation to the other. We want to study the compression of this graph to obtain compact spatial representations, as proposed in [15,17]. In general, the proposed approach deals with very low-level processing of sensorimotor streams in order to build meaningful representations. The usefulness of these representations, and how they can be integrated in a cognitive architecture, would have to be demonstrated. We plan to use the learned representations in a Reinforcement Learning task. On the one hand, the success rate at the task gives a clear quantitative evaluation. On the other hand, it will allow us to evaluate the benefits of learning representations in terms of generalization, abstraction, and transfer of knowledge across different environments.


References

1. Agrawal, P., Nair, A.V., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: experiential learning of intuitive physics. In: Advances in Neural Information Processing Systems, pp. 5074–5082 (2016)
2. Friston, K.: The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11(2), 127–138 (2010)
3. Ghahramani, Z., Wolpert, D.M., Jordan, M.I.: An internal model for sensorimotor integration. Science 269, 1880–1882 (1995). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.74
4. Ha, D., Schmidhuber, J.: World models (2018). https://worldmodels.github.io
5. Harnad, S.: The symbol grounding problem (1990). http://cogprints.org/3106/
6. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
7. Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24(6), 417 (1933)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Lesort, T., Díaz-Rodríguez, N., Goudou, J.F., Filliat, D.: State representation learning for control: an overview. ArXiv e-prints, February 2018
10. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
11. Moser, M.B., Rowland, D.C., Moser, E.I.: Place cells, grid cells, and memory. Cold Spring Harb. Perspect. Biol. 7(2), a021808 (2015)
12. O'Regan, J.K., Noë, A.: A sensorimotor account of vision and visual consciousness. Behav. Brain Sci. 24(5), 939–973 (2001)
13. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
14. Riedo, F., Rétornaz, P., Bergeron, L., Nyffeler, N., Mondada, F.: A two years informal learning experience using the Thymio robot. Adv. Auton. Mini Robot. 101, 37–48 (2012)
15. Stachenfeld, K.L., Botvinick, M.M., Gershman, S.J.: The hippocampus as a predictive map. Nat. Neurosci. 20(11), 1643–1653 (2017). https://doi.org/10.1038/nn.4650
16. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
17. Wayne, G., et al.: Unsupervised predictive memory in a goal-directed agent. CoRR abs/1803.10760 (2018). http://arxiv.org/abs/1803.10760

Signal Detection

Change Detection in Individual Users’ Behavior

Parisa Rastin1,2(✉), Guénaël Cabanes1,2, Basarab Matei1,2, and Jean-Marc Marty1,2

1 LIPN-CNRS, UMR 7030, Université Paris 13, 99 Avenue J-B. Clément, 93430 Villetaneuse, France
{rastin,cabanes,matei}@lipn.univ-paris13.fr
2 Mindlytix, 10 Rue Pergolèse, 75116 Paris, France

Abstract. The analysis of dynamic data is challenging. Indeed, the structure of such data changes over time, potentially at a very fast pace. In addition, the objects in such data-sets are often complex. In this paper, our practical motivation is to perform user profiling, i.e. to follow users’ geographic location and navigation logs to detect changes in their habits and interests. We propose a new framework in which we first create, for each user, a signal of the evolution in the distribution of their interests and another signal based on the distribution of physical locations recorded during their navigation. Then, we automatically detect the changes in interest or location thanks to a new jump-detection algorithm. We compared the proposed approach with a set of existing signal-based algorithms on a set of artificial data-sets and we showed that our approach is faster and produces fewer errors for this kind of task. We then applied the proposed framework on a real data-set and we detected different categories of behavior among the users, from users with very stable interests and locations to users with clear changes in their behaviors, either in interest, location or both.

Keywords: Time series · Change detection · Signal-based approaches · Users profiling

1 Introduction

With the current progress of technology, the amount of recorded data is perpetually increasing and the need for fast and efficient analysis algorithms is more important than ever. One of the major challenges in data mining is the detection of change in dynamic data-sets. Indeed, as new data are constantly recorded, the structure of the data-set can vary over time. This phenomenon is known as “concept drift” [9, 18]. One direct application, which is our practical interest in this paper, is the detection of change in users’ behavior and interest based on data recorded during their online navigation. This task is known as “user profiling” and has a high economic importance for companies in the field of online advertising. Profiling tasks aim at recognizing the “mindset” of users through their navigation on various websites or their interaction with digital “touch points” (the various ways in which a brand interacts with and displays information to prospective and current customers). It intervenes in the international market for “programmatic advertising” tasks, by assigning a profile to users connecting to a site that can offer advertising, so that the displayed advertising corresponds best to the needs of the users. Indeed, being able to detect when a user changes his interest or when he moves to another city or country is very important to adjust the advertising strategy regarding this user. These profiles are computed from a very large database of internet browsing which lists URL sequences or touch points visited by a large number of people. Each URL of a “touch point” is characterized by contextual and semantic information. In this context, each user is described as a time series of URL categories and physical locations. The URL categories are computed using a clustering approach adapted to complex data [16, 17]. The locations are recorded using geolocation information collected during the user’s navigation but are restricted to a series of postal codes.

The detection of changes in time series involves the extraction of “stable” periods, separated by usually short periods of variation. There are therefore two main strategies: either the algorithm focuses on detecting the different periods of stability in the time series, or it focuses on detecting the periods of variation [1, 3, 4, 6, 10, 11]. Detecting stability or homogeneity is related to the task of data stream clustering. The detection of variation in the series can be related to signal analysis approaches. In both cases, the time series must be segmented into several time windows that will be compared to find either similarities or variations [9, 18]. In this paper we consider a sliding time window with a step of one day, in order to obtain for each window a distribution of locations or interests. Most clustering approaches are not adapted to distributional data and cannot be applied without costly adaptations. However, it is not difficult to compute the pairwise dissimilarities between adjacent windows, using a suitable metric, in order to produce a signal representing the variations in each user’s behavior. The main challenge in this case is to discriminate meaningful variation in the signal from random noise. The usual approach is to apply a smoothing function to the signal in order to retain only the significant variations [2, 5, 7, 12]. The main advantages of such algorithms are the computation speed and the absence of user-defined parameters, which are usually difficult to tweak.

We propose in this paper a new signal-based approach, described in Sect. 2, adapted to the profiling task. This algorithm is based on a multi-scale smoothing of the computed signal, allowing a better elimination of non-significant variations in the signal. We then tested the algorithm on simulated data to validate its quality in comparison to traditional approaches; results are presented in Sect. 3. Finally, we applied the proposed framework on a real industrial data-set, as shown in Sect. 4. A conclusion is given in Sect. 5.

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 501–510, 2018. https://doi.org/10.1007/978-3-030-01421-6_48

2 General Framework

The detection of changes in users’ behavior provides valuable information, which can help marketing companies to offer the right product to users based on their needs. In this work we used the users’ geographic location to detect changes in their geographic habits, and the users’ navigation logs to detect variation in their interests. We first created a signal for each user based on distributions representing his behavior. Similarities between distributions are computed with the Jensen-Shannon metric. Then, by using a change detection algorithm, we detected the dates where there was a change in the user’s behavior.

2.1 Signal Computation

In order to detect changes in the users’ behavior, we applied a change detection algorithm described below. This algorithm detects unusual “jumps” in a signal characterizing behavioral variations. To construct such a signal, where a change in behavior is characterized by a jump, we defined the distribution of labels or postal codes in the first time window as the reference behavior. Then, the window is shifted one day at a time, in order to produce a series of distributions. For example, if in a time window a user has been detected in France 7 times in Strasbourg (postal code 67000) and 3 times in Nancy (postal code 54000), the distribution for this user and this time window will be {67000: 70%, 54000: 30%}. The signal is created from the dissimilarities between the distributions in the sliding time window and the distribution of reference. The signal thus obtained represents the evolution of the differences with respect to the reference window and makes it possible to detect significant changes in distributions: a move or a change of interest. The similarity between two probability distributions (reference window and shifted windows) is computed by a metric called Jensen-Shannon divergence [8, 15]. It is based on the Kullback-Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The Jensen-Shannon divergence (JS) is a symmetrized and smoothed version of the Kullback-Leibler divergence $D_{KL}(P \| Q)$ between two discrete distributions. It is defined by

$$JS(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$. For discrete probability distributions P and Q, the Kullback-Leibler divergence from P to Q is defined [14] to be

$$D_{KL}(P \| Q) = -\sum_i P(i) \log \frac{Q(i)}{P(i)} = \sum_i P(i) \log \frac{P(i)}{Q(i)}.$$
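The construction of the signal can be sketched as follows: each window is turned into a distribution of labels, and the Jensen-Shannon divergence to the reference window is computed for each one-day shift. This is a minimal illustration, not the authors' implementation. Base-2 logarithms are assumed so that the divergence lies in [0, 1], events are assumed to be (day, label) pairs, and treating an empty window as identical to the reference is an arbitrary choice made for the sketch.

```python
import math
from collections import Counter


def distribution(labels):
    """Relative frequency of each label, e.g. {'67000': 0.7, '54000': 0.3}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; terms with P(i) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / q[i]) for i, pi in p.items() if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (assumption)."""
    support = set(p) | set(q)
    m = {i: 0.5 * (p.get(i, 0.0) + q.get(i, 0.0)) for i in support}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def behavior_signal(events, window, n_days):
    """JS divergence between the first window and each window shifted by one day.

    events is a list of (day, label) pairs; a window covers the days
    start, start + 1, ..., start + window - 1.
    """
    def labels_in(start):
        return [label for day, label in events if start <= day < start + window]

    reference = distribution(labels_in(0))
    signal = []
    for start in range(n_days - window + 1):
        labels = labels_in(start)
        # Assumption: an empty window is treated as identical to the reference.
        signal.append(js_divergence(reference, distribution(labels)) if labels else 0.0)
    return signal


# The postal-code example from the text.
p = distribution(["67000"] * 7 + ["54000"] * 3)

# Hypothetical user: label 'A' on days 0-9, label 'B' on days 10-19.
events = [(day, "A") for day in range(10)] + [(day, "B") for day in range(10, 20)]
signal = behavior_signal(events, window=5, n_days=20)
```

With this toy user the signal is 0 while the window contains only 'A', rises while the window straddles day 10, and reaches 1 once the window contains only 'B'.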

Note that any zero probabilities in P or Q are ignored in the computation, meaning that two totally different distributions will have a JS value of 1 (when logarithms are taken in base 2). The proposed approach has been tested on the artificial data-sets for validation, then applied on the real data-sets to analyze the changes in users’ behaviors.

2.2 Proposed Multi-scale Change Detection Algorithm

Algorithm 1 describes the multi-scale change detection approach. The idea is that an iterative smoothing process eliminates random fluctuations in the signal, after which unusually high variations are detected. The signals are piece-wise continuous functions having discontinuities at some locations $x_i$, i.e., $v(x_i^+) \neq v(x_i^-)$. For this type of function, there exist many approaches to locate the singularities. These can be either signal based (i.e., one detects large amplitude variations using an appropriate threshold) or multi-scale coefficients based. We consider that $v_k^j$ are the averages of some function v discretized on the intervals $I_{j,k} = 2^{-j}[k, k+1[$. In the multi-scale coefficients-based approach, a strategy to detect the singularities at level j is based on a criterion that uses the first or the second order differences of $v^j$. In these approaches, the jump singularities detection is carried out at each level independently.

Algorithm 1. Changes Detection in Behavior Signal

Input: signal vector v of length N.
Output: list of detected changes.
Initialize j = [log2(N)] and v^j = v
Initialize global list of jumps L_g = ∅
while j > [log2(N)] − 4 do
    Smoothing:
    for k ← 1, length(v^j)/2 do
        v_k^{j−1} = (v_{2k−1}^j + v_{2k}^j) / 2
    Initialize local list of jumps L_j = ∅
    Compute cost function dv based on first order finite differences:
    for k ← 1, length(v^{j−1}) do
        dv_k^{j−1} = |Δv_k^{j−1}| + |Δv_{k+1}^{j−1}|
    Compute local maxima of cost function dv:
    for k ← 1, length(v^{j−1}) do
        if dv_k^{j−1} > max(dv_{k−2}^{j−1}, dv_{k−1}^{j−1}, dv_{k+1}^{j−1}, dv_{k+2}^{j−1}) then
            L_j ← L_j + {k}
    j ← j − 1
Define L_g as the intersection of all L_j: L_g = ∩ L_j for j = [log2(N)] − 4, ..., [log2(N)]

However, we propose a strategy to locate the jump singularities at a given level j by taking into account the detection at other levels. We detect intervals $I_{j,k}$ potentially containing a jump singularity as those containing the local maxima of $dv_k^j = |\Delta v_k^j| + |\Delta v_{k+1}^j|$, where $\Delta$ is the first order finite difference operator $\Delta v_k^j = v_{k-1}^j - v_k^j$, and where $v^j$ is obtained by successive averaging of $v^J$, J being the finest level of discretization. We then compute the number $N_j$ of singularities at level j, and we define $j_{max}$ as the largest level j such that $N_{j-1} = N_j$. We also define the level $j_{min}$ as the smallest j such that $N_j = N_{j_{max}}$. A singularity detected in $I_{j,k}$ for $j_{min} < j < J$ is called admissible if there exists a singularity inside $I_{j+1,2k}$ or $I_{j+1,2k+1}$. This definition implies that admissible singularities make up chains when j varies. In that context, the singularities at level j must be separated by more than 2 intervals at the finest level.
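The multi-scale idea can be illustrated with a simplified sketch: the signal is repeatedly smoothed by averaging pairs of samples, the cost dv is computed at each level, its local maxima are extracted, and only the maxima that are confirmed at every coarser level are kept. This is not the authors' implementation. The tie-breaking rule for equal costs, the tolerance of one interval when matching levels, the use of four levels, and the mapping of a detection back to an index of the original signal are all assumptions made for this example.

```python
def smooth(v):
    """Halve the resolution by averaging consecutive pairs of samples."""
    return [(v[2 * k] + v[2 * k + 1]) / 2 for k in range(len(v) // 2)]


def cost(v):
    """dv_k = |v_{k-1} - v_k| + |v_k - v_{k+1}|; border entries are left at 0."""
    dv = [0.0] * len(v)
    for k in range(1, len(v) - 1):
        dv[k] = abs(v[k - 1] - v[k]) + abs(v[k] - v[k + 1])
    return dv


def local_maxima(dv):
    """Indices whose cost dominates the two neighbours on each side.

    Assumption: ties are broken in favour of the leftmost index, so the
    comparison is strict on the left and non-strict on the right.
    """
    peaks = []
    n = len(dv)
    for k in range(n):
        left = [dv[i] for i in (k - 2, k - 1) if i >= 0]
        right = [dv[i] for i in (k + 1, k + 2) if i < n]
        if dv[k] > 0 and all(dv[k] > x for x in left) and all(dv[k] >= x for x in right):
            peaks.append(k)
    return peaks


def detect_changes(signal, levels=4):
    """Approximate positions (in the original signal) of jumps seen at all levels."""
    v = list(signal)
    peaks_per_level = []
    for _ in range(levels):
        v = smooth(v)
        peaks_per_level.append(local_maxima(cost(v)))
    confirmed = []
    for k in peaks_per_level[0]:
        chained = True
        for depth, peaks in enumerate(peaks_per_level[1:], start=1):
            coarse = k >> depth  # interval of the coarser level that contains k
            if not any(abs(coarse - p) <= 1 for p in peaks):
                chained = False
                break
        if chained:
            confirmed.append(2 * k)  # first original sample covered by interval k
    return confirmed


step_signal = [0.0] * 40 + [1.0] * 24   # one jump at index 40
flat_signal = [0.3] * 64                # no jump
detected = detect_changes(step_signal)
```

Requiring a detection to be present at every level plays the role of the chains of admissible singularities described above: an isolated fluctuation that disappears after smoothing is discarded.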

3 Experimental Validation

In this section we present the experimental protocol used to validate the proposed algorithm. To validate the quality of our algorithm in a controlled environment, we tested it on artificial data-sets and compared the results with those of state-of-the-art algorithms. The computations were run on a Windows 10 (x64) machine with 16 GB of RAM and a dual-core CPU clocked at 2.50 GHz (i5-2450M).

3.1 Artificial Data-Sets

To generate the artificial data-set we considered three categories of behavior: the user’s behavior changes over time into a totally new behavior, the user’s behavior changes over time into a partially different behavior, and the user does not change behavior. We generated 10000 signals for each of these categories of users. To construct a signal, we first generated two sets of 1 to 5 random labels each, representing the possible behaviors before and after the change. Only one set is created to simulate the absence of change, and to simulate a partial change we forced the two sets to share 1 or 2 labels. We simulated a period of two months. A hundred random time-stamps were generated over this period. Each time-stamp was associated with a label from the first or the second set, depending on a randomly chosen date of change.
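The generation protocol above can be sketched as follows. The details that the text leaves open are filled with assumptions: two months are taken as 60 days, labels are drawn from a hypothetical vocabulary of 20 integers, time-stamps and the date of change are drawn uniformly, and the second set always contains at least one new label. The function name `simulate_user` is hypothetical.

```python
import random


def simulate_user(kind, rng, n_days=60, n_events=100, n_labels=20):
    """Return (events, change_day) for kind in {'full', 'partial', 'none'}.

    events is a chronologically sorted list of (timestamp_in_days, label).
    """
    vocabulary = list(range(n_labels))
    before = rng.sample(vocabulary, rng.randint(1, 5))
    if kind == "none":
        after, change_day = before, None
    else:
        unused = [label for label in vocabulary if label not in before]
        if kind == "partial":
            # The two sets share 1 or 2 labels.
            shared = rng.sample(before, min(len(before), rng.randint(1, 2)))
        else:
            shared = []
        # Assumption: at least one new label, and at most 5 labels in total.
        n_new = rng.randint(1, 5 - len(shared))
        after = shared + rng.sample(unused, n_new)
        change_day = rng.uniform(0, n_days)
    events = []
    for _ in range(n_events):
        t = rng.uniform(0, n_days)
        pool = before if change_day is None or t < change_day else after
        events.append((t, rng.choice(pool)))
    events.sort()
    return events, change_day


rng = random.Random(1)
full_events, full_day = simulate_user("full", rng)
stable_events, stable_day = simulate_user("none", rng)
```

The returned events can then be turned into a behavior signal with a sliding window, and the known date of change serves as the ground truth for the error measure.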

Fig. 1. Example of simulated signals of user’s behaviors. The arrows indicate the detected changes (if any).

Figure 1a is an example of a simulated signal for a user who expressed a full change of behavior. The horizontal axis is the time-stamp (days) and the vertical axis is the JS dissimilarity with respect to the reference window. As can be seen, the JS increases from the 22nd to the 29th day. Then the signal keeps a value of 1 from the 30th day onward, as there is no intersection between the reference distribution and the distributions from the 30th day. In the case of a partial change, the user expresses new behaviors in addition to some of the previous ones, for example a user who relocates to a new house but keeps the same workplace as before. Figure 1b shows such a case. This time, the signal never reaches 1, as there are still some similarities before and after the change. The change is nonetheless correctly detected by the algorithm. The last case is when there is no change to detect for a user. In this case there is no significant difference between the reference window and the shifted windows, and the signal stays steady over time. The example shown in Fig. 1c demonstrates that the dissimilarity computed by the Jensen-Shannon divergence stays low. The signal is stable over the whole period, with no notable change. This case can describe users who keep a regular activity without any notable variations.

3.2 Experimental Results

To demonstrate the effectiveness of the proposed approach, we evaluated its performance in terms of computation time. In addition, to validate the quality of the detected changes, we computed the mean absolute difference between the detected and the expected date of change, and we compared it to a set of state-of-the-art algorithms: Jump penalty, PWC bilateral, Robust jump penalty, Soft mean-shift, Total variation, Robust TVD and Medfiltit (see [13]). Table 1 presents the comparison between the proposed algorithm and the 7 state-of-the-art algorithms. For each of the three categories (no change, full change and partial change in the user’s behavior), the table reports the computation time in seconds and the error. The proposed algorithm has the lowest computation time in all three categories: it is (usually by far) the fastest approach for this task. Moreover, the error of the proposed method is the lowest among all. This is especially true for stable signals, when there is no change to detect. In that case, the multi-scale approach performs very well at smoothing the whole signal and removing all random variation. Overall, the quality of the proposed approach is very satisfying and it should be able to deal with real data in more complex applications.

Table 1. Experimental results

Algorithm             No change            Full change          Partial change
                      time (s)   error     time (s)   error     time (s)   error
Proposed              0.85       2.78      0.94       1.67      0.87       2.07
Jump penalty          29.4       14.33     25.26      4.11      31.93      4.47
PWC bilateral         83.87      14.31     12.83      3.77      18.71      4.19
Robust jump penalty   8.43       14.26     93.48      4.02      90.63      4.57
Soft mean-shift       8.57       16.57     21.99      3.6       21.5       4.56
Total variation       45.55      13.12     103.44     3.27      116.8      4.02
Robust TVD            7016.69    14.82     4405.04    3.8       4390.13    4.61
Medfilt.it            1.32       13.92     0.98       2.95      1.29       3.93

4 Application

To follow the real changes in individual interest based on the data provided by our project partner, Mindlytix, we used a data-set of the navigation logs of 142794 users, giving for each user a list of time-stamps associated with the URL visited at that time, over a period of 30 days. Based on the result of the URL clustering presented in the previous section (using the contextual similarity), each URL was substituted by a cluster label. This step allows a user navigating between different URLs on the same topic to be considered as having a stable interest. The time window for this data-set is fixed to 5 days, meaning that the distribution of URL labels visited during a 5-day period defines a user behavior. Finally, to follow the change in users’ physical location habits, Mindlytix provided a data-set of geolocations (postal codes) associated with time-stamps over a period of 74 days for 598 users. The objective for these data is to be able to detect when a user relocates to a different location or spends some time outside the usual area. Here, we chose a size of 10 days for the time window to avoid detecting very short trips and unusual displacements.

4.1 Geolocation

In this section, we describe the results obtained on the geolocation data-set provided by Mindlytix. We analyzed the changes in behavior of 598 users over 74 days. In the signal creation step, we used a window with a size of 10 days. We observe some variety in the signals, but there are still some characteristic patterns. Figures 2a and b illustrate signals characteristic of a clear relocation. In Fig. 2a, the Jensen-Shannon dissimilarity increases sharply for two days, stays stable for three days, then rises suddenly again. Two changes are detected, the first being a partial change. This kind of signal can be interpreted as a move in two steps, with a period where the user spends time in both locations before moving definitively. Figure 2b is another example of the relocation of a user, and a good example of a simple change in the user's location. However, this time we observe a short period where the user spends some time in their previous location.
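The construction of such a signal can be sketched as follows. This is only an illustration under stated assumptions: the reference window is taken to be the first window, the other windows are shifted by one day, the Jensen-Shannon divergence is computed in base 2, and all function names and the toy postal codes are hypothetical.

```python
import math
from collections import Counter


def distribution(labels, support):
    """Relative frequency of each label of `support` in `labels`."""
    if not labels:
        raise ValueError("a window must contain at least one record")
    counts = Counter(labels)
    return [counts[s] / len(labels) for s in support]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in base 2, which lies in [0, 1]."""
    def kl(a, b):
        # Terms with a probability of zero contribute nothing
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dissimilarity_signal(daily_labels, window=10):
    """Compare the first window with every window shifted by one day."""
    support = sorted({lab for day in daily_labels for lab in day})

    def pooled(start):
        return [lab for day in daily_labels[start:start + window] for lab in day]

    reference = distribution(pooled(0), support)
    return [
        jensen_shannon(reference, distribution(pooled(start), support))
        for start in range(len(daily_labels) - window + 1)
    ]


# Toy user: 20 days in one postal code, then 20 days in another one
days = [["75001"]] * 20 + [["69001"]] * 20
signal = dissimilarity_signal(days, window=10)
```

For this toy user the signal stays at zero while the shifted window only covers the first location, then rises as the window slides over the relocation, which is the kind of jump that the detection step looks for.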

Fig. 2. Examples of signals obtained during a user's relocation or a temporary displacement (trip). The arrows indicate the detected changes.


Users from another category do not move at all during the recorded period, neither to relocate nor to go on a trip or vacation. The dissimilarity between the reference window and the shifted windows stays low all the time. Another interesting case is when the user leaves for a vacation or for work for some time, before returning to the place where he/she used to live. Figures 2c and d show two examples of this case. In Fig. 2c, the user starts to move around the 10th day. The dissimilarity between the reference window and the shifted window rises sharply until the 15th day. Then, this dissimilarity decreases rapidly until the distribution is again the same as in the reference window. It means that this user spent 10 days (the size of the time windows) in another place before coming back. Another example is presented in Fig. 2d, which shows a clear case of a user leaving for a 3-week trip and returning to his/her initial place. In both examples, the two changes are correctly detected.

4.2 Individual Interest

To follow the change in users' interest, we used the information in the users' navigation logs. We have the URLs visited during 30 days by 142794 users. Each URL has been associated with a cluster in the previous section, so a user's navigation can be expressed as a distribution of visited clusters that varies over time.
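The substitution of URLs by cluster labels and the resulting sequence of distributions can be sketched as below. The log format (seconds since the start of the recording), the one-day shift between windows, the function name and the example URLs are all assumptions made for illustration.

```python
from collections import Counter

SECONDS_PER_DAY = 86_400


def windowed_cluster_distributions(log, url_to_cluster, window_days=5, n_days=30):
    """Return one {cluster: share} dict per window, windows shifted by one day.

    `log` is a list of (timestamp_in_seconds, url) pairs; URLs that are
    missing from `url_to_cluster` are ignored.
    """
    per_day = [[] for _ in range(n_days)]
    for timestamp, url in log:
        day = int(timestamp // SECONDS_PER_DAY)
        if 0 <= day < n_days and url in url_to_cluster:
            per_day[day].append(url_to_cluster[url])

    distributions = []
    for start in range(n_days - window_days + 1):
        labels = [lab for day in per_day[start:start + window_days] for lab in day]
        counts = Counter(labels)
        total = sum(counts.values())
        # An empty dict marks a window without any recorded visit
        distributions.append({c: n / total for c, n in counts.items()} if total else {})
    return distributions


# Hypothetical clustering result and navigation log
url_to_cluster = {
    "a.example/news": "news",
    "b.example/politics": "news",
    "c.example/shoes": "shopping",
}
log = [
    (0, "a.example/news"),
    (3600, "b.example/politics"),
    (6 * SECONDS_PER_DAY, "c.example/shoes"),
]
dists = windowed_cluster_distributions(log, url_to_cluster)
```

Two different URLs mapped to the same cluster count as the same interest, which is why a user browsing several sites on one topic yields a stable distribution.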

Fig. 3. Examples of stable interest, change in interest and temporary change in interest of users, based on their navigation logs.


Figures 3a to c illustrate the behavior of users who do not change their interest during one month. In all three figures, the signal is either stable or shows only minor variations (undetected by the algorithm). Figures 3d to f are examples of results for the detection of change in individual interest, where the users change their interest over time. As can be seen in Fig. 3d, the signal of this user remains stable for 14 days, then starts to rise sharply as the user starts to navigate in other categories of URLs. In Figs. 3e and f, the change is more gradual before reaching a state of interest fully different from the reference window. These three figures are typical examples of the different patterns of change in a user's interest. A third category of observed behavior is a group of users who change their interest for a limited period and then return to their initial interest. Figures 3g to i illustrate this type of user. These signals go up, stay stable over a period of time and then go down. It means that the dissimilarity between the reference window and the shifted windows increases for a period of time, but at the end of the recorded period the distribution of visited categories of URLs returns to a distribution similar to the reference distribution. Figure 3i shows a particular example of temporary change, where the user returns to their initial interests in several steps.

5 Conclusion

In this paper, we proposed a new multi-scale change detection algorithm to analyze changes in the individual behavior of users based on their navigation and geolocation data. We first created, for each user, a signal of the evolution of the distribution of the user's online interests and another signal based on the distribution of physical locations recorded during their navigation. Then, by using the signal-based jump detection algorithm, changes in interest or location were detected automatically. We detected different scenarios: during the analyzed period, some users kept the same behavior, some had a clear change in their behavior and some showed a change in their behavior which lasted only a short period of time. Experimental tests performed on simulated signals showed that the proposed approach is faster and makes fewer errors for this task than state-of-the-art algorithms.

References

1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB 2003, vol. 29, pp. 81–92. VLDB Endowment (2003)
2. Arandiga, F., Cohen, A., Donat, R., Dyn, N., Matei, B.: Approximation of piecewise smooth functions and images by edge-adapted (ENO-EA) nonlinear multiresolution techniques. Appl. Comput. Harmonic Anal. 24(2), 225–250 (2008). Special Issue on Mathematical Imaging – Part II
3. Bifet, A.: Adaptive stream mining: pattern learning and mining from evolving data streams. In: Proceedings of the 2010 Conference on Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams, pp. 1–212. IOS Press, Amsterdam (2010)
4. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise, pp. 328–339. Society for Industrial and Applied Mathematics (2006)
5. Chan, T.F., Zhou, H.M.: ENO-wavelet transforms for piecewise smooth functions. SIAM J. Numer. Anal. 40(4), 1369–1404 (2002)
6. Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2007, pp. 133–142. ACM, New York (2007)
7. Claypoole, R.L., Davis, G.M., Sweldens, W., Baraniuk, R.G.: Nonlinear wavelet transforms for image coding via lifting. IEEE Trans. Image Process. 12(12), 1449–1459 (2003)
8. Dagan, I., Lee, L., Pereira, F.: Similarity-based methods for word sense disambiguation. In: Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL 1997, pp. 56–63. Association for Computational Linguistics, Stroudsburg (1997)
9. Gama, J.: Knowledge Discovery from Data Streams, 1st edn. Chapman & Hall/CRC, Boca Raton (2010)
10. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005)
11. Last, M.: Online classification of nonstationary data streams. Intell. Data Anal. 6(2), 129–147 (2002)
12. Lipman, Y., Levin, D.: Approximating piecewise-smooth functions. IMA J. Numer. Anal. 30(4), 1159–1183 (2009)
13. Little, M.A., Jones, N.S.: Generalized methods and solvers for noise removal from piecewise constant signals. I. Background theory. Proc. Roy. Soc. A 467(2135), 3088–3114 (2011)
14. MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, New York (2002)
15. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
16. Rastin, P., Matei, B.: Prototype-based clustering for relational data using Barycentric coordinates. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IJCNN 2018 (2018)
17. Rastin, P., Zhang, T., Cabanes, G.: A new clustering algorithm for dynamic data. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) ICONIP 2016. LNCS, vol. 9949, pp. 175–182. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46675-0_20
18. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., de Carvalho, A.C., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 13–31 (2013)

Extraction and Localization of Non-contaminated Alpha and Gamma Oscillations from EEG Signal Using Finite Impulse Response, Stationary Wavelet Transform, and Custom FIR

Najmeddine Abdennour (1), Abir Hadriche (1,2), Tarek Frikha (3), and Nawel Jmail (4)

1 Department of Electronics and Telecommunication, Institute of Computer Science and Multimedia, ISIMG, University of Gabes, Gabes, Tunisia
2 REGIM Lab, ENIS, Sfax University, Sfax, Tunisia
3 CES Lab, ENIS, Sfax University, Sfax, Tunisia
4 MIRACL Lab, Sfax University, Sfax, Tunisia

Abstract. The alpha and gamma oscillations derived from the EEG signal are useful tools for recognizing cognitive states and several cerebral disorders. However, undesirable artifacts in the electrophysiological signals lead to unreliable results in the extraction and localization of these oscillations. We introduced three filtering techniques, based on Finite Impulse Response (FIR) filters, the Stationary Wavelet Transform (SWT) and a custom FIR filter, to extract the non-contaminated (pure) oscillations and to localize their responsible sources using the Independent Component Analysis (ICA) technique. In our results, we compared the effectiveness of these filtering techniques in extracting and localizing non-contaminated alpha and gamma oscillations. We proposed here the most accurate of these techniques for the extraction of pure alpha and gamma oscillations. We also presented the cortical region responsible for the generation of these oscillations.

Keywords: EEG signal · Oscillation · FIR · SWT · Custom FIR · Source localization

© Springer Nature Switzerland AG 2018. V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11140, pp. 511–520, 2018. https://doi.org/10.1007/978-3-030-01421-6_49

1 Introduction

In order to study human brain activity, we rely on the analysis of electrophysiological signals. Among these recording techniques, the electroencephalogram (EEG) remains one of the reliable ways to investigate the activity of neurons and its impact on our daily tasks, conscious state and medical disorders. Based on its frequency content, this physiological signal is generally classified into five waves: delta (0.5–4 Hz), theta (4–7.5 Hz), alpha (8–13 Hz), beta (14–26 Hz) and gamma (30–45 Hz) [1]. The alpha waves are generally located in the occipital area and are considered the most important cortical waves; they reveal states of relaxation, awareness and absence of concentration. The gamma waves are mostly identified with an active level of cognition and are mostly used for the confirmation of several neurological diseases and malfunctions [2], especially epilepsy. Extracting these frequency bands in a pure way was, and remains, a challenging task, notably when the recorded EEG frequencies cover a wide range (from 0.5 Hz up to 45 Hz and above). Although a variety of filtering techniques exist [3–6], the choice of a consensus filtering method remains a matter of negotiation between several constraints: the signal-to-noise ratio, the level of overlap, the width of spikes and oscillations, etc. An effective separation of the cortical frequency bands would produce non-contaminated oscillatory activities (neuronal generators) and allow a much better analysis of the sources and generators responsible for these activities.
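The band limits quoted above can be captured in a small lookup table, which is convenient when the same limits are reused to configure several filters. The names `EEG_BANDS` and `band_of` are hypothetical, and the treatment of boundary values and of the gaps between bands is an assumption of this sketch.

```python
from typing import Optional

# Band limits in Hz, as quoted in the text
EEG_BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 7.5),
    "alpha": (8.0, 13.0),
    "beta": (14.0, 26.0),
    "gamma": (30.0, 45.0),
}


def band_of(frequency_hz: float) -> Optional[str]:
    """Return the first band whose closed interval contains the frequency.

    Frequencies that fall in a gap between two bands (for instance 28 Hz)
    or outside all bands return None.
    """
    for name, (low, high) in EEG_BANDS.items():
        if low <= frequency_hz <= high:
            return name
    return None
```

With these limits, `band_of(10)` falls in the alpha band and `band_of(40)` in the gamma band, the two bands studied in this paper.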

2 Filtering Techniques

2.1 Finite Impulse Response (FIR): Kaiser Window

The Finite Impulse Response (FIR) filter is a classical technique that preserves both causality and stability. FIR filters are preferred to Infinite Impulse Response (IIR) filters, which are difficult to implement, mainly because of their instability at higher orders [7–9]. FIR filters are commonly designed with the windowing method. Hence, we used the Kaiser window to control the passband ripples in a smoother manner [10]. The Kaiser window (kaiser function in MATLAB) defines the window shape through the β parameter. In our study, we set the filter order to N = 100 and the β window parameter to 3. The passband frequencies Fc1 and Fc2 were set to 8 and 12 Hz for the extraction of the alpha wave, and to 30 and 46 Hz for the gamma wave (fir1 function in MATLAB).
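The settings above can be reproduced approximately with the following sketch, which uses SciPy in place of the MATLAB functions named in the text. The sampling rate of 256 Hz is an assumption (it is not given here), and the function name `kaiser_bandpass` is hypothetical.

```python
import numpy as np
from scipy.signal import firwin, freqz

FS = 256.0    # sampling rate in Hz (assumed, not given in the text)
ORDER = 100   # filter order N, so the filter has N + 1 taps
BETA = 3.0    # Kaiser window shape parameter


def kaiser_bandpass(f_low, f_high, fs=FS, order=ORDER, beta=BETA):
    """Windowed band-pass FIR filter built with a Kaiser window."""
    return firwin(order + 1, [f_low, f_high], window=("kaiser", beta),
                  pass_zero=False, fs=fs)


alpha_taps = kaiser_bandpass(8.0, 12.0)    # alpha wave
gamma_taps = kaiser_bandpass(30.0, 46.0)   # gamma wave

# Magnitude response at one in-band and one out-of-band frequency (Hz)
_, h_alpha = freqz(alpha_taps, worN=np.array([10.0, 40.0]), fs=FS)
_, h_gamma = freqz(gamma_taps, worN=np.array([38.0, 10.0]), fs=FS)
alpha_gain = np.abs(h_alpha)
gamma_gain = np.abs(h_gamma)
```

The coefficients can then be applied to a recorded channel with a standard FIR filtering routine such as `scipy.signal.lfilter(taps, 1.0, x)`. By default `firwin` scales the coefficients so that the gain is exactly one at the center of the passband.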

2.2 A Custom Designed FIR Filter Derived from the Parks-McClellan Algorithm

The Parks-McClellan algorithm is a fundamental way to design equiripple FIR filters [11], based on the Chebyshev approximation [12]. The main advantage of this filter is its ability to minimize the maximum weighted error in both the passband and the stopband [10]. In our study, we set the filter order to N = 100, and the stopband and passband weights to Wstop1 = 100, Wpass = 80 and Wstop2 = 120. The passband and stopband frequencies were the same as the FIR filter settings for both the alpha and the gamma wave extractions.
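A comparable equiripple design can be sketched with SciPy's implementation of the Remez exchange algorithm. The order and the three weights come from the text, while the sampling rate (256 Hz), the 4 Hz transition width used to place the stopband edges, and the function name are assumptions made for this illustration.

```python
import numpy as np
from scipy.signal import remez, freqz

FS = 256.0                 # sampling rate in Hz (assumed)
ORDER = 100                # filter order N, so the filter has N + 1 taps
WEIGHTS = [100, 80, 120]   # Wstop1, Wpass, Wstop2 from the text


def equiripple_bandpass(f_pass_low, f_pass_high, transition_hz=4.0, fs=FS):
    """Equiripple band-pass FIR filter (Parks-McClellan / Remez design).

    `transition_hz` is a hypothetical transition width that fixes the
    stopband edges on each side of the passband.
    """
    bands = [0.0, f_pass_low - transition_hz,
             f_pass_low, f_pass_high,
             f_pass_high + transition_hz, fs / 2]
    desired = [0, 1, 0]    # gain in the first stopband, passband, second stopband
    return remez(ORDER + 1, bands, desired, weight=WEIGHTS, fs=fs)


alpha_taps = equiripple_bandpass(8.0, 12.0)
gamma_taps = equiripple_bandpass(30.0, 46.0)

# Magnitude response of the alpha filter inside and outside its passband
_, h = freqz(alpha_taps, worN=np.array([10.0, 40.0]), fs=FS)
alpha_gain = np.abs(h)
```

A larger weight on a band forces a smaller ripple there, so the second stopband (weight 120) is constrained slightly more tightly than the passband (weight 80).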

2.3 Stationary Wavelet Transform (SWT)

The Stationary Wavelet Transform (SWT) is a wavelet filter based on the Discrete Wavelet Transform (DWT), with the advantage of omitting the binary decimation step of the wavelet transform [3, 13], which allows the properties of the real signal to be retained. The SWT also performs better than the Classical Wavelet Transform (CWT) by overcoming the overlapping of frequency bands. The SWT has also proven very useful in EEG signal analysis [3, 14]. In fact, this technique decomposes a signal s(t) at each scale j and step k, then projects it on the mother