
Discovering Complex Anomalous Patterns

Research Project funded by the National Science Foundation under Grant NSF IIS-0911032

Project Description

Many of the most interesting and valuable discoveries that can be made from data arise not from the evaluation of single records, but from identifying a set of records that are anomalous in some interesting way. Together, they may indicate, for example, the emergence of a disease outbreak or new patterns of criminal activity. One can view pattern discovery as an interactive process between data analysis algorithms and human users who have expertise in the domain. This research will develop an integrated framework of probabilistic methods that interact with the user in detecting, characterizing, explaining, and learning anomalous patterns over groups of records. The focus is on the many situations where the data (and the probabilistic patterns to be discovered) are not suited to existing techniques such as graph mining or frequent sets. The proposed methods will search over arbitrary subsets of records and evaluate their correspondence to known, potentially very complex, probabilistic patterns, or their failure to match baseline data under various learned statistical models. These methods will also assist the user in understanding and modeling discovered, previously unknown anomalies, so that they can be identified as known patterns when encountered in the future.

Project Summary

Project Page at NSF

Project Personnel

Faculty

Graduate Students

Research Staff

Journal Publications and Book Chapters Resulting from Project Activities

Dubrawski A. (2010). Detection of Events in Multiple Streams of Surveillance Data. In Infectious Disease Informatics and Biosurveillance, Eds. D. Zeng, H. Chen, C. Castillo-Chavez, W. Lober, and M. Thurmond. Springer-Verlag, 2010. In press.  http://www.springer.com/public+health/book/978-1-4419-6891-3

Gopalakrishnan V, Lustgarten J, Visweswaran S, Cooper GF. (2010). Bayesian rule learning for biomedical data mining. Bioinformatics 26:668-675, 2010.  http://bioinformatics.oxfordjournals.org/content/26/5/668.abstract

Neill, DB and Cooper, GF. (2010). A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning, p. 261, vol. 79, 2010.  http://www.cs.cmu.edu/~neill/papers/MBSS.pdf

Neill, DB. (2010a). Fast subset scan for spatial pattern detection, Journal of the American Statistical Association, Submitted, under revision.

Neill, DB. (2010b) Fast Bayesian scan statistics for multivariate event detection and visualization, Statistics in Medicine, 2010. Accepted.

Oliveira, D., Neill, DB., Garrett JH. Jr, and Soibelman, L. (2010). Detection of patterns in water distribution pipe breakage using spatial scan statistics for point events in a physical network, Computing in Civil Engineering, Accepted. http://scitation.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=JCCEXX000001000001000048000001&idtype=cvips&gifs=yes&ref=no

Waidyanatha, N., Sampath, C., Dubrawski, A., Prashant, S., Ganesan, M., Gow, G. (2010b). Affordable system for rapid detection and mitigation of emerging diseases. International Journal on E-Health and Medical Communications. In press.

Weerasinghe, C., Waidyanatha, N., Dubrawski, A., Baysek, M., Ganesan M. (2010). T-Cube Web Tool for rapid detection of disease outbreaks in India and Sri Lanka. Sri Lankan Journal of Biomedical Informatics. In press.

Reviewed Conference Publications Resulting from Project Activities

Chen L., Dubrawski A., Dunham, A., Huckabee, M., Kelley L. (2009a) Using Network Diagrams in Support of Food Safety Investigations. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Chen L., Dubrawski A., Sorokina, D. (2009b) Multivariate Analysis for Predicting Risk of Microbial Contamination of Food. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Cooper GF, Hennings-Yeomans P, Visweswaran S, Barmada M. (2010) An efficient Bayesian method for predicting clinical outcomes from genome-wide data. Proceedings of the Fall Symposium of the American Medical Informatics Association (November, 2010).

Dubrawski A., Chen L., Sarkar P. (2009a) Efficient Visualization of Dynamic Networks in Food Safety Analysis. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Dubrawski A., Sabhnani M., Fedorka-Cray P., Kelley L., Gerner-Smidt P., Williams I., Huckabee M., Dunham A. (2009c). Discovering Possible Linkages between Food-borne Illness and the Food Supply Using an Interactive Analysis Tool. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Neill DB. (2009) Fast subset sums for multivariate Bayesian scan statistics, Proceedings of the 2009 International Society for Disease Surveillance Annual Conference.  Available online at www.syndromic.org.

Sabhnani M., Dubrawski A., Schneider J. (2009a) Detection of Disjunctive Anomalous Patterns in Multidimensional Data. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Sabhnani M., Dubrawski A., Waidyanatha N. (2009b) T-Cube Web Interface for Real-time Biosurveillance in Sri Lanka. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Speakman, S. and Neill, DB. (2009). Fast graph scan for scalable detection of arbitrary connected clusters, Proceedings of the 2009 International Society for Disease Surveillance Annual Conference. Available online at www.syndromic.org.

Sverchkov Y, Cooper GF. (2010). Spatial cluster detection using two-dimensional dynamic programming. (submitted and currently under review by a conference).

Waidyanatha, N., Prashant, S., Ganesan, M., Dubrawski, A., Chen, L., Baysek, M., Careem, M., Damendra, P., Kaluarachchi, M. (2010a) Real-Time Biosurveillance Pilot in India and Sri Lanka, IEEE-Healthcom 2010, Lyon, France, July 2010.  http://lirneasia.net/wp-content/uploads/2009/11/Waidyanatha_eAsia2009_web_paper.pdf

Waidyanatha, N., Sampath, C., Dubrawski, A., Sabhnani, M., Chen, L., Ganesan, M., Vincy, P. (2010c) T-Cube Web Interface as a tool for detecting disease outbreaks in real-time: A pilot in India and Sri Lanka. IEEE-RIVF 2010 International Conference on Computing and Telecommunication Technologies, Hanoi, Vietnam, November 2010.

Xiong, L., Chen, X., Huang, T-K., Schneider, J. and Carbonell, J. (2010). Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. Proceedings of SIAM Data Mining Conference (SDM), Columbus, OH, April 2010.  http://www.cs.cmu.edu/~xichen/images/Xi Chen SDM 2010.pdf

Zhang, Y., Schneider, J. and Dubrawski, A. (2010a) Learning Compressible Models. Proceedings of SIAM Data Mining Conference (SDM), Columbus, OH, April 2010.  http://www.ml.cmu.edu/current_students/DAP_zhang.pdf

Zhang, Y. and Schneider, J. (2010b). Projection Penalties: Dimension Reduction without Loss, International Conference on Machine Learning (ICML), Haifa, Israel, June 2010.  http://www.icml2010.org/papers/481.pdf

Other Publications Resulting from Project Activities

Sabhnani, M. (2010). Disjunctive Anomaly Detection: Identifying Complex Anomalous Patterns. Ph.D. Thesis Proposal. Machine Learning Department, Carnegie Mellon University, August 2010.

Sarkar, P. (2010). Tractable Algorithms for Proximity Search on Large Graphs. Ph.D. Thesis, Machine Learning Department, Carnegie Mellon University, May 2010.  http://reports-archive.adm.cs.cmu.edu/anon/ml2010/CMU-ML-10-107.pdf

This material is based upon work supported by the National Science Foundation, grant NSF IIS-0911032. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Our test applications and external partners (in no particular order):

Predictive Analytics Group, Chicago Police Department, http://www.chicagopolice.org/

We analyze spatio-temporal maps of crime records in order to detect complex patterns that may be predictive of future criminal activity.

Ottawa Heart Institute, Canada, http://www.ottawaheart.ca/

We apply our pattern detection methods to public health surveillance data collected through several projects across Canada in order to reliably identify emerging epidemics.

LIRNEAsia, Sri Lanka, http://lirneasia.net/, and the Health Department of Wayamba Province, Sri Lanka

We built an analytic system for public health surveillance that includes our complex anomalous pattern detection technology to identify emerging trends in demand for health services due to escalating infectious, non-infectious, and chronic diseases. We use field data collected through that system in experimental evaluations of algorithms developed in the scope of our project.

Rural Technology and Business Incubator of the Indian Institute of Technology, Madras, India, http://www.rtbi.in/

A variant of the system mentioned above is also being deployed and tested in the state of Tamil Nadu, India.

Systems Lifecycle Integrity Management initiative, Headquarters, United States Air Force, http://www.af.mil/

Anomalous patterns of maintenance and logistics activity involving fleets of aircraft are often indicative of emerging supply crises, which may limit the availability of equipment and increase operating costs due to expediting. Our anomalous pattern detection technology is helping to detect such crises in their early stages in order to limit their negative impact.

Department of Astronomy, University of Washington, http://www.astro.washington.edu/,  and Department of Physics and Astronomy, Johns Hopkins University, http://physics-astronomy.jhu.edu/

We support scientific discovery by enabling detection of anomalous groups of celestial objects in telescope observations and in astrophysical simulation data.

Department of Civil and Environmental Engineering, Carnegie Mellon University, http://www.ce.cmu.edu/

We support analyses of water distribution systems by detecting clusters of apparent pipe breakages in water consumption data.

MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care) database http://www.physionet.org/mimic2/

This database contains comprehensive clinical data from tens of thousands of Intensive Care Unit (ICU) patients. We use an extract from it as a primary test-bed for the analytic algorithms we develop.

Outreach activities

We are actively involved in disseminating the results of our work, with particular attention to reaching audiences of possible future collaborators outside the area of computer science. The resulting presentations are listed below in chronological order.

  1. Interactive Analysis of Multiple Streams of Data in Public Health and Food Safety Applications. Dubrawski, A. and Kelley L., PulseNet/OutbreakNet Annual Meeting, Snowbird, UT, September 23, 2009.
  2. Browsing and Analysis of Multidimensional Crisis Data at Interactive Speeds. Dubrawski, A. International Conference on Crisis Mapping, Cleveland, OH, October 16th, 2009.
  3. Trade-offs between Agility and Reliability of Predictions in Dynamic Social Networks Used to Model Risk of Microbial Contamination of Food, Dubrawski, A. Lawrence Livermore National Laboratory, October 2009.
  4. Fast subset sums for multivariate Bayesian scan statistics. Neill, D.B. International Society for Disease Surveillance Annual Conference, Miami, FL, December 2009.
  5. Interactive Analysis of Multidimensional Data for Adverse Event Detection, Dubrawski, A. University of Peradeniya, Sri Lanka, December 21st, 2009.
  6. Fast graph scan for scalable detection of arbitrary connected clusters. Speakman, S. and Neill, D.B. International Society for Disease Surveillance Annual Conference, Miami, FL, December 2009.
  7. Bayesian Outbreak Detection and Characterization. Cooper, G., International Society for Disease Surveillance Webinar on Applications of Bayesian Statistics for Biosurveillance, January 28, 2010.
  8. Analytics and Business Intelligence Current Trends, Opportunities, and Challenges. Dubrawski, A., The Emergent Technologies Program, CMU Tepper School of Business, February 2010.
  9. Fast subset scanning for multivariate event detection. Neill, D.B. ENAR 2010 Annual Meeting, New Orleans, LA, March 2010.
  10. GIVAS Analytics: Conceptual, Technical and Strategic Opportunities. Dubrawski, A. United Nations Global Impact and Vulnerability Alert System Blue Sky Thinkers Workshop, Bellagio, Italy, April 2010.
  11. An Efficient Bayesian Method for Predicting Clinical Outcomes from Genome-Wide Data. Cooper, G., Department of Human Genetics Spring Seminar Series. April 16, 2010.
  12. Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. Xiong, L. SIAM Data Mining Conference (SDM), Columbus, OH, April 2010.
  13. Learning Compressible Models, Zhang, Y. SIAM Data Mining Conference (SDM), Columbus, OH, April 2010.
  14. Rapid Detection of Emerging Crises. Dubrawski A., The 11th Annual International Conference on Digital Government Research, Puebla, Mexico, May 2010.
  15. Projection Penalties: Dimension Reduction without Loss, Zhang, Y. International Conference on Machine Learning (ICML), Haifa, Israel, June 2010.
  16. Fast Generalized Scan for Anomalous Pattern Detection. McFowland III, E. 16th Conference for African American Researchers in the Mathematical Sciences, Baltimore, MD, June 2010.
  17. T-Cube Web Interface in RTBP: A Review of R&D Challenges, Dubrawski, A., RTBP Workshop, Chennai, India, July 7th 2010.
  18. Fast subset sums for scalable Bayesian detection and visualization. Neill, D.B. Fifth International Workshop on Applied Probability, Madrid, Spain, July 2010.
  19. Fast generalized subset scan for anomalous pattern detection. McFowland III, E., Speakman, S., and Neill, D.B. INFORMS 2010, Austin, TX, November 7th 2010.
  20. Discovering Complex Anomalous Clusters using Disjunctive Anomaly Detection Algorithm, Sabhnani, M., Dubrawski A., and Schneider J., INFORMS 2010, Austin, TX, November 7th 2010.
  21. Scalable Detection of Anomalous Patterns with Connectivity Constraints. Speakman, S., McFowland III, E., and Neill, D.B. INFORMS 2010, Austin, TX, November 7th 2010.
Copyright 2010, Carnegie Mellon University, Auton Lab. All Rights Reserved.