International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X

Similar documents
Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

CHAPTER IV PREPROCESSING & FEATURE EXTRACTION IN ECG SIGNALS

Quick detection of QRS complexes and R-waves using a wavelet transform and K-means clustering

CNV PCA Search Tutorial

The North Carolina Health Data Explorer

Numerical Integration of Bivariate Gaussian Distribution

CHAPTER 5 WAVELET BASED DETECTION OF VENTRICULAR ARRHYTHMIAS WITH NEURAL NETWORK CLASSIFIER

An Edge-Device for Accurate Seizure Detection in the IoT

Cognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence

SHORETEL APPLICATION NOTE

Analyzing Spammers Social Networks for Fun and Profit

Minimum Feature Selection for Epileptic Seizure Classification using Wavelet-based Feature Extraction and a Fuzzy Neural Network

Heart Rate Calculation by Detection of R Peak

Large Scale Predictive Analytics for Chronic Illness using FuzzyLogix s In-Database Analytics Technology on Netezza

Detection of pulmonary abnormalities using Multi scale products and ARMA modelling

Warfarin Help Documentation

COMP9444 Neural Networks and Deep Learning 5. Convolutional Networks

Big Data Anonymisation Techniques

Neural Network based Heart Arrhythmia Detection and Classification from ECG Signal

An Efficient Service Rating Forecasting by Exploring Social Mobile User s Geographical Locations

ONE MONITORING SOLUTION TO MEASURE MORE. Zephyr Performance Systems for academics and researchers

Characterization of 3D Gestural Data on Sign Language by Extraction of Joint Kinematics

Assignment Question Paper I. 1. What is the external and internal commands and write 5 commands for each with result?

Error Detection based on neural signals

Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter Detection

BlueBayCT - Warfarin User Guide

Visual and Decision Informatics (CVDI)

Mammogram Analysis: Tumor Classification

Chapter 1: Managing workbooks

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

An IoT-based Drug Delivery System for Refractory Epilepsy

Cleveland Clinic Home Monitoring Pilot

CIM1309: vcat 3.0: Operating a VMware Cloud

New York State Immunization Information System (NYSIIS) New York State Department of Health. Digital Government: Government to Business (G to B)

Modeling Sentiment with Ridge Regression

Cortex Gateway 2.0. Administrator Guide. September Document Version C

Development of novel algorithm by combining Wavelet based Enhanced Canny edge Detection and Adaptive Filtering Method for Human Emotion Recognition

A COMPREHENSIVE APPROACH TO MANAGING DIABETES

Australian Journal of Basic and Applied Sciences

Contour Diabetes app User Guide

Customer Guide to ShoreTel TAPI- VoIP Integrations. March

Epileptic seizure detection using EEG signals by means of stationary wavelet transforms

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Blood Glucose Monitoring System. Copyright 2016 Ascensia Diabetes Care Holdings AG diabetes.ascensia.com

D9995 and D9996 ADA Guide to Understanding and Documenting Teledentistry Events

Removal of Baseline wander and detection of QRS complex using wavelets

EPILEPTIC SEIZURE DETECTION USING WAVELET TRANSFORM

Hemodynamic Analysis System

Incorporation of Imaging-Based Functional Assessment Procedures into the DICOM Standard Draft version 0.1 7/27/2011

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

Process Mining to enhance security of Web information systems

Six Sigma Glossary Lean 6 Society

Unit 1 Exploring and Understanding Data

A Sleeping Monitor for Snoring Detection

2016 Environment, Health and Safety (EHS) Research Vision & Agenda Matthew Littlefield & Peter Bussey

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process

Unsupervised MRI Brain Tumor Detection Techniques with Morphological Operations

Complex just became comfortable.

Stepwise method Modern Model Selection Methods Quantile-Quantile plot and tests for normality

Agent-Based Models. Maksudul Alam, Wei Wang

DMA will take your dental practice to the next level

Mapping fear of crime dynamically on everyday transport: SUMMARY (1 of 5) Author: Reka Solymosi, UCL Department of Security & Crime Science

Classification of ECG Data for Predictive Analysis to Assist in Medical Decisions.

ITIL v3 Service Management as a Practice

Cognizanti. You Don t Join Us; We Join You. Insurance/Health & Wellness VOLUME

An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns

Wavelet Decomposition for Detection and Classification of Critical ECG Arrhythmias

SISG 2018 Module 12. Adrienne Stilp. July 19, 2018

PHONOCARDIOGRAM SIGNAL ANALYSIS FOR MURMUR DIAGNOSING USING SHANNON ENERGY ENVELOP AND SEQUENCED DWT DECOMPOSITION

Predicting Breast Cancer Survivability Rates

FACT SHEET Q Fever and the Australian Q Fever Register

11. NATIONAL DAFNE CLINICAL AND RESEARCH DATABASE

Nov versus Fam. Fam 1 versus. Fam 2. Supplementary figure 1

II. NORMAL ECG WAVEFORM

Chapter 1: Exploring Data

A Common-Sense Framework for Assessing Information-Based Counterterrorist Programs

Section 6: Analysing Relationships Between Variables

Version 3 Last Revision Date March Smoking in the Workplace Policy

Innovation. SAGA Project. June 2010 Saga Prefectural Government, Japan

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 2, Issue 10, April 2013

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

Problem set 2: understanding ordinary least squares regressions

OPERATIONALIZING EHS MANAGEMENT

Mammogram Analysis: Tumor Classification

Applying Data Mining for Epileptic Seizure Detection

Cleveland State University Department of Electrical and Computer Engineering Control Systems Laboratory. Experiment #3

IT and Information Acceptable Use Policy

Performance and Saliency Analysis of Data from the Anomaly Detection Task Study

Predicting Heart Attack using Fuzzy C Means Clustering Algorithm

Extraction of Unwanted Noise in Electrocardiogram (ECG) Signals Using Discrete Wavelet Transformation

Global Harmonization Task Force SG3 Comments and Recommendations ISO/DIS 9001: 2000 and ISO/DIS 9000: 2000 And Revision of ISO and 13488

Master of Science in Management: Fall 2018 and Spring 2019

Data Management System (DMS) User Guide

The rise of neurosocial networks

Premature Ventricular Contraction Arrhythmia Detection Using Wavelet Coefficients

A Deep Learning Approach to Identify Diabetes

People and Communities Board. Six principles for engaging people and communities. Definitions, evaluation and measurement

THE data used in this project is provided. SEIZURE forecasting systems hold promise. Seizure Prediction from Intracranial EEG Recordings

Section 3 Correlation and Regression - Teachers Notes

HST-582J/6.555J/16.456J-Biomedical Signal and Image Processing-Spring Laboratory Project 1 The Electrocardiogram

Transcription:

CLOUD FILE SHARING AND DATA SECURITY THREATS EXPLORING THE EMPLOYABILITY OF GRAPH-BASED UNSUPERVISED LEARNING IN DETECTING AND SAFEGUARDING CLOUD FILES Harshit Yadav Student, Bal Bharati Public School, Dwarka, New Delhi ABSTRACT As increasingly more enterprises are deploying cloud file-sharing services, this adds a new channel for potential insider threats to company data and IPs. In this paper, we introduce a two-stage machine learning system to detect anomalies. In the prelim stage, we anticipate the entrance logs of cloud record sharing administrations onto relationship diagrams and use three complementary graph-based unsupervised learning methods: OddBall, PageRank and Local Outlier Factor (LOF) to generate outlier indicators. In the second stage, we outfit the exception pointers and present the discrete wavelet change (DWT) strategy, and propose a method to utilize wavelet coefficients with the Haar wavelet capacity to distinguish anomalies for insider danger. The proposed system has been deployed in a real business environment, and demonstrated effectiveness by selected case studies. Keywords discrete wavelet transform; Haar wavelet; insider threat detection; cloud file-sharing; graphbased unsupervised learning I. INTRODUCTION Enterprise file sharing, whether on the cloud (public, hybrid, or private) or on-premises, enables individuals to share files from mobile devices and PCs. Sharing can happen between people (for example, partners and customers) within or outside the organization, as well as between applications. The offerings enable modern user productivity and collaboration scenarios for the creation of a digital workplace [Basso et al., 2016]. However, cloud file-sharing creates a new channel for insider threats, including the leakage and exfiltration of business strategies, customer information, personal identifiable information (PII), company IPs, source code, etc. Depending on the vendors that provide enterprise file sharing services, certain levels of security and data protection in cloud services are offered. These include password protection, data encryption, data loss prevention, disaster recovery management, and security analytics. 39

However, most of these are rule-based, and generate a lot of alerts with a high false positive rate. In addition, many of the alerts are related or duplicated, which overwhelms the capacity of the security operations center (SOC) to investigate all of them. In this examination, we build up a two-arrange machine learning framework to naturally recognize client abnormal conduct dependent on the entrance logs of cloud document sharing frameworks and feed a little specific gathering of clients to the SOC for examination. The architecture of our proposed machine learning system is shown in Fig. 1. Fig. 1. Architecture of machine learning system for cloud file-sharing outlier detection Specifically, we use graph-based OddBall [Akoglu et al., 2010, Akoglu et al., 2015] and PageRank [Page et al., 1999] algorithms, as well as the density-based Local Outlier Factor (LOF) algorithm [Breunig et al., 2000]. Each algorithm will generate an outlier indicator for every user based on their behavior relative to the others on a daily basis. The reason we use three different algorithms is that they may capture different characteristics as shown in Section 5. Then we accumulate the daily outlier indicators to create time series for each user and use it to detect outliers from a temporal perspective. To use the benefits of multiple complementary indicators, and avoid the weaknesses of using a single indicator, we follow the suggestions of [Zimek et al., 2014], and adopt the average function to form an ensemble of outlier indicators from Oddball, PageRank, and LOF. In order to identify suspicious insiders from the ensemble score time series, we propose to use the wavelet transform method, which is inspired by [Bilen and Huzurbazar, 2002; Grané and Veiga, 2010]. To the best of our insight, we have not seen research applying this method to the abnormal file sharing problem in the cyber security insider threat detection domain. In summary, our main contributions are: We create a machine learning system to detect behavioral outliers in cloud file-sharing for the purpose of malicious insider detection. 40

We project the audit trail data into relationship graphs and apply three complementary unsupervised learning algorithms to generate outlier indicators by comparing different users behaviors at a particular point of time. We introduce a wavelet transform approach to take into account users temporal behavior as well as the severe scale, and to form a single wavelet-based risk score. The rest of the paper is sorted out as pursues. Section II briefly describes DWT and the Haar wavelet funcion, which is the core method in our study. Section III shows experimental results. Section IV concludes the whole paper. II. WAVELET TRANSFORM A. Discrete Wavelet Transform A wavelet transform maps a one-dimensional function into a two-dimensional function (e.g. time and scale ). Formally, in a wavelet transform analysis application, the first step is to adopt a wavelet function, or mother wavelet. At that point, fleeting examination is performed with a contracted, high-recurrence variant of the mother wavelet, while recurrence investigation is performed with an enlarged, low-recurrence adaptation of a similar wavelet. Because the original function can be represented in terms of a wavelet expansion, data operations can be performed using just the corresponding wavelet coefficients. DWT refers to wavelet transforms for which the wavelets are discretely sampled, and in our study, we focus on the DWT. The generic form of a one-dimensional DWT is shown in Fig. 2 (from goo.gl/wm1eqz). Here an original function X(n) is passed through a high and low frequency filters, h and g, respectively. Both h and g are self-orthogonal and are orthogonal to one another. Then we apply DWT by a down- sample factor of 2, and we get the level 1 coefficients d11(n) from high frequency filter, and d10(n) from low frequency filter g. Multiple levels of the wavelet transform are made by repeating the filtering and decimation process on the low filter outputs only (i.e. di0(n)). The process is typically carried out for a finite number of levels K. The resulting coefficients di1(n), i {1 K} and dk0(n) are called wavelet coefficients. 41

Fig. 2. Discrete wavelet transform In our problem, each user has multiple time series from different outlier indicators (i.e., OddBall, PageRank and LOF), and we generate an ensemble of outlier indicators with the average function. Then based on the one-dimensional ensemble time series we apply DWT to detect outliers. B. The Haar Wavelet Function Wavelets are defined by the wavelet function ƒ(t) (i.e. the mother wavelet) and scaling function φ(t) (also called the father wavelet). The ƒ(t) is a function used to divide a given function into different scale components, with some particular properties, such as: wave-like oscillation, integral value is zero, and the integral value of its square format is unity. II. EXPERIMENT RESULTS AND DISCUSSIONS In our experiment, we use 3 months of Symantec Box cloud file-sharing access log data. The input dataset has 22,016 observations from 688 unique users. In this section, we will elaborate our results and provide discussions. A. Correlations between Outlier Indicators and ensemble Ideally all three outlier indicators (i.e. OddBall, LOF, and PageRank) would be positively correlated, giving us more confidence in them. Table I shows the correlation matrix. For OddBall and PageRank, the correlation coefficient value is about 0.2703. The other two pairs are smaller, but still positive. The weak correlations between OddBall and LOF scores implies that combining these outlier indicators may yield better detection performance. Therefore, we take the arithmetic ensemble of arithmetic averaging Oddball, LOF, and PageRank outlier indicators and create a new time series. This new time series generates stronger correlation with each of the individual outlier indicators. The DWT will be applied to the new time series to generate the final outlier score. 42

TABLE I. CORRELATION MATRIX B. Time Series Patterns of Outlier Indicators and Wavelet Risk Score In this section, we randomly sample two users and look into details of the trending relationship between the ensemble outlier indicator and the Wavelet risk score to make sure that wavelet risk score is able to detect changes in the ensemble outlier indicator time series and effectively flag potential insider threat. Fig. 3. Trend plots of Ensemble score and Wavelet risk score (over 32 days) Fig. 3 shows the two users daily ensemble values, and the Wavelet risk scores over a 32-day period. Note that Ensemble series use the vertical axis on the left, while the Wavelet risk score uses the vertical axis on the right. The figure shows that the Wavelet risk scores are closely correlated with Ensemble scores and the Wavelet risk scores show big spikes when the ensemble value increases significantly. However, the more important characteristic is that the wavelet risk score can detect subtle changes in the ensemble outlier indicator, for example for user 1, the ensemble outlier indicator looks like having a regular fluctuation during days 21-25 as in the 43

other time intervals (the green line of the top chart in Fig 3), but there is a much clearer indication of the wavelet score (the red line of the top chart in Fig 3). And later investigation proves that this is a significant detection. It is this that demonstrates the benefit of using DWT to generate wavelet risk score for potential insider threat detections. By adjusting the wavelet risk score threshold, we can send a very small number of alerts for SOC to investigate. In summary, the Wavelet risk score from the first level coefficients of the Haar DWT is able to capture the patterns in both the outlier indicator feature space and temporal space and we propose to use that to flag risky users. III. CONCLUSIONS AND DISCUSSIONS In this paper, we propose a novel two-organize machine learning framework which use cloud record sharing access information to consequently recognize client atypical conduct for potential insider risk. In the first stage, we project the access log data onto user/file and user/user relationship graphs and apply three unsupervised algorithms to generate outlier indicators. In the second stage, we ensemble the outlier indicators and introduced the DWT with the Haar wavelet function to model users temporal behavior for insider threat detection. Our experiment results demonstrate that the first level coefficients of the DWT with the Haar wavelet is able to effectively capture the temporal patterns of the user behaviors. As for future work, we will incorporate more data sources, including but not limited to, active directory logs, windows logon logs, physical security logs, and printer logs, to better detect insider threats holistically. 44