Sunday, June 28, 2020
ANALYSIS OF ADULT DATA SET USING ARTIFICIAL INTELLIGENCE TECHNIQUES - 5225 Words
1.1 PROBLEM DESCRIPTION

The task is to analyze a data set by applying different analysis models to it. In the data set I have chosen, the prediction task is to determine whether a person's income exceeds $50K per year.

1.2 DATA DESCRIPTION

The data set chosen for the analysis of different Artificial Intelligence algorithms is the "Adult Data Set". It was extracted from the 1994 Census Bureau database by Barry Becker. The description of the data set is given in the file "adult.names" in the data folder. The data folder provides two sets with the same type of data, "adult.data" and "adult.test". The former is used for training and the latter for testing. The file "adult.data" contains 32561 records and the file "adult.test" contains 16281 records.

The properties of the data set are as follows:

* Data set characteristics: Multivariate
* Attribute characteristics: Categorical, Integer
* Number of instances: 48842
* Number of attributes: 14
* Missing values: Included

The class variable takes the values >50K and <=50K, whereas the attributes of the data set are as follows: age, workclass, fnlwgt (final weight), education, education-num, marital status, occupation, relationship, race, sex, capital-gain, capital-loss, hours per week and native country.

REASONS FOR CHOOSING THIS DATA SET:

* Size of the data is greater than 40,000 records
* Contains missing values
* 14 attributes

1.3 APPROACH

1.3.1 SOFTWARE USED:

The data set was tested and analyzed with the following two data analytics tools.

1, KNIME: Konstanz Information Miner is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular pipelining concept.
A graphical user interface allows the assembly of nodes for data preprocessing, modeling, data analysis and visualization. KNIME is used in areas like CRM customer data analysis, business intelligence and financial data analytics.

2, WEKA: It is a data analytics platform written in Java that contains a collection of visualization tools and algorithms for data analysis and predictive modeling, with graphical user interfaces. It supports standard data mining tasks such as data preprocessing, clustering, classification and regression.

I used the two tools mentioned above for data analysis by applying four different algorithms. Other tools that can be used for this purpose include RapidMiner, R, Orange and NLTK.

1.3.2 DISCUSSION OF MODELS APPLIED:

The following four models are applied for the analysis of the data.

1, NAÏVE BAYES: Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem. It is a technique for constructing classifiers, that is, models that assign class labels to problem instances. Naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

2, DECISION TREE: A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal.

3, K-MEANS ALGORITHM: K-Means is a method for cluster analysis in data mining. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
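The K-Means description above can be illustrated with a minimal sketch in plain Python. This is not the KNIME node's implementation: the function name `kmeans`, the choice of the first k points as initial centers, and the iteration cap are all assumptions made to keep the example short and deterministic.

```python
import math


def kmeans(points, k, max_iterations=20):
    """Partition points into k clusters by alternating two steps:
    assign each point to its nearest center, then move each center
    to the mean of the points assigned to it."""
    # Assumption: use the first k points as initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop when no point changed cluster since the previous pass.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


if __name__ == "__main__":
    # Made-up 2-D points, not taken from the Adult data set.
    sample = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
    centers, labels = kmeans(sample, k=2)
    print(centers)
    print(labels)
```

The result depends on the initial centers, which is why tools usually expose the number of clusters and the initialization as settings.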
4, ARTIFICIAL NEURAL NETWORK: In machine learning, an ANN is a statistical learning model inspired by biological neural networks (the central nervous system, in particular the brain). It is used to estimate functions that depend on a large number of inputs and are generally unknown. ANNs are presented as systems of interconnected neurons which send messages to each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning.

Analysis based on these four models is explained later.

1.4 DATA PREPROCESSING:

The following nodes are used for data preprocessing in the four different models applied.

1, FILE READER: Two File Reader nodes are used, one to load the training data file and one to load the test data file.

2, STRING REPLACER: In the training data the salary label ends in "50K", but in the test data it ends in "50K." (with a trailing period). To convert "K." to "K" I used the String Replacer, which performs pattern/value replacement.

3, MISSING VALUE: This node handles missing values found in the cells of the input table by performing one of these operations: do nothing, remove row, min, max or mean, most frequent value, or fixed value. In this data set I removed all rows that contained any missing values.

4, CATEGORY TO NUMBER: This node creates new columns after converting every string to an integer.

5, COLUMN RENAME: The learner requires columns to hold values of type double. Column Rename changes the name or type of a column. I used it to change the type, converting integers into doubles.

6, NORMALIZER: This node normalizes all values of numeric columns using one of three methods: min-max normalization, Z-score normalization and normalization by decimal scaling. I used min-max normalization to normalize the data.

7, COLUMN FILTER: It removes the columns that are not required. I excluded the string columns, as they had been converted into double.
8, COLOR MANAGER: It assigns a unique color to each cluster.

2.1 EXPERIMENTAL RESULTS AND ANALYSIS

2.1.1 EXPERIMENTAL SETUP

I used KNIME to build each model and obtain its results. I then used WEKA to identify the attributes that contributed little towards predicting the class.

* NAÏVE BAYES

The File Reader reads the data and is followed by a Missing Value node. Using this node, I removed every row that contained missing values from the data set. This was followed by the Normalizer node, which was used to normalize all values into the range 0 to 1. The model was then trained by the Learner node. The same procedure was carried out on the test data, with the addition of a String Replacer node. This node was needed because in the training set Salary had values ending in 'K', while in the test set it had values ending in 'K.'. To remove this difference and to make sure there were no errors in the evaluation, this node was used. The Predictor node was then applied, taking the normalized test data and the trained model as inputs. This was followed by a Scorer node, which displayed the Overall Accuracy and the Confusion Matrix of the model. Fig 1.1 displays the accuracy statistics obtained by the Naïve Bayes scorer.

Fig 1.1

* DECISION TREE

The Decision Tree model followed the same path as the Naïve Bayes model, except that it used the Decision Tree Learner and Predictor (which play the same role as the Naïve Bayes Learner and Predictor). The String Replacer node was also used, and a Scorer was used to display the Confusion Matrix and Overall Accuracy of the model. Another difference between the Decision Tree model and the Naïve Bayes model is that I did not apply a Normalizer to the main model, as the algorithm itself does not require normalized inputs.
However, I also tested the model with a Normalizer to compare the results with and without it. Fig 1.2 shows the accuracy statistics obtained.

Fig 1.2

* K-MEANS

The main difference between the K-Means model and the other three models was that I did not place the Normalizer, Learner and Predictor nodes. Instead, I placed a K-Means node, which outputs the cluster centers for a predefined number of clusters. The algorithm uses Euclidean distance on the selected attributes. This was followed by a Color Manager and the Scatter Plot node, which creates a scatter plot of two selectable attributes. Each data point is displayed as a dot at the position given by its values of the selected attributes. The dots are displayed in the color defined by the Color Manager. An Interactive Table node was also placed, which displays the entire data in table format. Fig 1.3 displays the scatter plot and the interactive table obtained.

Fig 1.3

Fig 1.4 displays the designs of the three models discussed so far.

Fig 1.4

* ARTIFICIAL NEURAL NETWORK

The model starts off the same as the previous two models discussed. However, after the Missing Value node, a Category to Number node was placed to convert all the categorical data into quantitative data, since this is one of the major requirements of the ANN model. Next, the data was passed through the Column Rename node, which was used to convert Integer columns to Double type, as the Learner requires columns to hold values of type double. The data was then passed through the Normalizer node, which was followed by the MLP Learner and Predictor nodes, respectively. The data was then passed through the Column Filter, which removed columns that were no longer required, for example the String columns that had already been converted to Double. Finally, the data was passed through a Scorer and a Numeric Scorer. The Numeric Scorer calculated the values displayed in Fig 1.5 between the actual and predicted v...
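The preprocessing chain described for the ANN model (remove rows with missing values, Category to Number, integer-to-double conversion, min-max normalization) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reproduction of the KNIME nodes: the function names are hypothetical, missing cells are assumed to be marked with "?", category codes are assumed to be assigned in order of first appearance, and the sample rows are made up.

```python
def drop_missing(rows, marker="?"):
    """Remove every row that contains the missing-value marker (assumed "?")."""
    return [row for row in rows if marker not in row]


def category_to_number(values):
    """Map each distinct string to an integer code.
    Assumption: codes follow the order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values]


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def preprocess(rows):
    """Apply the steps in the order described for the ANN model."""
    rows = drop_missing(rows)
    normalized_columns = []
    for column in zip(*rows):
        if all(isinstance(value, (int, float)) for value in column):
            numeric = [float(value) for value in column]  # integer -> double
        else:
            numeric = [float(code) for code in category_to_number(column)]
        normalized_columns.append(min_max(numeric))
    # Turn the list of columns back into a list of rows.
    return [list(row) for row in zip(*normalized_columns)]


if __name__ == "__main__":
    # Made-up rows with the shape (age, workclass, hours per week).
    sample_rows = [
        (39, "State-gov", 40),
        (50, "Private", 13),
        (38, "?", 40),
        (28, "Private", 45),
    ]
    for row in preprocess(sample_rows):
        print(row)
```

Min-max scaling of integer-coded categories treats the codes as if they were ordered, which is a property of this encoding rather than of the data, so the coded columns are best read as arbitrary labels.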