Book Notes: Data Mining For Dummies

  2017-11-23


Introduction

written for
  business users
  heard a little about dm

ch01: Catching the Data-Mining Train

Getting Real about DM
  Not your professor's statistics
  value of dm
    ex of information
      retailer loyalty program
        which customers are likely to spend 
          a lot
          a little
        based on information gathered from first visit
      manufacturer: accidental release of toxic materials
        prevent dangerous accidents
      insurance: 
        an office processes certain claim types more quickly
        right place for best practices
      advertising
        which ad works better?
          one that has female face, or male face
          many images vs. a few
          same copy but different layouts
Doing what DM do
  focusing on business
  understanding their time
    most time: data preparation
  process
    crisp-dm
      business understanding
      data understanding
      data preparation
      modeling
      evaluation
      deployment
  making models
    report 
      can show
        sales down by region, channel
        declines are widespread?
      cannot
        why sales declined
        what actions to take
    model
      factors that impact sales
      actions that increase sales
  understanding mathematical models
  putting information into action
Discovering Tools and Methods
  visual programming
  working quick
  testing

ch02: A Day in Your Life As data Miner

Starting your day off right
Understanding your business
understanding your data
  project
    predict land ownership
  explore data and document it
    ranges of variables
    missing values
    not all variables are documented
    ex
      variable name | type | missing | range/summary | min | max | median
      taxkey | integer | 30 | histogram |
    describe variable summaries
      ex
        variable | description
        bi_viol 
          description: unknown
          type: string
          range: x to xxx
          missing: 0
          assessment: not good for modeling. all cases have same value. reason unknown
          next: won't use in this project
        taxkey
          assessment: small number of cases missing. some have less than 10 digits. probably due to leading zeros
          next: clean this variable
        ca_class
          assess: good
Deriving new variables
  flow
    read csv -> select atributes -> filter examples -> cut -> generate attributes -> select attributes -> sample -> write csv
  filtering out cases
    condition: no missing
  cutting variables down to five characters
    attribute filter type: subset
    attributes: ...
    first character: 1
    last character index: 5
  functions for generating new variables
    attribute name: ntlocal
    function expressions:
      if ( geo_zip_code == owner_zip, 0, 1 )
  discard variables no longer needed
  balancing the data
    sample size per class
      class | size
      yes | 4000
      no | 4000
Modeling Your Data

ch03 Teaming Up To Reach Your Goals

Nothing could be finer than to be a data miner
  You can be a data miner
    ex: 2
      public safety: NY Fire Department
        identify factors that put buildings at risk for fire
        output: risk score for 300K buildings
        use: inspect building with risk
      retail
        amazon.com
        individualize product recommendations
        test functional and cosmetic aspects of website
      medical and survey research
        smoking
        identify messages that effectively discourage youth from smoking
  Using the knowledge you have
Data Miners Play Nicely with Others
  Cooperation is a necessity
  Oh, the people you'll meet
Working with Executives
  Greetings and elicitations
  Lining up your priorities
  Talking data mining with executives

ch04 Learning the Laws of Data Mining

1. law: business goals
  business objectives are the origin of every data mining solution
2. law: business knowledge
  business knowledge is central to every step
3. law: data preparation
  data preparation is more than half of every dm process
4. law: right model
  right model for a given application can only be discovered by experiment
  NFL-DM: no free lunch for data miner  what is a model
    mathematical relation
      represents a pattern
        observed in data
  Tom Khabaza
5. law: pattern
  there are always patterns
  successful exploration begins with a goal
    cook, peary -> north pole
6. law: amplification
  dm amplifies perception
    enables you to understand your business better than without it
    like a magnifier or microscope
7. law: prediction
  prediction increases information locally by generalization
  ex
    customer enters store
    how much will customer spend?
    you don't know him
      best estimate: average amount
    he heads for electronics
      estimate: higher
8. law: value
  value of dm results is not determined by accuracy or stability of predictive models
  dm
    don't fuss over theory
    uses testing rather than statistical theory to justify
  stats
    fuss over theory
    accuracy and stability important
9. law: change
  all patterns are subject to change

ch05: Embracing the Data-Mining Process

Whose standard is it, anyway?
  crisp-dm standard
    iterative
    with smaller cycles
  documenting your work
Business Understanding 
  identifying your business goals
    problem that management wants to address
    business goals
    constraints (limitations)
    impact (how problem and possible solutions fit in with the business)
  deliverables
    background
      2-3 paragraphs
      ex
        our client, regional planning commission, wants to influence property use
          only advisory
          no independent power
        best opportunity: when property changes hands
          best time: before property is about to change ownership
        factors believed to indicate change of ownership:
          nonlocal ownership, code violations, foreclosure...
    business goals
      broader goal than dm project
      ex
        increase sales from ad campaign by 10%
    business success criteria
      how the results will be measured
      get quantitative criteria
  assessing your situation
    go fact-finding
    explain issues 
    deliverables
      inventory of resources
        people, data, software
      requirements, assumptions, and constraints
        schedule for completion
        legal and security obligations
        requirements for acceptable finished work
      risks and contingencies
        risk for delay
          contingency plan for them
      terminology
        business terms, dm terms
      costs and benefits
  defining your data-mining goals   
    deliverables
      data-mining goals
        models, reports, presentations, datasets
      data-mining success criteria
        define in quantitative terms
          model accuracy or predictive improvement
        if qualitative:
          who will make the assessment
  producing your project plan
    deliverables
      project plan
        step by step action plan
        schedule
        for each step
          required resources, 
          inputs (data), 
          outputs (model, data, report)
          dependencies
      initial assessment of tools and techniques
Data Understanding
  gathering data
    deliverables
      initial data collection report
        verify that
          you acquired data or
          gained access to data
            tested data access process
            verified data exists
        work needed
          outline data requirements
            types of data
            with details
              time range, data formats
          verify data availability
            if some data unavailable
              how to substitute it
              narrowing the scope
              gathering new data
          define selection criteria
            data sources you will use
            which tables, fields etc.
        you must actually obtain data
          import it to dm tool
          make trials
          possible issues
            limits on cases/memory
            inability to read data formats
            imperfections of data
  describing data
    deliverables
      data description report
        sourtce and formats of data
        number of cases
        number and descriptions of fields
        suitabilityof data for dm goals
  exploring data
    examine data more closely
    data exploration
      range of values, distributions
    deliverables
      data exploration report
        distributions, summaries, data quality problems
  verifying data quality
    deliverables
      data quality report
        data you have
        minor, major quality issues
        possible remedies
          ex: alternative data resource
Data Preparation
  5 tasks
    selecting data
      deliverable
        rationale inclusion and exclusion
          what data will be used or not
            reasons based on
              relevance
              data quality
              technical issues
              suitability of data formats
    cleaning data
      deliverables
        data cleaning report
          in excruciating detail
          every decision and action to clean data
    constructing data
      deliverables
        derived attributes
          new fields constructed
            how, why
        generated records
          new cases (rows) 
            how, why
    integrating data
      deliverables 
        merged data
          how performed
    formatting data
      deliverables
        reformatted data
          how performed
Modeling
  most liked
  4 tasks
    selecting modeling techniques
      deliverables
        modeling technique
          specify the technique
        modeling assumptions
    designing tests
      how well model works
      avoid overfitting
      holdout data
        not used during model-training process
      deliverables
        test design
          not elaborate
    building models
      deliverables
        parameter settings
        model descriptions
          describe model
            type of model (linear, neural)
            variables used
          how it is interpreted
          difficulties encountered
        models
    assessing models
      deliverables
        model assessment
        revised parameter settings
Evaluation
  3 tasks
    evaluating results
      deliverables
        assessment of results
          did you reach the business goals?
        approved models
    reviewing the process
      spot issues overlooked
      how to improve process?
      deliverables
        review of process report
          outline review process
          findings and concerns for immediate attention
            steps overlooked or should be revisited
    determining the next steps
      recommendations for next move
      deliverables
        list of possible actions
        decision
Deployment
  4 tasks
    planning deployment
      deliverables
        deployment plan
          steps required
            instructions
    planning monitoring and maintenance
      deliverables
        monitoring and maintenance plan
    reporting final results
      deliverables
        final report
          summary: entire project 
            assemble all reports
            add overview
        final presentation
    reviewing project
      deliverables
        experience documentation report

ch06: Planning for Data Mining Success

Setting the Course with Formal Business Cases
  business case
    to justify costs, prepare business case
    what is
      outlines business problem
      proposed plan to address it
      benefits and costs
    helps you too
      clarifies thinking
  Satisfying the boss
  Minimizing your own risk
Building Business Cases
  Elements of the business case
    elements
      background
        what organizations is involved?
        what is its business?
      problem statement
        what is wrong?
        when did it start?
        whom does it affect?
        is the cause known?
        is this a common or unusual type of problem?
      action alternatives
        what solutions are suggested?
        benefits and costs?
      preferred action
        best?
        make your case
      connection of preferred action with strategic goals
    benefits
      expected benefits
      mechanism
        how will action cause benefits?
      metrics?
        how to measure benefit
    costs
      costs
      cost of taking no action
  Putting it in writing
    executive summary
      for all cases of 3+ pages
  The basics on benefits
Avoiding the Failure Option

ch07: Gearing Up with the Right Software

Putting DM Tools in Perspective
Evaluating Software

ch08: Digging into Your Data

Focusing on a Problem
Managing Scope
Using Your Organization's Own Data
  data collected from common business activities
    research
      competitor product information
      experimental and test data
    manufacturing
      process data
      procurement records
      production records
      inspection and test records
    marketing
      competitor marketing information and sales data
      campaign data
      marketing cost data
    sales
      sales activity
      sales data
      customer information
    fulfillment
      packaging records
      shipping records
      shipping complaints
    customer service
      customer interaction records
      product and service complaints
      service issues
    technical support
      support requests
      product problem reports
      design and other product suggestions
    training
      staff training records
      customer training records
      certification and other credentialing records
    accounting 
      bills
      payments 
      audit records
      taxes collected and paid

ch09: Making New Data

Loyalty Programs
  loyalty program
    agreement between business and customers
    customers allow business to track purchases
    business offers rewards
  your data bonanza
    data elements in retail sector
      customer location
      products purchased
      combinations of products purchased together
      prices paid
      list of everyday prices 
      coupon or other discount offer used
      time
      detailed product descriptions
      pages/products viewed
      time on site
      timing of site visits
      product reviews and information sharing
      referrals
      offers or ads customer viewed
      social network details - people customer knows
    what is important to a particular decision maker?
      how to figure it out?
      learn executive's responsibilities
      find out metrics most important to his survival
      mine data for clues what actions could increase sales
        ex: loyalty programs
          characteristics of customers 
            who buy large
            increase the amount
          growing customer segments
          combinations of products bought together
          promotions that work better than others
          marketing channels more cost-effective
          shopper behavior patterns (instore and online) that affect sales
          unexpected factors that influence sales
    warehouse clubs
      people pay to be a member
Testing
  experimenting in direct marketing
    most common application for experiments
    names:
      A/B tests
      split tests
    ex
      retailer sends emails to customers who haven't purchased in 24 hours
      changes to email message improve response?
  direct marketing
    everything is direct marketing if it is an action per person
Microtargeting to win elections
  microtargeting
    organized survey research, testing
      to deliver personalized campaign messaging
  Treating voters as individuals
  Enhancing voter data
    new information
      demographics
      occupation
      memberships
      home, auto ownership
      permits
      magazine subscriptions
  Developing your own test data
Surveying the Public Landscape
Getting into the Field
One Challenge, Many Approaches

ch10: Ferreting Out Public Data Sources

Exploring Public Data Sources
  www.fedstats.gov
    find agency by name or subject
  www.data.gov
    not a data source
    information about what data is available
  most popular data sources
    climate data online
    consumer complaint database
    noaa national weather service
    federal student loan program data
    state education data profiles
    social media monitoring metrics
    food access
    trade in goods and services
    campus crime data
    dropout and completion of schools
  Bureau of Economic Analysis
    www.bea.gov
    part of Department of Commerce
      12 agencies
    data sources
      balance of payments
      foreign direct investment
      gdp
      industry data
  Bureau of Justice Statistics
    www.bjs.gov
      data sources
        crime and victims
        drugs
        criminal offenders
        courts
        corrections
        employment 
  Bureau of Labor Statistics
    www.bls.gov
    data sources
      compensation
      consumer expenditures
      workers
      cost trends
      foreign labor
      unemployment
      productivity
      wages
  Bureau of Transportation Statistics
    www.rita.dot.gov/bts
    data sources
      airlines
      commodity flow
      freight
      travel
  Census Bureau
    www.census.gov
    part of Commerce
    lives of Americans
    data sources
      business ownershipg
      international trade
  Economic Research Service
    www.ers.usda.gov
    data
      biotech
      agribusiness
      crops
      trade
  Energy Information Administration
    www.eia.gov
    D of Energy
  Environmental Protection Agency
    www.epa.gov
  Ofice of Research, Analysis and Statistics
    www.irs.gov/uac
    part of Internal Revenue Service IRS
      tax collection agency
    data
      tax
  National Agricultural Statistics Service
    NASS
    www.nass.usda.gov
  National Center for Education Statistics
    nces.ed.gov
  National Center for Health Statistics
  National Science Foundation
  Office of Management and Budget
  Office of Retirement and Disability Policy
Governments of world
  Offstats
    stats of world govs
  OECD
  US open gov portal
  UN
  EU
US State and Local govs
  freedom of information act
  pew charitable trusts
  us counties
    www.data.gov/counties
  us cities
    www.data.gov/cities

ch11: Buying Data

Peeking at Consumer Data
  Axciom
    aboutthedata.com
    individuals
    households
Beyond Consumer Data
Desperately Seeking Sources
  professional associations for making contacts
    marketing association
    list of data vendors: appendix c

ch12: Getting Familiar with Your Data

Importing Data
  in Knime
  in Weka
  procedure
    CSV reader
stats
  nodes > statistics

ch13: Dealing in Graphic Detail

Eyaballing variables with histograms
  RapidMiner
    data summary tool
    charts > bar chart
  relating variables with scatterplots
    mpg vs. horsepower
  interacting with scatterplots
    select area
    sampling randomly
    dataset created by selection in graph
    selection shapes:
      rectangle, free, polyline

ch14: Showing Your Data Who’s Boss

Rearranging Data
  variable order
sorting

ch15: Your Exciting Career in Modeling

ch16: Data Mining using Classic Statistical Methods

ch17: Mining Data for Clues

market basket analysis
  understanding the metrics
    diagnostics
    ranking
    lift

ch18: Expanding Your Horizons

Using meta models
  ensemble model
  using 2+ modeling techniques together
Widening your range
  tackling text
    text mining
    uses
      sentiment analysis
        ex
          paypal: will the customer close his account?
      classification
      entity extraction
        such as names or places
  detecting sequences
    ex
      shopper's sequence of actions in market
        dramatic effect on sales
        ways of enticing customers pick up something
      financial modeling
      intrusion detection
      genetics research
  working with time series
    uses
      sales, economic forecasting
      signal analysis
      astronomy
      epidmiology

ch19: Ten Great Resources for Data Miners

society of data miners
  www.socdm.org
kdnuggets
  kdnuggets.com
  news site
all analytics
  allanalytics.com
nytimes
forbes
  authors
    gil press
    piyanka jain
    naomi robbins
    lisa arthur
smartdata collective
  curated content
crisp-dm
nate silver
  fivethirthyeight.com
meta's analytics articles
bit.ly/metaarticles
gallery of statistics jokes

ch20: Ten Useful Kinds of Analysis That Complement Data Mining

business analysis
conjoint analysis
  think of product manager
    to attract customers
    what are features most appealing?
  role of conjoint analysis
    getting info about consumer preferences
design of experiments
marketing mix modeling
  which combination of media provides best value for your needs
  how to allocate spending
operations research
reliability analysis
  psychometrics
    consistency in measurement
  engineering
statistical process control
social network analysis
structural equation modeling
  what factors cause consumers to be satisfied
  how to influnece them to improve satisfaction
web analytics