COSC 4570/5010 - (Big) Data Mining

University of Wyoming - Department of Computer Science

Spring 2017

Course Information

Location: Enzi STEM 315
Meeting Time: M/W/F 8:00am - 8:50am
Office Hours: M 9:00-10:30 & T 9:30-11:00 By Appointment (bit.ly/dmb-ohs)

Instructor: Dr. Mike Borowczak
Office: Engineering 4071B
Email: mborowcz@uwyo.edu
Website: cs.uwyo.edu/~mborowcz/cosc-4570

Course Objectives and Learning Outcomes

This course explores data mining theories, tools and real-world applications. This course will consist of traditional lectures, flipped classroom activities, research surveys, mini-projects (homework), a final culminating project and exam.

Learning Objective:

Students will be able to:

Course Topics:

Required Texts and Materials

This data mining course uses the 2nd edition of Mining of Massive Datasets (MMDS) by J. Leskovec, A. Rajaraman and J. Ullman, which is available for free at http://www.mmds.org/. Print versions are also available. In addition to MMDS, the following texts are recommended supplemental resources: Computer Science Theory for the Information Age (CSFTIA) by J. Hopcroft and R. Kannan and The Elements of Statistical Learning (TESL) by T. Hastie, R. Tibshirani and J. Friedman. You are expected to complete the assigned reading prior to class.

Data Science is a constantly evolving field, so we’ll also use current and seminal papers, forum posts, documents and other work to ground our discussion. You’ll be expected to complete the assigned reading prior to class; otherwise our discussions will be rather one-sided. If I make a mistake, or if you have a question, ask, and let’s get on the same page. I won’t have all the answers to all of your questions. In those scenarios, you can either 1) wait for me to find the answer or 2) find the answer and build up our community of knowledge. We’ll use Piazza for collaboration and discussions on class topics, homework, and projects. Our Piazza course site is: https://piazza.com/uwyo/spring2017/cosc4570_5010.

COSC 4570 homework requires the use of a computer, preferably your own, with a virtual machine player, e.g., VMware Player (Windows/Mac) or KVM (Linux). The CS computer labs should have the needed virtual machine software, but it may be impractical to download/save VM images to those accounts, so consider investing in a larger USB external drive to store your VM images. There is a chance, depending on funding, that we may use an external service (Qubole) to spin up and host data science machines using Amazon Web Services. If so, you’ll use your @UWYO email account for access.

Assessments

Participation (100 points)
Your attendance and participation in class are measured through informal assessments; -3pt per absence after the 3rd “unexcused” absence.
Gradiance Exam Practice (100 points - 10 @ 10 pts each)
This course uses Gradiance to automatically score and provide feedback on common questions related to the content of this course. You have unlimited attempts to solve each of the ten problem sets. While these problem sets are available throughout the semester, it’s recommended that you complete the Gradiance problem sets prior to your programming homework assignments. Sign up online (www.newgradiance.com) using your @UWYO email and the following class token: 81BD58.
Programming Homework (300 points - 6 @ 50 pts each)
Each of the programming homework assignments covers one of the major topics of the course. A portion of most homework assignments requires some amount of programming, generally in the language of your choice (except when the use of a particular language is beneficial).
Final Project & Presentation (400 points)
A self-selected, researched and implemented project and presentation leveraging data from real world partners (Safeway-Albertsons in 2017).
Final Exam (100 points)
Open note final, situated in a set of real-world data mining problems, Friday May 12th 8:00am – 10:00 am.

Grade Policies

Your grade is computed as a direct unweighted sum of all the in-class participation, homework, mini-projects, final project, final presentation, and exam scores. The following point boundaries are used to determine final grades.


Points     Letter Grade
>899       A
800-899    B
700-799    C
600-699    D
<600       F

If necessary, any or all results can be curved. The curve can only ever be upwards (i.e., only ever in your favor). Numerical grades are rounded to the nearest whole number (that is, 799.5 becomes 800 and a B; 799.4 becomes 799 and a C). I may relax these grade boundaries, but only ever in your favor (i.e., it might be possible that the A grade boundary ends up being 880 instead of 900).
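The rounding and boundary rules can be sketched in a few lines of Python. This is only an illustration, not the official grade calculation: the names `GRADE_FLOORS` and `letter_grade` are made up for this example, and no curve or relaxed boundary is applied.

```python
import math

# Boundaries copied from the grade table: (minimum rounded points, letter).
# ">899" is the same as ">= 900" once the total is a whole number.
GRADE_FLOORS = [(900, "A"), (800, "B"), (700, "C"), (600, "D")]


def letter_grade(total_points: float) -> str:
    """Map a point total (out of 1000) to a letter grade (hypothetical helper)."""
    # Round half up, matching the example in the text (799.5 becomes 800).
    # Python's built-in round() sends halves to the nearest even integer,
    # so floor(x + 0.5) is used instead.
    rounded = math.floor(total_points + 0.5)
    for floor, letter in GRADE_FLOORS:
        if rounded >= floor:
            return letter
    return "F"


print(letter_grade(799.5))  # B
print(letter_grade(799.4))  # C
```

Because any curve is applied only in the student's favor, the letter returned here is best read as a lower bound on the final grade.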

A summary of your grades will be posted on UW’s WyoCourses site. Please review your scores and report any discrepancies to me.

Late Work

Late work is only accepted for credit within 24 hours of the assignment due date. The student receives a maximum of 75% of the earned points for late work submitted within 24 hours of the due date. E.g., if an assignment is worth 25 points, is submitted 22 hours after the due date, and would have received 20 points if submitted on time, the late score would be computed as $\frac{20}{25} \times \frac{3}{4} = \frac{15}{25}$.

Late work that is submitted after the due date and prior to the final exam will remain ungraded until the end of the semester. At the end of the semester, the late work will only be graded, at the sole discretion of the instructor, if it affects whether you pass or fail the course. The maximum course grade you can receive in this scenario is a C.
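As a rough sketch of the late-work arithmetic, the Python below reproduces the worked example (20 earned points, 22 hours late). The function and constant names are hypothetical, and two readings are assumed: the 75% maximum is applied as a flat multiplier, and work more than 24 hours late is treated as ungraded (returned as `None`).

```python
from typing import Optional

LATE_WINDOW_HOURS = 24  # late work earns credit only inside this window
LATE_CREDIT = 0.75      # assumption: the 75% maximum is applied as a flat factor


def late_score(earned_points: float, hours_late: float) -> Optional[float]:
    """Return the credited points, or None if the work stays ungraded."""
    if hours_late <= 0:
        return earned_points                # submitted on time
    if hours_late <= LATE_WINDOW_HOURS:
        return earned_points * LATE_CREDIT  # inside the 24-hour window
    return None                             # assumption: ungraded until semester end


print(late_score(20, 22))  # 15.0
```

Returning `None` rather than 0 keeps "not graded" distinct from "graded and earned nothing", since such work may still be graded at the instructor's discretion.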

Miscellanea: Extra Credit, and Expectations

No separate extra credit assignments will be offered or made available. Rather, individual homework assignments may contain an opportunity to gain extra credit.

Attendance/Participation Policies

It is expected that you attend class regularly, and your grade will be affected positively if you are present in class. As an active and engaged learner, you are expected to attend and arrive punctually to our scheduled classes. Engagement throughout the class is critical to your ultimate learning. Your participation and attendance will contribute 10% of your overall score.

  1. University-sponsored absences are cleared through the Office of Student Life;
  2. Student Health or your private physician may issue a statement giving the dates of a student’s confinement, whether in the home or hospital, due to illness;
  3. Roads & Weather: if you regularly travel from outside of Laramie, please let me know now. If the University remains open, and the road conditions prevent you from attending physically, we can set up some web-based call given sufficient notice;
  4. If you have a conflict (expected or not), please let me know as soon as possible;
  5. After the third “unexcused” absence, -3pt per class.
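The absence penalty in item 5 can be illustrated with a short Python sketch. It covers only the attendance deduction, not the informal participation assessments. All names are hypothetical, and capping the deduction at the 100 participation points is an assumption rather than stated policy.

```python
FREE_UNEXCUSED_ABSENCES = 3  # the first three unexcused absences carry no penalty
POINTS_PER_ABSENCE = 3       # -3pt for each unexcused absence after the third
PARTICIPATION_POINTS = 100   # participation is worth 100 points in total


def attendance_deduction(unexcused_absences: int) -> int:
    """Points deducted from participation for unexcused absences."""
    penalized = max(0, unexcused_absences - FREE_UNEXCUSED_ABSENCES)
    # Assumption: the deduction cannot exceed the participation total.
    return min(PARTICIPATION_POINTS, POINTS_PER_ABSENCE * penalized)


print(attendance_deduction(5))  # 6
```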

Academic Honesty

The University of Wyoming is built upon a strong foundation of integrity, respect and trust. All members of the university community have a responsibility to be honest and the right to expect honesty from others. Any form of academic dishonesty is unacceptable to our community and will not be tolerated. Teachers and students should report suspected violations of standards of academic honesty to the instructor, department head, or dean.

Any and all suspicions of academic dishonesty shall be investigated in accordance with UW Regulation 6-802 (www.uwyo.edu/generalcounsel/_files/docs/uw-reg-6-802.pdf). Evidence of academic dishonesty will result in one or more of the recommended sanctions, in accordance with UW Regulation 6-802 6.A.

Academic Civility

“There are several misconceptions about intellectual diversity and academic freedom... the narrower concept of academic freedom does not mean the freedom to say anything that one wants. For example, freedom of speech does not mean that one can say something that causes physical danger to others. In a learning context, one must both respect those who disagree with one and also maintain an atmosphere of civility. Anything less creates a hostile environment that limits intellectual diversity and, therefore, the quality of learning.”
   Association of American Colleges and Universities
   Board of Directors Statement on Academic Freedom and Responsibility 12/21/05

Disability Support Services

If you have a physical, learning, sensory or psychological disability and require accommodations, please let me know as soon as possible. You will need to register with, and possibly provide documentation of your disability to University Disability Support Services (UDSS) in SEO, room 109 Knight Hall. You may also contact UDSS at (307) 766-6189 or udss@uwyo.edu. Visit their website for more information: www.uwyo.edu/udss.

Expectations

Student’s Role & Expectations

You are expected to treat all members of the class and your instructor with respect. Plan to attend class, take an active part in discussion or teamwork, and complete all readings and assignments by the deadlines listed in the syllabus.

Professor’s Role & Expectations

I will follow a professional code of behavior and responsibility. I will treat all members of the class with respect. I will attend class and take an active part in your learning. In each class I will ask: 1) What do I want you - my students - to learn? 2) How will you learn it? 3) What do I want you to do with the information? and 4) How will I assess your learning?

Syllabus Change Policy

This syllabus is only a guide for the course and is subject to change with advance notice.

Course Schedule

The course has 39 scheduled meetings. Two weeks in March have no scheduled meetings: one for project work and one for spring break. The course breaks down into roughly six 2-3 week overarching topics, including: an overview of data mining and statistics, similarity, clustering, dimension reduction, recommenders, links and graphs, streams, and large scale machine learning. There will be 5-6 programming homework assignments, an overarching project with checkpoints throughout the semester, and auto-graded content knowledge assessments.







Date        Mtg  Topic / Deadline              Reading

Mon Jan 23  1    Intro                         MMDS 1.1
Wed Jan 25  2    Stats                         MMDS 1.2+1.3.1
Fri Jan 27  3    Uplevel                       MMDS 1.3.2-1.3.6

Mon Jan 30  4    MapReduce                     MMDS 2-2.3
Wed Feb 1   5    MapReduce                     MMDS 2.4-2.5
Fri Feb 3   6    Similarity: Jaccard           MMDS 3-3.2

Mon Feb 6   7    Similarity: MinHash           MMDS 3.3
Wed Feb 8   8    Similarity: LSH               MMDS 3.4
Fri Feb 10  9    Similarity: Distances         MMDS 3.5
Sat Feb 11       HW1 Stats Due

Mon Feb 13  10   Similarity: SIFT, ANN vs LSH  MMDS 3.7
Wed Feb 15  11   Clustering: Overview          MMDS 7.1
Fri Feb 17  12   Clustering: Hierarchical      MMDS 7.2

Mon Feb 20  13   Clustering: K-Means           MMDS 7.3
Tue Feb 21       Project Proposal Due
Wed Feb 22  14   Clustering: CURE              MMDS 7.4
Fri Feb 24  15   DimRedux:                     MMDS 11.1
Sat Feb 25       HW2 Hashing Due

Mon Feb 27  16   DimRedux: PCA                 MMDS 11.2
Wed Mar 1   17   DimRedux: SVD                 MMDS 11.3
Fri Mar 3   18   DimRedux: CUR                 MMDS 11.4

Mon Mar 6        Project Work Week
Tue Mar 7        Data Collection Due
Wed Mar 8        Project Work Week
Fri Mar 10       Project Work Week
Sat Mar 11       HW3 Clustering Due

Mon Mar 13       Spring Break
Wed Mar 15       Spring Break
Fri Mar 17       Spring Break

Mon Mar 20  19   Recommender: Content-based    MMDS 9.1-9.2
Wed Mar 22  20   Recommender: Filtering        MMDS 9.3
Fri Mar 24  21   Recommender: Redux Revisited  MMDS 9.4
Sat Mar 25       HW4 Redux Due

Mon Mar 27  22   Recommender: Netflix          MMDS 9.5
Tue Mar 28       Intermediate Report Due
Wed Mar 29  23   Links: Page Rank              MMDS 5.1
Fri Mar 31  24   Links: Page Rank              MMDS 5.2-5.3

Mon Apr 3   25   Links: Link Spam/Hubs         MMDS 5.4-5.5
Wed Apr 5   26   Massive Graphs: Social Net    MMDS 10-10.2
Fri Apr 7   27   Massive Graphs: Communities   MMDS 10.3-10.5
Sat Apr 8        HW5 Recommend Due

Mon Apr 10  28   Massive Graphs: Simrank       MMDS 10.6
Wed Apr 12  29   Massive Graphs: Triangles     MMDS 10.7
Fri Apr 14  30   Massive Graphs: Neighborhood  MMDS 10.8

Mon Apr 17  31   Streams: Model                MMDS 4-4.1
Wed Apr 19  32   Streams: Sample + Filter      MMDS 4.2-4.3
Fri Apr 21  33   Streams: Count + Est          MMDS 4.4-4.5
Sat Apr 22       HW6 Graphs Due

Mon Apr 24  34   Streams: Windows              MMDS 4.6-4.7
Tue Apr 25       Report Due
Wed Apr 26  35   Lg. Scale ML: Model           MMDS 12-12.1
Fri Apr 28  36   Lg. Scale ML: Perceptrons     MMDS 12.2
Sat Apr 29       Poster Outline Due

Mon May 1   37   Lg. Scale ML: SVM             MMDS 12.3
Wed May 3   38   Lg. Scale ML: NN+             MMDS 12.4-12.5
Fri May 5   39   Poster & Presentation Due

Mon May 8   40
Wed May 10  41
Fri May 12  42   Final Exam @8AM







Homework

Each assignment will include a specific grading rubric. Generally, you will be expected to turn in:

The preference for code submissions is a link to a public git/cvs/svn repository. Alternatively, provide a zip file containing all code and dependencies (with a Makefile if needed). Homework is due no later than 2PM (Mountain) on the given due date (generally a Saturday).

HW1 - Statistics

This assignment will be available no later than January 27th and will be due on February 11th.

HW2 - Document Hashing

This assignment will be available no later than February 10th and will be due on February 25th.

HW3 - Clustering

This assignment will be available no later than February 24th and will be due on March 11th.

HW4 - Dimension Reduction

This assignment will be available no later than March 3rd and will be due on March 25th.

HW5 - Recommender System

This assignment will be available no later than March 21st and will be due on April 8th.

HW6 - Graphs

This assignment will be available no later than April 7th and will be due on April 22nd.

Research Project

Objective: Data mine real world data (sets), to provide something new to the community.

Overview

This course will provide you with an overview of Data Mining fundamentals, but in order to truly understand the nuances and complexity of Data Mining, you have to work on real data, solving real problems. This project gives you a real-world experience that you can bring to an interview, your own research, or some personal project. As with any real-world endeavor, you must be able to effectively communicate your work to your peers (experts and non-experts alike).

You will work in teams of 2 (teams of n=1 are highly discouraged barring special permission/requirements that should be brought to my attention as soon as possible).

Deadlines

All project components, except for the poster presentation, are due no later than 11:59PM (Mountain) on the given due date (generally a Tuesday, except for the poster components). The poster presentation will be held during our final day of class. In the event of a weather calamity day, the exam period will be split to accommodate the poster presentations. Project guidelines and scoring rubric will be provided no later than February 4th.