CS246 Stanford 2022. Jure Leskovec, Stanford University, jure@cs.

CS246 Stanford 2022, Homework 1.
This course discusses data mining and machine learning algorithms for analyzing very large amounts of data.
Companion course CS246H: There is a companion course, CS246H, which is completely independent from CS246 and covers Hadoop programming.
The OAE will evaluate the request, recommend accommodations
Prior to his appointment at Stanford in 1979, he was a member of the technical staff of Bell Laboratories from 1966 to 1969, and on the faculty of Princeton University between 1969 and 1979. From 1990 to 1994, he was chair of the Stanford Computer Science Department.
All deadlines are at 11:59pm PST.
There will be 10 Colabs in total: Colab 0 (Spark tutorial), and Colab 1 to 9 (released weekly).
Modern AI trains very large models with a huge amount of data. GPT-3 is trained with 500B tokens, more than 600 GB of data. The model has 175B parameters, requiring 350 GB
CS246 is the first part in a two-part sequence, CS246 and CS341.
BIOE-PHD - Bioengineering (PhD); BMDS-MS - Biomedical Data Science (MS); BMDS-PHD - Biomedical Data Science (PhD)
For external enquiries, personal matters, or in emergencies, you can email us at cs246-win1920-staff@lists.
You have till 1/25 to solve it.
Four of those rows hold 0 and the other two rows hold 1.
Clustering.
Final exam: 30%. Exact format will be announced later.
Announcements: 1/09: The first class will be held at 9.
you can email us at cs246-win2122-staff@lists.
Jure Leskovec, Stanford University, jure@cs.
Course outline.
Contacts: Use Campuswire to post questions: https://campuswire.
If you wish to view slides further in advance, refer to the 2022 course offering's slides, which are mostly similar.
Mining Massive Data Sets.
Signature:
Stochastic Gradient Descent (SGD) is an example of a streaming algorithm. In machine learning we call this online learning. Allows for modeling problems where we have
A general many-to-many mapping (association) between two kinds of things. But we are interested in connections among “items”, not “baskets”.
We will learn to solve real-world problems: recommender systems, market basket analysis, spam detection, duplicate document detection. We will learn various “tools”:
Stanford students can see them here.
Date: Monday, March 11, 2:00 PM to Wednesday, March 13, 2:00 PM Pacific Time. Logistics: Administered on Gradescope; 3 hours long (timer starts once you open the exam).
$d(\cdot)$ is a distance measure if it is a function from pairs of points $x, y$ to real numbers such that: $d(x, y) \ge 0$; $d(x, y) = 0$ iff $x = y$; $d(x, y) = d(y, x)$
We will learn to mine different types of data: data is high dimensional; data is a graph; data is infinite/never-ending; data is labeled.
Baskets = sentences; items = documents containing those sentences. Items that appear together too often could represent plagiarism. Notice that items do not have to be “in” baskets.
Question 9 (6 points): This question concerns the minhash values for a column that contains six rows.
Minhashing: Convert large sets into short signatures, while preserving similarity.
High dim. data: locality sensitive hashing, clustering, dimensionality reduction. Graph data: PageRank, SimRank, network analysis, spam detection. Infinite data
Studying CS 246 Mining Massive Datasets at Stanford University? CS145 Fall 2022 Practice Final with Solutions, 8 documents.
Types of queries one wants to answer on a data stream: sampling data from a stream (construct a random sample); filtering a data stream (select elements with property x from the stream). (Jure Leskovec, Stanford CS246: Mining Massive Datasets, 5/4/2021)
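The distance-measure conditions listed above can be checked numerically on a small sample. The sketch below is illustrative only: the helper names `jaccard_distance` and `satisfies_distance_axioms` are hypothetical, the choice of Jaccard distance is an assumption (the text does not name a specific distance here), and the triangle inequality is included as the standard fourth condition, which the list above does not spell out.

```python
from itertools import product


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def satisfies_distance_axioms(points, d, tol: float = 1e-12) -> bool:
    """Brute-force check of the distance axioms on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality (assumed 4th rule)
            return False
    return True


sample_sets = [frozenset("abc"), frozenset("bcd"), frozenset("a"), frozenset()]
print(satisfies_distance_axioms(sample_sets, jaccard_distance))
```

Passing the check on a sample does not prove that a function is a distance measure; it only fails to find a counterexample. Squared difference on numbers, for instance, is rejected by the same check because it violates the triangle inequality.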
2016: 2013: [Final exam with solutions] 2011: [Final exam with solutions]
Assignments
(Jure Leskovec & Mina Ghashami, Stanford CS246: Mining Massive Datasets, 1/18/22, http://cs246.)
node2vec: Scalable Feature Learning for Networks. Aditya Grover, Stanford University, adityag@cs.
We look forward to seeing you there!
1/09: First Gradiance quiz has been posted.
Data is a graph: Link Analysis: PageRank, TrustRank, Hubs & Authorities.
A general many-to-many mapping (association) between two kinds of things. But we ask about connections among “items”, not “baskets”. Items and baskets are abstract:
What is $\min_x \sum_{(i,j) \in E} (x_i - x_j)^2$ really doing? Find sets A and B of about similar size.
A late period ends at 11:59pm Pacific Time on the following Monday. For example, if the assignment is due on Thursday, the late period expires the following Monday at 11:59pm Pacific Time.
Handouts. Sample Final Exams.
Contribute to wrwwctb/Stanford-CS246-2018-2019-winter development by creating an account on GitHub.
CS246 will discuss methods and algorithms for mining massive data sets, while CS341 (Advanced Topics in Data Mining) will be a project-focused advanced class with unlimited access to a large MapReduce cluster.
Locality-sensitive hashing: Focus on pairs of
Given a set of keys S that we want to filter: create a bit array B of n bits, initially all 0s; choose a hash function h with range [0, n); hash each member s ∈ S to one of n buckets, and set that bit to 1.
The availability of massive datasets is revolutionizing science and industry.
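The filtering steps described above (a bit array B of n bits, one hash function h with range [0, n), and setting B[h(s)] to 1 for every key s in S) can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `SingleHashBloomFilter` and the use of SHA-256 as the hash function are choices made here, not something the course text specifies.

```python
import hashlib


class SingleHashBloomFilter:
    """Bit array B of n bits plus one hash function h with range [0, n)."""

    def __init__(self, n_bits: int) -> None:
        self.n = n_bits
        self.bits = [0] * n_bits  # B, initially all 0s

    def _h(self, key: str) -> int:
        # Assumption: SHA-256 reduced modulo n stands in for the hash function h.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        self.bits[self._h(key)] = 1  # B[h(s)] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely not in S"; True means "possibly in S".
        return self.bits[self._h(key)] == 1


allowed = ["alice@example.com", "bob@example.com"]  # hypothetical key set S
bloom = SingleHashBloomFilter(n_bits=64)
for address in allowed:
    bloom.add(address)
print(all(bloom.might_contain(address) for address in allowed))
```

A key that was added is never rejected, so there are no false negatives. A key that was never added can still hash to a bit that is already 1, which is the false-positive case the filter trades for its small memory footprint.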
Problem: Find a maximum matching for a given bipartite graph (a perfect one if it exists). There is a polynomial-time offline algorithm based on augmenting paths (Hopcroft & Karp 1973,
Google’s goal: Maximize revenue. The old way: Pay by impression (CPM). Best strategy: Go with the highest bidder. But this ignores the “effectiveness” of an ad. The new way: Pay per click!
Practice Exam Submission: We opened a practice exam submission (3 questions from the 2019 exam) on Gradescope.
file records → weights as size of the file; IP addresses → weights as number of times the IP address
Colab 0 is solved in real time in the first Recitation Session video.
Course information: This course is the first part in a two-part sequence CS246/CS341 replacing CS345A: Data Mining.
The setting: a set of k choices (arms). Each choice a is associated with an unknown probability distribution $P_a$ supported in [0, 1]. We play the game for T rounds. In each round t:
During office hours, add yourself to the CS246 Winter 2024 queue, and join the CA Zoom link when it's your turn to be helped.
These questions require thought, but do not require long answers. Please be as concise as possible.
30am on Monday 1/9, in Gates B01.
$d$ needs to satisfy 4 rules:
A large set of items, e.
This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing frameworks, streaming systems and machine learning systems.
The entry $m_{ij}$ in row i and column j is 0, unless there is an arc from node (page) j to node i.
i, v) that creates D, D.
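The Web matrix entry described above (0 in row i, column j unless there is an arc from page j to page i) can be made concrete with a small sketch. The text above does not say what the nonzero entries are; the sketch assumes the usual column-stochastic convention of 1/k, where k is the number of arcs out of page j. The function names `web_matrix` and `dead_ends` are hypothetical.

```python
def web_matrix(n: int, arcs) -> list:
    """Build an n-by-n matrix M with M[i][j] nonzero only for arcs j -> i.

    Assumption: a nonzero entry equals 1 / out-degree(j).
    """
    unique_arcs = set(arcs)  # ignore repeated arcs
    out_degree = [0] * n
    for src, _dst in unique_arcs:
        out_degree[src] += 1
    m = [[0.0] * n for _ in range(n)]
    for src, dst in unique_arcs:
        m[dst][src] = 1.0 / out_degree[src]  # row = destination, column = source
    return m


def dead_ends(m: list) -> list:
    """Return the pages whose column is all zeros, i.e. pages with no out-arcs."""
    n = len(m)
    return [j for j in range(n) if all(m[i][j] == 0.0 for i in range(n))]


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 links nowhere.
M = web_matrix(3, [(0, 1), (0, 2), (1, 2)])
print(M)
print(dead_ends(M))  # [2]
```

Under this convention every column without a dead end sums to 1, while a dead-end column sums to 0, which is why dead ends need separate treatment in PageRank computations.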
For external enquiries, personal matters, or in emergencies, you can email us at cs246-win2324-staff@lists.
Also, please make sure to tag each part correctly on Gradescope so it is easier for us to grade.
For external enquiries, personal matters, or in emergencies, you can email us at cs246-win2425-staff@lists.
Logistics. Lectures are on Tuesday/Thursday 1:30-3:00 PST on Zoom (first two weeks) and in person in the NVIDIA Auditorium.
Course Information, Winter 2022. CS246: Mining Massive Data Sets. Instructor: Jure Leskovec. Co-Instructor: Mina Ghashami. Lectures: 1:30PM - 3:00PM Tuesday and Thursday in NVidia, Huang Engineering Center. Course website: https://cs246.
, images, movies, music
Most important computer applications have to reliably manage and manipulate datasets.
R: parent, left, right child datasets and maximizes:
You may come to Stanford to take the exam, or… Date: From Wed, Mar 18, 6 PM to Thu, Mar 19, 6 PM (PDT). Agree with your exam monitor on the most convenient 3-hour slot in that window of time.
Finding the appropriate features is hard. E.
, things sold in a supermarket. A large set of baskets. Each basket is a small subset of items, e.
Each one of them is worth 3%.
B[h(s)] = 1
During office hours, add yourself to the CS246 Spring'21 queue, and then the TA will send you a Zoom invite when it’s your turn to be helped.
High dim. data: locality sensitive hashing, clustering, dimensionality reduction. Graph data: PageRank, SimRank, community detection, spam detection. Infinite
(Jure Leskovec & Mina Ghashami, Stanford CS246: Mining Massive Datasets, 1/25/22)
For external enquiries, personal matters, or in emergencies, you can email us at cs246-spr2021-staff@lists.
Office Hours: Tuesday 1:00-2:30pm, Friday 10:30am-12:00pm.
All students (non-SCPD and SCPD) should submit their assignments electronically via Gradescope.
Given: High dimensional data points $x_1, x_2, \ldots$ For example: an image is a long vector of pixel colors. (Jure Leskovec, Stanford CS246: Mining Massive Datasets, 4/6/2021)
node2vec: Scalable Feature Learning for Networks. Aditya Grover, Stanford University, adityag@cs.
Processing the “Memory-Load” of points (2): Adjust statistics of the clusters to account for
Data is high-dimensional: Locality Sensitive Hashing.
1 Spark (25 pts). Write a Spark program that implements a simple “People You Might Know” social network friendship recommendation algorithm.
Jan 4, 2022 · The first meeting of the class will be on Tuesday, January 4, 2022.
CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with unlimited access to a large MapReduce cluster.
(Jure Leskovec, Stanford CS246: Mining Massive Datasets, 1/17/2012) [Hays and Efros, SIGGRAPH 2007]
TAs: Bahman Bahmani, Juthika Dabholkar, Pierre Kreitmann, Lu Li, Aditya Ramesh. Office hours: Jure: Tuesdays 9-10am, Gates 418. See course website for TA office hours.
Types of queries one wants to answer on a stream (we’ll do these on Wed): filtering a data stream (select elements with property x from the stream); counting distinct elements.
One of the two columns should have 1 at position y.
Mining Massive Data Sets, taught by Jure Leskovec.
CS246: Mining Massive Data Sets, Winter 2022.
Please read the homework submission policies at cs246.
And some distance function $d(x_1, x_2)$ which quantifies the “distance” between $x$
CS246 Final Exam, Winter 2019. Your Name: Your SUNetID (e.
1 Dead ends in PageRank computations (25 points). Let the matrix of the Web M be an n-by-n matrix, where n is the number of Web pages.
Winter 2017.
Import using the +Google Calendar button at the bottom to keep track of all the course events (Lectures, Homework, Office Hours, etc.
Academic accommodations: If you need an academic accommodation based on a disability, you should initiate the request with the Office of Accessible Education (OAE).
3 Proof by cases. Sometimes it’s hard to prove the whole theorem at once, so you split the proof into several
Add a new tree $f_t(x)$ in each iteration. Compute necessary statistics for our objective. Greedily grow the tree that minimizes the objective:
CS246: Mining Massive Data Sets, Winter 2020.
Standard dimensionality reduction methods: singular value decomposition (SVD). (Jure Leskovec & Mina Ghashami, Stanford CS246: Mining Massive Datasets, 2/17/22)
There are hidden, or latent, factors (latent dimensions) that, to a close approximation, explain why the values are as they appear in the data matrix.
Stochastic Gradient Descent (SGD) is an example of a streaming algorithm. In machine learning we call this online learning. Allows for modeling problems where we have
Note that when a candidate pin is visited by walks from only a single query pin then the count is unchanged.
Problem Set 3.
ABSTRACT: Prediction tasks over nodes and edges in networks require careful
We will be releasing HW1 today. It is due in 2 weeks (1/20 at 11:59 PM). The homework is long and requires proving theorems as well as coding. Please start early. We will also be releasing Colab 0 and Colab 1.
Images, Text/Speech: The modern deep learning toolbox is designed for simple sequences and grids.
Stanford Office of Community Standards has more information.
Set $x_A > 0$, $x_B < 0$ and then the value of $\lambda_2$ is 2(#edges between A and B).
The earliest and the most popular collaborative filtering method: derive unknown ratings from those of “similar” movies (item-item variant). Define similarity measures
Clustering in two dimensions looks easy. Clustering small amounts of data looks easy. And in most cases, looks are not deceiving. Many applications involve not 2, but 10 or
Shelf space is a scarce commodity for traditional retailers (also: TV networks, movie theaters, …). The Web enables near-zero-cost dissemination of information about products.
Given a set of keys S that we want to filter: create a bit array B of n bits, initially all 0s; choose a hash function h with range [0, n); hash each member s ∈ S to one of n
How to split? Pick the attribute and value that optimize some criterion. Regression: Find split (X.
Please write all answers in the space provided.
, 01234567): I acknowledge and accept the Stanford Honor Code.
For external enquiries, personal matters, or in emergencies, you can email us at cs246-spr2223-staff@lists.
2 challenges of web search: (1) The Web contains many sources of information. Who to “trust”? Trick: Trustworthy pages may point to each other! (2) What is the “best” answer to query
Reference Answers for Stanford's Winter 2022 CS246 Homework and Colab.
It meets Wednesdays 11:30AM - 1:20PM, in
1) Message computation. Message function. Intuition: Each node will create a message, which will be sent to other nodes later. Example: a linear layer $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}$
The Unreasonable Effectiveness of Data: In 2017, Google revisited the same type of experiment with the latest Deep Learning models in computer vision.
Given: High dimensional data points $x_1, x_2, \ldots$ For example: an image is a long vector of pixel colors. And some distance function $d(x_1, x_2)$: To be a distance func.
, pirroh): Your SUID (e.
SUNet ID (i.
Since the data is too large to upload,
An intuitive way to define “importance” of an item: the weight associated to the item, e.
, the things one customer buys on one day
Basic idea: Pick a small sample of points $S$, cluster them by any algorithm, and use the (Jure Leskovec, Stanford CS246: Mining Massive Datasets, 4/13/2021)
CS 246: Review of Proof Techniques and Probability, 01/17/20
There will be 4 homework assignments in total, which should be submitted on Gradescope as a PDF.
Consider a case: $M_{greedy} \ne M_{opt}$. Consider the set G of girls matched in $M_{opt}$ but not in $M_{greedy}$. (1) By definition of G: $|M_{opt}| \le |M_{greedy}| + |G|$. (2) Define set B of boys linked to girls in G.
A k-shingle (or k-gram) is a sequence of k tokens that appears in the document. Example: k = 2; $D_1$ = abcab. Set of 2-shingles: $C_1 = S(D_1)$ = {ab, bc, ca}. Represent a doc by a set of hash values of its
Supplement to CS 246 providing additional material on Hadoop.
@stanford prefix): I have read and will abide by the Stanford Honor Code: Signature:
For this question, it may be helpful (but not necessary) to use the CASE WHEN syntax below. Syntax: CASE WHEN condition THEN true result ELSE false result END. Example: SELECT (CASE WHEN temperature > 60 THEN "warm" ELSE "cold" END) FROM temps;
Prepare for your exam.
View CS246 is a completion requirement for:
Lecture slides will be posted here shortly before each lecture.
CS246: Mining Massive Data Sets, Winter 2020, Problem Set 1. Please read the homework submission policies at
1 Spark (25 pts). Write a Spark program that implements a simple “People You Might Know” social network friendship recommendation algorithm.
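The k-shingle example given earlier in this part (k = 2, D1 = abcab) can be reproduced with a few lines of code. This is a minimal sketch that treats each character as a token; the function names `shingles` and `hashed_shingles` are hypothetical, and using CRC32 to turn shingles into hash values is an assumption, since the text only says that a document can be represented by hash values.

```python
import zlib


def shingles(doc: str, k: int) -> set:
    """Return the set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def hashed_shingles(doc: str, k: int) -> set:
    """Represent a document by integer hashes of its k-shingles.

    Assumption: CRC32 stands in for whatever hash function is used in practice.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(doc, k)}


print(shingles("abcab", 2))  # the three shingles ab, bc, ca (set order may vary)
print(len(hashed_shingles("abcab", 2)))
```

Because the result is a set, the repeated shingle "ab" is stored only once, which matches the example {ab, bc, ca}.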
Students will learn how to implement data mining algorithms using Hadoop, how to implement and debug complex MapReduce jobs in Hadoop, and how to use some of the tools in the Hadoop ecosystem for data mining and machine learning.
Students can typeset or scan their homework (although we strongly recommend you typeset them).
Late assignments: Each student will have a total of two late periods to use for homeworks.
This schedule is subject to change.
Tentative list of topics to be covered.
Item-item collaborative filtering method: derive unknown ratings from “similar” movies. (Jure Leskovec & Mina Ghashami, Stanford CS246: Mining Massive Datasets, 1/27/22)
A general many-to-many mapping (association) between two kinds of things. But we are interested in connections among “items”, not “baskets”.
There are hidden, or latent, factors (latent dimensions) that, to a close approximation, explain why the values are as they appear in the data matrix.
Main idea: Items have profiles. Video -> [genre, director, actors, plot, release year]. News -> [set of keywords]. Recommend items to customer x similar to previous
Jan 13, 2022 · For external enquiries, personal matters, or in emergencies, you can email us at cs246-win2324-staff@lists.
Simple heuristic: greedy algorithm. Start with $A_0 = \{\}$. For $i = 1 \ldots k$: find the set $d$ that maximizes $F(A_{i-1} \cup \{d\})$; let $A_i = A_{i-1} \cup \{d\}$.
Types of queries one wants to answer on a data stream (we’ll do these on Thu): filtering a data stream (select elements with property x from the stream); counting distinct elements.
Images, Text/Speech: The modern deep learning toolbox is designed for simple sequences and grids.
First do integer encoding, then create a binary vector that represents the numerical values. Example of integer encoding on provider: Netflix -> 1, Prime Video -> 2, HBO Max -> 3, Hulu -> 4.
For each item, create an item profile. A profile is a set (vector) of features. Movies: author, title, actor, director, … Text: set of “important” words in the document.
Machine Learning. Node classification. (Jure Leskovec, Stanford CS246: Mining Massive Datasets, 5/6/2021)
For external enquiries, personal matters, or in emergencies, you can email us at cs246-win1920-staff@lists.
We highly recommend that students try one submission to this practice exam.
Shingling: Convert documents to large sets of items.
Dimensionality reduction.
Instructor: Jeff Ullman. Office: 425 Gates. Email: lastname @ gmail.
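The two-step encoding of the provider feature mentioned above (integer encoding first, then a binary vector) might look like the sketch below. It assumes that the "binary vector" means a one-hot vector with one position per provider, which the text does not state explicitly; `PROVIDER_CODES` and `one_hot` are hypothetical names.

```python
# Integer encoding taken from the example: Netflix -> 1, ..., Hulu -> 4.
PROVIDER_CODES = {"Netflix": 1, "Prime Video": 2, "HBO Max": 3, "Hulu": 4}


def one_hot(provider: str) -> list:
    """Turn a provider name into a binary vector with a single 1.

    Assumption: position (code - 1) is set to 1 and all other positions are 0.
    """
    code = PROVIDER_CODES[provider]      # step 1: integer encoding
    vector = [0] * len(PROVIDER_CODES)   # step 2: binary vector
    vector[code - 1] = 1
    return vector


print(one_hot("HBO Max"))  # [0, 0, 1, 0]
```

The one-hot form avoids implying an ordering between providers, which the raw integer codes would otherwise suggest to a distance-based method.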