Big Data Analytics
Timeline
- January 25, 2024: Experience start
- February 23, 2024: Project summary
  The project summary will be a 400-word abstract available as a Markdown (.md) document in a GitHub repository. The summary will report on project definition and model design. It will describe the dataset and its main characteristics (number and type of features), the research questions to be addressed in the project, the class of models to be applied to the dataset, and the algorithms that will be used. At least two algorithms must be used and compared.
- April 4, 2024: Project data model
  The project data model will be delivered as a Jupyter notebook containing code and explanations to implement data preparation, model training and (optional) model evaluation. The project data model will be evaluated during the project clinics.
- April 11, 2024: Project presentation
  The project presentation will be delivered during the last week of the course as a 6-10 minute presentation putting special emphasis on model evaluation and summarizing the other project milestones.
- April 11, 2024: Experience end
Categories
Machine learning, Artificial intelligence, Data analysis, Data modelling, Data science
Skills
python, data analytics, research
A team of 3-5 students will implement a data-science project using one of the Big Data technologies Apache Spark, Dask or scikit-learn.
- Project summary: The project summary will be a 400-word abstract available as a Markdown (.md) document in a public or private GitHub repository. The summary will report on project definition and model design. It will describe the dataset used in the project and its main characteristics (number and type of features), the research questions to be addressed in the project, the class of models to be applied to the dataset, and the algorithms that will be used. At least two algorithms must be used and compared.
- Project data model: The project data model will be delivered as a Jupyter notebook containing code and explanations to implement data preparation, model training and preliminary model evaluation.
- Final project presentation: The final project presentation will go through the final Jupyter notebook implemented for the project, putting special emphasis on model evaluation and summarizing the other project milestones.
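The requirement to use and compare at least two algorithms can be illustrated with a minimal sketch. It assumes a classification task and scikit-learn. The synthetic dataset, the two chosen classifiers, the accuracy metric, and the helper name `compare_algorithms` are all illustrative assumptions, not part of the course material.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_algorithms(X, y, cv=5):
    """Return the mean cross-validated accuracy of two candidate classifiers.

    Hypothetical helper: the choice of models and metric is an assumption.
    """
    candidates = {
        # Logistic regression benefits from scaled features, hence the pipeline.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for the partner-provided dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
scores = compare_algorithms(X, y)
for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```

Evaluating both models on the same cross-validation folds keeps the comparison fair. A real project would pick models and metrics suited to its own research questions.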
Project Examples
In this assignment, students will work on a dataset to answer specific exploratory questions by applying one or more techniques seen in class: supervised learning, recommender systems, unsupervised clustering, frequent itemset mining, data stream analytics, graph analysis, and similarity search. Students will implement the project in Python, using Jupyter notebooks and one of the following data analytics libraries: Apache Spark, Dask or scikit-learn.
As a participating organization, you’ll be asked to provide a particular dataset and a first set of related questions to be answered by the team using the dataset.
The expected project milestones are as follows:
- Project definition: students will summarize the project, including (1) the dataset of interest and (2) the set of exploratory questions to be answered with the dataset, using techniques studied in class.
- Model design: students will choose a class of models from the following: supervised learning, recommender systems, unsupervised clustering, frequent itemset mining, data stream analytics, graph analysis, similarity search. They will outline how the data model could be applied to the dataset to answer the exploratory question(s). They will research algorithms and techniques to implement this class of model.
- Data preparation: students will inspect the dataset, identify missing data, outliers, data types (categorical data in particular), and write Apache Spark or Dask programs to correct for potential issues.
- Model implementation: students will implement the model with Apache Spark, Dask or scikit-learn.
- Model evaluation: students will identify evaluation metrics for the model, implement them, and discuss the results.
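The data preparation milestone above can be sketched as follows. This is a minimal illustration written with pandas for brevity; a Dask or Apache Spark program would follow the same steps with that library's own API. The toy columns (`age`, `region`), the median imputation, the 1.5 × IQR clipping rule, and the helper name `prepare` are assumptions chosen for the example, not requirements of the project.

```python
import numpy as np
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, clip numeric outliers, one-hot encode categoricals.

    Hypothetical helper: the imputation and outlier rules are assumptions.
    """
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(exclude="number").columns

    for col in numeric_cols:
        # Fill missing numeric values with the column median.
        df[col] = df[col].fillna(df[col].median())
        # Clip values lying more than 1.5 * IQR beyond the quartiles.
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    for col in categorical_cols:
        # Treat missing categories as their own explicit level.
        df[col] = df[col].fillna("missing")

    # Expand each categorical column into one indicator column per level.
    return pd.get_dummies(df, columns=list(categorical_cols))


# Toy data with one missing value per column and one obvious outlier (assumption).
raw = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41, 300],
        "region": ["north", None, "south", "north", "south"],
    }
)
clean = prepare(raw)
print(clean)
```

Clipping rather than dropping outliers keeps the row count unchanged, which may or may not suit a given dataset; the right policy depends on what the outliers represent.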
Companies must answer the following questions to submit a match request to this experience:
Does the project include a dataset that the students will be able to access and analyze?