Examples of courses from the first year
In this course you will learn how to use modern statistical software to analyze data and get useful information. Statistics can be divided roughly into “descriptive statistics” and “inferential statistics”.
Descriptive statistics summarizes and visualizes the observed data. It is usually not very difficult, but it forms an essential part of reporting (scientific) results. Inferential statistics tries to draw conclusions from the data that would hold true for part or all of the population from which the data were collected.
For instance, one group of patients may receive a control treatment and another group of patients may receive a new treatment.
One specific question is whether the new treatment is better than the control treatment.
Another question is whether the two treatments are (clinically) equivalent. The benefit of the new treatment may lie not in the clinical outcome but somewhere else (e.g. it may be much cheaper or less invasive than the control treatment). In that case we would like the new treatment to be no worse than the control treatment.
These questions are mostly translated to statistical quantities like population means, medians, or proportions.
A different example is when a production process is being replaced or enhanced. The improved process should typically show more consistent quality in the products it produces. In such cases, the theory of hypothesis testing helps in reaching correct decisions.
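The treatment comparison described above can be sketched as a small hypothesis test. The example below uses a one-sided permutation test, which is only one of several possible tests and is not necessarily the method taught in the course. The function name `permutation_test` and the outcome scores are hypothetical and chosen purely for illustration.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=5000, seed=0):
    """One-sided permutation test: is the treatment mean larger than the control mean?

    Returns the observed difference in means and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so we reshuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_extreme += 1

    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical outcome scores (higher is better); not real trial data.
control = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2, 4.7, 5.3]
treatment = [5.9, 6.1, 5.6, 6.3, 5.8, 6.0, 5.7, 6.2]

observed_diff, p_value = permutation_test(control, treatment)
print(f"difference in means: {observed_diff:.4f}, estimated p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would be unlikely if both treatments were equally effective. Testing for equivalence or for a change in consistency (variance) requires a different test statistic.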
The course includes an assignment in which you have to analyze data from a movie about a national disaster. As a data scientist, you have to draw the correct conclusions so that the right decision can be taken to save many lives!
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data.
A concrete example of a data mining problem is building a spam filter. Such a filter automatically detects whether an email message is spam. Some mail software builds a spam filter by asking the user to indicate, for each email message in the Inbox, whether they consider it to be spam or not. Based on a data mining algorithm, the mail software will then automatically flag suspect mail messages as spam in the future.
Such a procedure is known as supervised learning.
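As an illustration of supervised learning, the sketch below trains a very small naive Bayes classifier on messages that a user has already labeled. Naive Bayes is an assumption made for this example; the text does not say which algorithm mail software actually uses. The function names `train` and `classify` and the example messages are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """Count words per label. `messages` is a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_messages)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical messages labeled by the user ("ham" means not spam).
labeled_messages = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("cheap pills free offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
    ("please review the project report", "ham"),
]

model = train(labeled_messages)
print(classify("free prize offer", model))
print(classify("project meeting monday", model))
```

The labels supplied by the user are what makes this supervised: the algorithm learns from examples for which the correct answer is already known.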
Another well-known data mining problem occurs in supermarkets. Customers receive a bonus card from supermarkets because such cards give valuable information on buying behavior (which is much more valuable than the discounts that customers receive through these bonus cards). Based on the information from the bonus cards, supermarkets know how much they have to order for products, but they can also cluster customers into groups with similar buying behavior so that supermarkets can make personalized offers or change the layout of the supermarket to increase their sales.
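Grouping customers with similar buying behavior can be sketched with a clustering algorithm. The example below uses k-means, which is one common choice and an assumption here, since the text does not name a specific method. The function `k_means`, the farthest-first initialization, and the spending figures are hypothetical.

```python
import math


def k_means(points, k, n_iterations=20):
    """Cluster `points` (tuples of numbers) into `k` groups.

    Returns the cluster label of each point and the final centroids.
    """
    # Farthest-first initialization: start with the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    labels = []
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]

        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place

        if new_centroids == centroids:
            break  # converged: the assignments no longer change the centroids
        centroids = new_centroids
    return labels, centroids


# Hypothetical weekly spending per customer: (fresh produce, ready meals).
customers = [(10, 80), (12, 75), (8, 85), (70, 15), (75, 10), (68, 20)]

labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

Unlike the spam filter, no labels are supplied here. The groups emerge from the data alone, which is why clustering is usually described as unsupervised learning.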
In this course you will learn about the data mining algorithms and techniques that are used in real-life examples like the two above. You will work in assignments on real-life data using the well-known Weka software, so that you not only learn how to apply the appropriate algorithms, but also learn important data science skills like cleaning, pre-processing and visualization.
This course has been developed specifically for this bachelor program, putting into practice the skills and methods learnt during the courses Data Mining, Data Statistics and Data Science Research Methods.
The objective of the Data Challenge courses is to teach you how to perform large-scale data-driven analyses yourself. Real-life data sets from cooperating companies and organizations will be used in this course. The data challenges will become more and more challenging as the course advances.
An important element in this course is handling large datasets stored in various formats (files, relational databases, object databases, etc.), pre-processing the data and storing the analysis results in a suitable data format.
The aim of the first data challenge is deceptively simple: you will have to answer a number of questions from a “client” using an existing dataset. We will be trying to make this as “real” as possible: there is real data, there is a real client (represented by two of their employees, each with different backgrounds and aims), and you will really have to convince them. This also means that you will have to solve real problems: how do you deal with the large dataset? How do you know the data is valid? How was it collected? What is the actual aim of the client?
After taking the course, you will be able to:
• Independently apply and follow established data science research methods for a given problem and dataset.
• Access, process, and reason about large, complex datasets provided in various data formats.
• Independently find and familiarize yourself with programming languages, libraries, programs and software.
The data in data science is used by humans (e.g., analysts, economists, scientists) and often comes from humans (e.g., emails, videos, clicking behavior).
What are the thought processes behind these data?
How do we interpret data and how do we make decisions?
Are cognitive processes similar to computational processes?
In this course we will look at the human brain, mind, and behavior and ask how these can be modeled computationally. What does it take to build machines that think? How do the data science techniques used to extract information relate to the way humans do this? You will learn about artificial intelligence, the human brain and computational neural networks, problem solving, reasoning, expertise, and creativity in light of information processing.
Concrete examples of such data are the data used by recommender systems in online web shops. If you buy online, you will get personalized messages like “Other customers also bought …” or “You may also be interested in …”.
Such recommendations come from analyzing data from customers, combined with insights from cognitive science, so that recommendations are presented in the way you are most likely to respond to positively.
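The data analysis behind an “also bought” message can be sketched by counting which items appear together in the same order. This is a deliberately simple co-occurrence approach and an assumption for illustration; real recommender systems are typically more sophisticated, and the sketch ignores the cognitive science side of how recommendations are presented. The function names and the orders are hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_co_purchase_counts(orders):
    """For each item, count how often every other item was bought with it."""
    counts = {}
    for order in orders:
        # set() ensures an item bought twice in one order is counted once.
        for item, other in permutations(set(order), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def also_bought(item, counts, top_n=2):
    """Return up to `top_n` items most often bought together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]


# Hypothetical order history of a web shop.
orders = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["laptop", "usb hub"],
    ["mouse", "mouse pad"],
    ["laptop", "mouse", "usb hub"],
]

counts = build_co_purchase_counts(orders)
print("Other customers also bought:", also_bought("laptop", counts))
```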
During this course you will be asked to think critically about the topics being discussed. You will finalize the course by writing an individual research paper in which you analyze a dataset of your choice in relation to one of the topics in the course.
What makes our joint Bachelor unique is that we combine the technical expertise needed to handle big data with perspectives from Law, Ethics, Economics, Humanities and the social sciences. To become a true all-round data scientist, a multifaceted understanding of ethics and law is crucial.
A data scientist trained today can expect to work not only in the private sector but also potentially with government and NGOs, and to be involved in applications of data science from business analytics to humanitarian emergencies. Data scientists are involved in journalism and public policy, they help create smart urban environments, and they help to solve problems in fields as diverse as cancer research and space travel. Even when they work exclusively for private companies, data scientists’ work has far-reaching implications for society and for our collective wellbeing.
The relation between ethics and data science
For these reasons, understanding how ethics relates to data science is important not only in order to make informed choices about what data and methods to use, but also in order to build successful solutions to real-world problems. For example, data scientists have accidentally produced crime prediction applications that are heavily biased against ethnic minorities, have become involved in mass surveillance in ways that pose risks to democratic processes, and have developed applications such as facial recognition and biometric systems that tend to discriminate against the poorest and most vulnerable. Just as medical students study ethics as an important element of their training, data scientists also need to understand the impacts of their work on individuals and society.
Google as an example
To take one commercial example, Google’s search service has recently been found to prioritize what has been termed ‘fake news’ – fictional, misleading or biased search results that impact users’ ability to make informed choices. Reports show that it is possible to game the search algorithm so that it shows biased results: for example, a search for ‘did the Holocaust happen?’ was recently found to produce a page of Holocaust denial sites managed by the political far Right. In a different machine-learning problem, a flaw in Google’s Photos app led to it tagging black people as gorillas.
Big data reflect the biases and prejudice of the crowd
In order to do good work in data science, it is essential to have an understanding of the ways in which that work can go wrong. Solutions that rely on big data – the inputs of the crowd – also tend to reflect the biases and prejudice of that crowd, and unless social and ethical understanding supports scientific knowledge, data science may end up reinforcing inequality and unfairness.
The Data Science Ethics course
In the Data Science Ethics course, we investigate which ethical frameworks should guide data scientists as they aspire to produce innovative, profitable, and useful applications. We will ask what responsible data science consists of, using case studies from current practice that raise issues around values. The course has an emphasis on multidisciplinarity, responsibility to stakeholders and creating social value.
Issues covered may include:
- The implications of using personal data from individuals as a startup or an established company;
- The rights and wrongs of using hacked or leaked data;
- De-identification techniques and decision making frameworks for decreasing the risks of particular applications;
- Ethical and legal challenges related to ‘living laboratories’ in smart urban environments, and their implications at the individual and group level;
- The real-life difficulties of regulating data use in line with societal values.