LOD 2020 Big-Data Challenge
Our sponsor, Neodata Lab, will offer a prize of €2000 to the applicant who develops the most accurate algorithm to solve the following problem.
Problem specification
The main goal is to segment/profile internet users based on the actions they perform on websites.
Users and actions
- users are identified by a specific id;
- each user can perform different kinds of actions (action types) while surfing the internet:
  - pageview (view a web page);
  - impression (view an advertisement on a web page);
  - click (click on an advertisement on a web page);
  - conversion (reach the final goal of an advertisement, e.g. buy a product on an e-commerce site);
- each action is described by a set of action attributes, e.g. timestamp, device, location, url.
Segments
- a segment is a set of users that have something in common;
- “something in common” means that they all match a given list of conditions;
- therefore a segment can be considered as a rule/business logic, that is a set of conditions;
- conditions are defined using the action attributes, e.g. “users viewing web pages with url containing the word ‘pizza’”;
- conditions can also include the number of actions (frequency), e.g. “users viewing web pages with url containing the word ‘pizza’ at least 5 times”;
- conditions can be combined with AND and OR;
- all the actions of a given time period (longevity) are used to check the conditions.
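The segment definition above can be pictured as a small rule tree. The following is a minimal sketch, not the format of the provided segment definitions: the `Action` fields, the helper names (`at_least`, `all_of`, `any_of`, `in_window`) and the use of epoch seconds for timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    # Hypothetical record layout; the real attribute set may differ.
    user_id: str
    action_type: str  # "pageview", "impression", "click" or "conversion"
    timestamp: int    # assumed to be seconds since the epoch
    url: str


# A rule takes one user's actions and says whether the user matches.
Rule = Callable[[List[Action]], bool]


def at_least(n: int, action_type: str, url_contains: str) -> Rule:
    """Frequency condition: at least n actions of a type whose url contains a word."""
    def rule(actions: List[Action]) -> bool:
        hits = sum(
            1 for a in actions
            if a.action_type == action_type and url_contains in a.url
        )
        return hits >= n
    return rule


def all_of(*rules: Rule) -> Rule:
    """AND combination of conditions."""
    return lambda actions: all(r(actions) for r in rules)


def any_of(*rules: Rule) -> Rule:
    """OR combination of conditions."""
    return lambda actions: any(r(actions) for r in rules)


def in_window(actions: List[Action], now: int, longevity_days: int) -> List[Action]:
    """Keep only the actions that fall inside the longevity period ending at `now`."""
    cutoff = now - longevity_days * 86400
    return [a for a in actions if cutoff <= a.timestamp <= now]
```

With these helpers, the example “users viewing web pages with url containing the word ‘pizza’ at least 5 times” would be written as `at_least(5, "pageview", "pizza")` and evaluated on `in_window(user_actions, now, 30)`.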
Data
We provide the following initial data:
- 1k segments;
- 100M actions made by 10M users in 30 days.
We provide the following additional data:
- 10M new actions in 3 days, to be considered as produced in real-time.
Goal
There are two goals:
- Batch: assign the users who performed the 100M actions of the initial dataset to the proper segments, if any, with the highest possible accuracy and in the shortest possible time.
- Real-time: for each action provided as additional data, update in real time the set of segments to which the user who performed that action belongs, adding or removing segments as needed and also taking into account all the actions made in the preceding 30 days.
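To illustrate the real-time goal, here is a minimal in-memory sketch of the per-action update. It is only one possible approach, not a required design: the `SegmentTracker` class, the dictionary layout of an action, and the assumption that each user's actions arrive in timestamp order are all hypothetical, and a real solution would keep this state in the cluster storage rather than in process memory.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 86400  # 30-day look-back, timestamps assumed in epoch seconds


class SegmentTracker:
    """Tracks, per user, the recent actions and the current segment membership."""

    def __init__(self, segments):
        # segments: dict mapping a segment name to a predicate over a list of actions
        self.segments = segments
        self.history = defaultdict(deque)   # user_id -> actions inside the window
        self.membership = defaultdict(set)  # user_id -> names of matching segments

    def on_action(self, user_id, action):
        """Process one action (a dict with a 'timestamp' key).

        Returns (added, removed): the segments the user has just entered or left.
        """
        window = self.history[user_id]
        window.append(action)
        # Drop actions that fell out of the look-back window.
        cutoff = action["timestamp"] - WINDOW_SECONDS
        while window and window[0]["timestamp"] < cutoff:
            window.popleft()
        # Re-evaluate every segment on the user's current window.
        current = {
            name for name, rule in self.segments.items() if rule(list(window))
        }
        previous = self.membership[user_id]
        added, removed = current - previous, previous - current
        self.membership[user_id] = current
        return added, removed
```

Re-evaluating all segments on every action is the simplest correct strategy; with 1k segments it would likely need indexing or incremental counters to keep up with the stream.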
Technical specifications
- use the standard Apache Hadoop frameworks;
- real-time streaming: Kafka;
- processing: Spark on a Hadoop cluster;
- storage: HBase and HDFS.
Notes
- A set of 1K user segment definitions with complex rules is provided, to make the problem similar to real cases.
- A data flow should be simulated by feeding the additional data into a Kafka stream, emitting each action at the time interval indicated by the action timestamps.
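The timestamp-driven replay described in the last note could be sketched as follows. This is an illustrative outline only: `replay`, its `send` callback and the `speedup` parameter are hypothetical names, timestamps are assumed to be in seconds, and the actual publishing to Kafka (which would happen inside `send`) is deliberately left out.

```python
import time


def replay(actions, send, speedup=1.0, sleep=time.sleep):
    """Emit actions in timestamp order, waiting the recorded gap between them.

    actions: iterable of dicts with a 'timestamp' key (seconds, assumed)
    send:    callback that publishes one action, e.g. to a Kafka topic
    speedup: values above 1 compress time, so 3 days need not take 3 days
    sleep:   injectable wait function, which makes the loop easy to test
    """
    ordered = sorted(actions, key=lambda a: a["timestamp"])
    previous = None
    for action in ordered:
        if previous is not None:
            gap = (action["timestamp"] - previous) / speedup
            if gap > 0:
                sleep(gap)
        send(action)
        previous = action["timestamp"]
```

Passing the wait function as a parameter keeps the pacing logic separate from the clock, so the same loop can run at full speed in tests and at recorded speed in the simulation.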
IMPORTANT DATES:
- Submission of full system and abstract: Monday May 4, 2020
- Notification of Acceptance: Monday May 25, 2020
- Challenge presentation: July 19-23, 2020 (final date to be defined)
- Contact: lod2020challenge@neodatalab.com