Current Web information systems discard most streams of user activity data. User location, devices, services and other sensors create specific information consumption profiles that online services should identify to better answer consumer needs. However, the scale of this data is too large to be archived or processed. Most of this data is only useful for a short period of time and relates to short-lived events, far shorter than the time a batch, non-distributed data mining algorithm needs to process large-scale real-time data. For example, a tourist typically stays in Lisbon for 2 to 2.5 days, which is a very short window of opportunity for recommending one of the many attractions or cultural events.
This project is a CMU|Portugal collaboration which proposes to advance big data technology in the development of new information businesses and services. Our long-term vision aims at making big data economically useful by realizing the full potential of big data analysis technologies in the design of innovative services for the end-consumer. A big data processing framework, with several cutting-edge technologies, will be released by the consortium.
Technological and Research Objectives
There are many opportunities to leverage big data to innovate services. Lisbon City Council, SAPO and Priberam, the non-academic partners in the consortium, will provide real-world consumer data: both language and behavioral data will be captured in online services and mobile apps. This data can be used to recommend a full day of tourist activities, to detect the right consumer for a given promotion, or to monitor a brand's reputation. In this context, the project will target two technological goals:
The first technological objective concerns media monitoring. We will investigate technology to track the popularity or reputation of entities on the Web. Knowing the right market value of a brand or a product is valuable information with many uses.
The second technological objective concerns context-aware recommendation. We propose to innovate in this area by investigating new ways of inferring clues from the user context and by compiling a set of items to recommend to groups of users.
To achieve these technological advancements, the team will release a framework with a unique set of characteristics to enable such services. The framework will build on the output of our four main research objectives:
Big data infrastructures: a new architectural design, capable of supporting incremental iterative computations over large-scale real-time streams of data, will allow the framework algorithms to scale.
Learning with Big Data: a novel set of scalable online learning and distributed learning algorithms with weak supervision will be fundamental to discover small trends in data and to model large-scale data.
Stream Data Filtering and Analytics: building on the previous point, we will research a scalable, distributed architecture for doing filtering, analysis, search, and inference on diverse, high-volume, real-time information streams.
Natural Language in the Social Web: groundbreaking NLP techniques, fast enough for practical use with large volumes of text, robust to domain-shift (e.g., Twitter or IMDB), and capable of performing fine-grained linguistic analysis.
The combination of these key research ingredients into a single big data processing framework will provide a unique advantage in the design of new services. This comprehensive set of objectives is well supported by experts in all areas.
This project aims at releasing a framework with a unique set of characteristics to enable such services. To build the FLARE framework, the team will focus on four fundamental research tasks:
- Task 1. Big data infrastructures: the design of current distributed computing architectures, such as MapReduce, STORM, or S4, is still far from fulfilling every data processing requirement. In contrast to existing solutions, we argue that a new architectural design, capable of supporting incremental iterative computations over large-scale real-time streams of data, is fundamental to supporting distributed processing algorithms for big data.
- Task 2. Learning with Big Data: predicting consumer behaviors or detecting trends from large volumes of data are examples of learning tasks that require scalable online learning or distributed learning algorithms. In addition, because annotated data is expensive, weakly supervised learning algorithms, capable of leveraging great amounts of raw data with minimal supervision, will be essential to achieve robust performance on real-world data.
- Task 3. Analysis of diverse, real-time information streams: the velocity and variety of data demand new live indexing and search techniques. Descriptive statistics concerning data streams must be collected and used to filter data and create sub-streams for live indexing and analysis. The variety of big data (e.g., tweets, text reviews, ratings, etc.) can be explored to infer events supported by evidence from different sources. Thus, stream analytics and filtering are key to the design of Web information systems capable of exploring search over multiple real-time data streams.
- Task 4. Natural Language in the Social Web: developing universal techniques to process user text in different Social Web services is today a major challenge. Future NLP techniques for the Social Web should be fast enough for practical use with large volumes of text, should be robust to domain-shift (e.g., Twitter or IMDB), and should support multiple languages without significant language-specific engineering efforts. Thus, an important part of this project relates to building a scalable NLP pipeline, capable of performing fine-grained linguistic analysis.
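As an illustration of the stream filtering idea in Task 3, the following minimal sketch keeps per-topic descriptive statistics that are updated one item at a time and uses them to split off a sub-stream of unusually high values. It is not the FLARE design: the names `RunningStats` and `filter_bursts`, the `(topic, value)` item shape, and the `threshold` and `warmup` parameters are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
import math


@dataclass
class RunningStats:
    """Count, mean, and variance maintained incrementally (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation of the values seen so far.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0


def filter_bursts(stream, threshold=2.0, warmup=3):
    """Yield (topic, value) items whose value is unusually high for their topic.

    Each item is compared with the statistics of earlier items on the same
    topic and only then folded into them, so the stream is read once.
    threshold and warmup are hypothetical tuning knobs.
    """
    stats = defaultdict(RunningStats)
    for topic, value in stream:
        s = stats[topic]
        if s.count >= warmup and value > s.mean + threshold * s.std:
            yield topic, value
        s.update(value)


# Hypothetical per-minute mention counts for two topics.
mentions = [
    ("fado", 10), ("fado", 12), ("fado", 11), ("fado", 40),
    ("museum", 5), ("museum", 5), ("museum", 5), ("museum", 5),
]
bursts = list(filter_bursts(mentions))
```

Because the statistics use constant memory per topic and never revisit past items, the same pattern could in principle be partitioned by topic across workers, although how the project distributes this work is not specified here.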
The technological goals will be built on top of the previous tasks:
- Task 5. Media Monitoring and Recommendation: media monitoring and recommendation will use the tools created in the previous tasks to exploit the value of large volumes of data.
The two final tasks will address evaluation, dissemination and the education activities of the project:
- Task 6. Evaluation and Assessment: evaluating and validating the proposed research will be a central part of the project. Task 6 aims at preparing the datasets and experimental test-bed that will allow the evaluation and fine-tuning of the proposed solutions. We will use both existing, well-established experimental datasets and real-time consumer data provided by the research team's industry partners.
- Task 7. Dissemination and Education Program: In terms of dissemination, the team will publish regularly and will establish technology transfer processes with the companies to realize the full potential of the produced research. The education program will support multiple activities to develop the potential of doctoral students and researchers in the area.
The comprehensiveness and ambition of the outlined goals are supported by a team of researchers with long experience in the key areas of the project. FLARE gathers experts in cloud computing, machine learning, information retrieval and filtering, and natural language processing.
- Jamie Callan (Carnegie Mellon University, Language Technologies Institute)
- Joao Magalhaes (Universidade NOVA Lisboa, NOVA-LINCS)
- Nuno Preguiça (Universidade NOVA Lisboa, NOVA-LINCS)
- Pavél Calado (Universidade de Lisboa, INESC-ID)
- Bruno Martins (Universidade de Lisboa, INESC-ID)
- Alexandre Francisco (Universidade de Lisboa, INESC-ID)
- Rodrigo Rodrigues (Universidade de Lisboa, INESC-ID)
- Mário Figueiredo (Universidade de Lisboa, IT)
- André Martins (Universidade de Lisboa, IT)
- Mariana (Priberam)
- Jorge Teixeira (Altice-Labs)