BME VIK - Media and Text Mining


Media and Text Mining

Name of the subject in Hungarian: Média- és szövegbányászat

Last updated: February 2, 2024

Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics

Course ID	Semester	Assessment	Credit	Subject semester
VITMM277	2,3	3/0/1/v	6

3. Course coordinator and department Dr. Gábor Szűcs,

4. Instructors Dr. Gábor Szűcs, associate professor, TMIT

5. Required knowledge basic mathematical knowledge, probability theory

6. Pre-requisites

Recommended: Data mining techniques

7. Objectives, learning outcomes and obtained knowledge The objective of the course is to introduce students to content and information search services, from text processing to media streams. Students will learn text and media search techniques and deep-learning-based methods for media and text analysis, and will be able to make informed decisions when designing corporate search systems and media content management systems.

8. Synopsis

Lectures:

- Emerging problems in the field of economic analysis at multinational companies. Typical task types in media and text mining.

- Media and text analysis methods, search techniques, indexing, ranking procedures. Bag of words model. Information retrieval models: Boolean model and Vector model. Weighting schemes (tf-idf), cosine similarity. Query optimization. Web search, web mining.

- Text preprocessing steps. Tokenization, stemming algorithms, Porter, Lovins stemmers. Shallow and deep parsing. POS tagging. Syntax trees and dependency graphs. Stanford tools.

- Language detection, language dependence, Zipf's law. NLP (Natural Language Processing) tools.

- Named entity recognition, relation extraction from text. Typical approaches to relation extraction: co-occurrence, pattern matching methods, supervised machine learning methods. Opinion mining as a modern tool of market research.

- Use of deep neural networks in text analysis (LSTM - Long Short-Term Memory) and image and video content analysis (CNN - Convolutional Neural Network).

- Media classification for images and videos. Preprocessing steps. Types and methods of media classification. CBIR (Content-Based Image Retrieval), simple image processing procedures.

- Connecting image and text modalities. Deep learning methods and systems. Generative Adversarial Network (GAN).

- Reduction of the problem space of text corpora and media datasets, feature extraction and feature selection techniques.

- Text classification. Types and methods of text classification. Naïve Bayes classifier. Rocchio algorithm. Automatic text processing (text generation with deep learning).

- Chatbots and virtual assistants used in companies.

- Cost-effective classification. Active learning. Ensemble classifiers. Clustering media and text datasets.

- Single-label and multi-label text classification. Change tracking in classification tasks. Concept drift.

- Media recommendation systems.

The programming language used in the labs is Python with the appropriate libraries.

Laboratories:

- Calculation of a weighting scheme (tf-idf) for a text corpus.

- Text preprocessing, indexing, stemming.

- Sentiment analysis.

- Digit recognition task with the Keras library.

- Application of deep learning methods.

- Effective classification (on text corpora, media datasets).

- Text mining application in the economic field.
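
As an illustration of the first laboratory topic, the following is a minimal sketch of tf-idf weighting and cosine similarity in plain Python. It is not the official lab material. The toy corpus, the whitespace tokenizer and the function names are assumptions made for this example, and the weighting variant used here (tf as relative term frequency, idf as log(N / df)) is only one of several common definitions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumption: tf = count / document length, idf = log(N / df).
    Smoothed or log-scaled variants are equally common.
    """
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus, for illustration only.
corpus = [
    "text mining finds patterns in text",
    "media mining finds patterns in images",
    "chatbots answer questions",
]
vectors = tf_idf(corpus)
print(cosine_similarity(vectors[0], vectors[1]))  # documents share several terms
print(cosine_similarity(vectors[0], vectors[2]))  # documents share no terms
```

Storing each document as a sparse dictionary keeps the sketch short. In practice a library implementation would normally be used instead, and its default weighting may differ from the variant assumed here.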

9. Method of instruction The laboratory exercises are held in blocks every two weeks; the remaining hours are lectures.

10. Assessment

a. During the teaching period: a midterm test.

b. During the exam period: completing the written solution to the homework (started during the teaching period) and defending it orally in the exam.

c. The condition for the signature is to achieve at least a sufficient level on the midterm test (including repeated midterm tests; see the next point). The midterm test or repeated midterm test is successful if the student has reached at least 40% of the maximum score.

d. At least five of the laboratory exercises must be successfully completed to obtain a signature.

The exam consists of 2 parts: an oral examination on the entire course material of the semester and a defense of the previously submitted written homework. Each part counts for 50% of the final mark.

11. Retakes We provide an opportunity to retake the midterm test during the teaching period. Students who fail both the midterm test and the repeated midterm test have 1 opportunity to take another (final) repeated midterm test. It is not possible to make up the laboratory exercises. The condition for the signature is to pass one of the midterm tests (first, repeated or final) at a sufficient level or better and to complete at least 5 laboratory exercises successfully.

12. Consultations At pre-arranged times with the lecturer.

13. References, textbooks and resources

Blanken, de Vries, Blok, Feng (eds.): Multimedia Retrieval. Springer, 2007.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.

Ronen Feldman, James Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.

14. Required learning hours and assignment

Contact hours	56
Preparation for lectures	18
Preparation for midterm test	20
Preparation for homework	46
Learning written course material	0
Preparation for exam	40
Sum	180

15. Syllabus prepared by Dr. Gábor Szűcs, associate professor, TMIT
