heal.abstract |
Ethnopharmacology is the scientific study of ethnic groups and their use of herbal medicines. As a particular field of traditional medicine, it is now widely regarded as a promising source of complementary and alternative treatments in the Western world. However, searching for and documenting indigenous knowledge on the use of specific plant properties is a very challenging task for the experts themselves, given the volume of information shared through the ethnopharmacological literature.
Scientific research requires researchers to be able to efficiently search for documents relevant to their subjects. Such challenges can be addressed as focused Internet search problems. To support experts, we propose the use of intelligent focused search
systems, known as focused crawlers. Typically, such a system receives a few initial seed documents/URLs and optionally some keywords as input, all of which are relevant to a predefined search topic. The goal of a focused crawler is to discover and output as many relevant webpages as possible.
In the present thesis, we develop intelligent focused crawler systems that serve as supportive tools for ethnopharmacological research. We propose a two-stage Machine Learning focused crawler that follows a Researcher-Apprentice paradigm. In the first stage, we recommend the use of Active Learning (AL): the system is trained to identify relevant documents by receiving feedback from the researcher whenever needed. In the second stage, we propose the use of Reinforcement Learning (RL), regarding the focused crawler as an intelligent agent. The agent estimates how profitable it would be, in the long term, to follow the available URLs and selects the most promising ones.
In the RL framework, we model the focused crawler environment as a Markov Decision Process (MDP), considering shared representations between the states and the actions of the agent. The representation features consist of word embeddings of the publication titles, statistical features extracted from the link structure, keywords, and/or relevance predictions of the pretrained models from the first stage. Additionally, we consider cases where the AL model trained in the first stage is used as the reward function.
We evaluate two different search problems: a general one, based on initial seed documents, and a more specific one, based on initial seed documents along with keywords. We compare six different AL models, such as MarginSVM and DoubleLSTM, three different state-action shared representations (General, Keyword, and Only NLP Representation), and two RL agents: the Deep Q-Network (DQN) and the Double DQN (DDQN).
The two-stage focused crawler, with either the DQN or the DDQN agent, is more effective than baseline methods, such as random crawling and a greedy deterministic focused crawler we defined. Finally, comparing our method in the more specific setting to an estimate of real-time researcher performance, we achieve 5.14 times the efficiency and 3.31 times the effectiveness of the expert. |
en |