heal.abstract |
Ethnopharmacology is the scientific study of ethnic groups and their use of herbal medicines. As a particular field of traditional medicine, it is now widely regarded as a promising source of complementary and alternative treatments in the Western world. However, searching for and documenting indigenous knowledge on the use of specific plant properties is a very challenging task for the experts themselves, given the volume of information shared through the ethnopharmacological literature.
Scientific research requires researchers to be able to efficiently search for documents relevant to their subjects. Such challenges can be addressed as focused Internet search problems. To support experts, we propose the use of intelligent focused search
systems, known as focused crawlers. Typically, such a system receives a few initial seed documents/URLs and optionally some keywords as input, all of which are relevant to a predefined search topic. The goal of a focused crawler is to discover and output as many relevant webpages as possible.
In the present thesis, we develop intelligent focused crawler systems that serve as supportive tools for ethnopharmacological research. We propose a two-stage Machine Learning focused crawler that follows a Researcher-Apprentice paradigm. In the first stage, we recommend the use of Active Learning (AL): the system is trained to identify relevant documents by receiving feedback from the researcher whenever needed. In the second stage, we propose the use of Reinforcement Learning (RL), regarding the focused crawler as an intelligent agent. The agent estimates how profitable it would be, in the long term, to follow the available URLs and selects the most promising ones.
In the RL framework, we model the focused crawler environment as a Markov Decision Process (MDP), considering shared representations between the states and the actions of the agent. The representation features consist of word embeddings of the publication titles, statistical features extracted from the link structure, keywords, and/or relevance predictions of the pretrained models from the first stage. Additionally, we consider cases where the AL model trained in the first stage is used as the reward function.
We evaluate two different search problems: a general one, based on initial seed documents, and a more specific one, based on initial seed documents along with keywords. We compare six different AL models, such as MarginSVM and DoubleLSTM, three different state-action shared representations (General, Keyword, and Only NLP Representation), and two RL agents: the Deep Q-Network (DQN) and the Double DQN (DDQN).
The two-stage focused crawler, with either the DQN or the DDQN agent, is more effective than baseline methods, such as random crawling and a greedy deterministic focused crawler we defined. Finally, comparing our method in the more specific setting to an estimate of real-time researcher performance, we achieve 5.14 times the efficiency and 3.31 times the effectiveness of the expert. |
en |