Abstract
Recent technological advances underscore the dynamic nature of the labor
market. These transformative shifts yield significant consequences for
employment prospects, resulting in an increase in job vacancy data across
platforms and languages. Aggregating such data can yield valuable
insights into labor market demands and the emergence of new skills,
and can facilitate job matching. These benefits
extend to various parties, including job platforms, recruitment agencies,
applicants, and other stakeholders within the ecosystem. However, despite
the prevalence of such insights in the private sector, we lack transparent
language technology systems and data for this domain.
The primary objective of this thesis is to investigate the use of Natural
Language Processing (NLP) technology for the extraction of relevant
information from job descriptions. We identify several general challenges
within this domain. These encompass a scarcity of available training and
evaluation data, a lack of standardized guidelines to annotate data, and a
shortage of effective methods for extracting information from job ads.
Therefore, we study the entire process end to end.
First, we frame the problem and obtain annotated data for training NLP
models. Here, our contributions encompass job description datasets, including
a de-identification dataset, and a novel active learning algorithm
designed for efficient model training. Second, we introduce several
methods for extracting information from job
advertisement data: a skill extraction approach using weak supervision,
a taxonomy-aware pre-training methodology adapting a multilingual language
model to the job market domain, and a retrieval-augmented model
leveraging multiple skill extraction datasets to enhance overall extraction
performance. Lastly, we investigate how to ground the extracted
information in a designated taxonomy.
Original language | English |
---|---|
Number of pages | 314 |
ISBN (Print) | 978-87-7949-414-5 |
ISBN (Electronic) | 978-87-7949-414-5 |
Status | Published - 2024 |