This paper presents a comprehensive methodology to systematically collect and standardise
vacancy information from job portals. It describes the information available in Colombian job
portals, as well as the methodology (web scraping) used to automatically and rapidly collect a
massive number of online job vacancies and the challenges this involves. It also explains the
methods that can be used to homogenise variables and the challenges involved in standardising
two of the most relevant variables for the economic analysis of the labour market: skills and
occupations. The paper develops a method to automatically identify skill patterns in job vacancy
descriptions based on international skill descriptors and text mining. In addition, it applies a
novel mixed-method approach (software classifiers and machine learning algorithms) to classify
job titles into occupations. Furthermore, it addresses duplication and missing-value issues by
using predictors such as occupation, city, and experience requirements.
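
As a rough illustration of the skill-identification step, the sketch below matches normalised vacancy text against a small dictionary of skill descriptors. It is not the paper's implementation: the dictionary `SKILL_DESCRIPTORS` and the helper names `normalise` and `find_skills` are hypothetical, and a real application would presumably load descriptors from an international skill classification rather than hard-code them.

```python
import re
import unicodedata

# Hypothetical mini-dictionary: skill label -> phrases that signal it.
# Assumption for illustration only; not taken from the paper.
SKILL_DESCRIPTORS = {
    "customer service": ["customer service", "client support"],
    "spreadsheets": ["excel", "spreadsheet", "spreadsheets"],
    "teamwork": ["teamwork", "team work", "work in teams"],
}


def normalise(text):
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())


def find_skills(description, descriptors=SKILL_DESCRIPTORS):
    """Return the sorted skill labels whose phrases occur as whole words."""
    # Padding with spaces makes the substring test respect word boundaries,
    # so "excel" does not match inside "excellent".
    clean = f" {normalise(description)} "
    found = set()
    for skill, phrases in descriptors.items():
        if any(f" {normalise(phrase)} " in clean for phrase in phrases):
            found.add(skill)
    return sorted(found)


if __name__ == "__main__":
    print(find_skills("Se requiere manejo de Excel y Team-work."))
```

Exact phrase matching of this kind is transparent but misses spelling variants and synonyms, which is one reason a fuller approach would combine it with additional text-mining steps.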

JEL classification: C88, J23


  • Jeisson Arley Cárdenas

Keywords:

  • Big data
  • Machine learning
  • Occupations
  • Skills
  • Text mining
  • Web scraping


  • Project 2
  • Working papers