Genisa - An Advanced Information Retrieval System for International Student Recruitment

John Feng · August 1, 2023

Project Overview


Our client, a student recruitment agency, provides a critical service by helping international students apply to overseas colleges and universities. However, they are currently facing a significant challenge - information overload. With the multitude of colleges, each with its own unique set of admission requirements, programs, courses, and schedules, the task of sorting and managing this information is overwhelming.

The core of Genisa lies in its innovative use of web scraping and large language models (LLMs). The system scrapes information from various college websites, gathering a massive pool of data on programs, courses, admission requirements, and important dates. This raw data is then processed using advanced LLMs, which are prompted to understand, categorize, and summarize the collected data.

The ultimate goal is to generate a reliable and user-friendly information retrieval system. Upon receiving a query from the user, Genisa retrieves the most relevant data and generates a concise, comprehensive answer, all in real-time.

The project is currently in its prototype phase, and an early version of the application can be viewed at https://genisa.vercel.app/. We are continually improving and refining the functionalities based on user feedback and the latest AI advancements.

In conclusion, Genisa is designed to revolutionize the student recruitment agency’s workflow by minimizing information overload and providing quick, accurate responses to any inquiries. The end result will be a more efficient process for the agency and a better experience for the students they serve.

Scraping College Websites

Tools: Python, Scrapy
I recommend watching some free tutorials on Scrapy

I chose to use Scrapy because it is the most complete scraping tool, with both web crawling and web page scraping capabilities. Beautiful Soup is much simpler to use and can definitely serve the same purposes.
This section gives a high-level description of the scraping process for the information retrieval chatbot application. Some Python and HTML knowledge is required to make use of it.

URL crawling

The first step in scraping all the information for a particular college is to collect all the webpages on the college website. This just requires crawling every subdomain and subdirectory of a base domain. For example, saultcollege.ca is the base domain and saultcollege.ca/admissions is a subdirectory. A list of all URLs under the saultcollege.ca domain is collected in this CSV file:
urls_saultcollege_parsed
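
For reference, below is a minimal sketch of the kind of Scrapy crawl spider used for this step. The spider name, start URL, and output file are illustrative assumptions rather than the exact project code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CollegeUrlSpider(CrawlSpider):
    # Illustrative spider: collects every URL under the saultcollege.ca domain
    name = "saultcollege_urls"
    allowed_domains = ["saultcollege.ca"]
    start_urls = ["https://www.saultcollege.ca/"]

    # Follow every internal link and record the URL of each page visited
    rules = (
        Rule(LinkExtractor(allow_domains=["saultcollege.ca"]), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}

Running it with scrapy runspider college_url_spider.py -o urls_saultcollege_parsed.csv writes every discovered URL to a CSV like the one above.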

Selecting relevant URLs

The choice of URL exclusion will be different for each college domain, since each college structures its website content differently. It is up to the discretion of the developer to decide what is relevant and what is not. Although this step is not absolutely necessary, it improves the accuracy of the information retrieval process and introduces less "junk" data, which uses less memory and makes the application more responsive.

For example, in the case of saultcollege.ca, notice that there are ~3000 unique URLs from that domain. Do all of them contain relevant information? The vast majority of these ~3000 URLs have the subdirectory "course", and each course webpage has very brief information about the course, without connecting it to any programs or similar courses. Based on my judgement, international students applying abroad rarely care about specific courses, so I excluded all the course webpages. There is also a subdomain, https://training.saultcollege.ca/, which offers a variety of courses outside of degree and diploma programs. This is also not relevant to international students applying abroad and can therefore be excluded.
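
As a rough illustration of this exclusion step, the crawled URL list can be filtered with a few regular-expression patterns. The two patterns below reflect the exclusions discussed above; the file names and column name are assumptions.

import pandas as pd

# Hypothetical file and column names for illustration
urls = pd.read_csv("urls_saultcollege_parsed.csv")

# Drop individual course pages and the continuing-education subdomain,
# since they are rarely relevant to international applicants
exclude_patterns = [
    r"saultcollege\.ca/course",                 # individual course pages
    r"^https?://training\.saultcollege\.ca",    # continuing education subdomain
]
mask = urls["url"].str.contains("|".join(exclude_patterns), regex=True, na=False)
relevant_urls = urls[~mask]
relevant_urls.to_csv("urls_saultcollege_filtered.csv", index=False)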

Scraping relevant information

It helps to have some basic knowledge of html and xpath to understand this next section. Here are some links I recommend for learning:
https://www.w3schools.com/xml/xpath_intro.asp
https://www.w3schools.com/html/default.asp


Example of webpage layout. Ideally we only want to scrape the main content.

If we scrape all the text content of a webpage, we find that repetitive and non-relevant text is captured. Each webpage is built from a template that is most likely shared by all webpages on the same domain. We want to extract the main content and ignore the header, navigation menu, and footer, as they don't provide any useful information. The main content can be found by right-click → Inspect on the webpage, then digging through the HTML code. Each website will have a different HTML tag that wraps the main body. Typically it sits just under the header and contains the keyword "main". At the moment this process is done manually; it is difficult to automate because every website uses a different HTML tag for the body.

Notice <main class="l-main"> highlights the body, but excludes the navigation bar and everything above it.


For the Cambrian College website, the body xpath is <main class="main"…>.

An optional but sometimes helpful field to scrape is the header of each webpage. Most of the time, the main body content is enough for accurate similarity search, but when the user is searching for very detailed information on a certain topic, adding the header provides the context needed to retrieve the most relevant webpage.
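
Below is a minimal sketch of the page-level scraping, assuming the main-content XPath has already been identified by hand for the domain. The spider name, the use of the page title as the header, and the output field names are illustrative assumptions.

import scrapy

class CollegePageSpider(scrapy.Spider):
    # Illustrative spider: scrapes the header and main body text of each relevant page
    name = "saultcollege_pages"

    def start_requests(self):
        # In practice the filtered URL list from the previous step is read here
        for url in ["https://www.saultcollege.ca/admissions"]:
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Page title, used as the optional header field
        header = response.xpath("//title/text()").get(default="").strip()
        # Main content only; the tag and class name differ per college website
        body_text = " ".join(
            response.xpath("//main[contains(@class, 'main')]//text()").getall()
        )
        yield {"url": response.url, "header": header, "raw_text": body_text}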

Text cleaning

In my cleaning process, I first remove all metacharacters (\r \t \n), extra white spaces, and commas. Then I look for common string patterns that are not relevant to the main content of the webpage and strings that are repeated on every webpage. For example, the "Share on Share on Facebook Share on Twitter" string is scraped from a few buttons at the top of the webpage that link to social media, and is present on many webpages. These string patterns are found by sampling the raw scraped text and spotting them by eye. The string patterns also differ for each college website. To make the process more scalable, some kind of algorithm could be developed to automate this step, as it is quite tedious to do manually.

Example of the stages of text cleaning

Similar to the section on excluding irrelevant URLs, text cleaning is not absolutely mandatory. However, in my experience, it improves the accuracy of information retrieval by eliminating "junk" text, since every word or sentence contributes to the relevancy of the similarity search.
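
A minimal sketch of the cleaning described above is shown below; the list of junk patterns is per-college, and the single entry here is just the example mentioned in this post.

import re

# Repeated template strings found by sampling the raw scraped text by eye;
# these differ for each college website
JUNK_PATTERNS = [
    "Share on Share on Facebook Share on Twitter",
]

def clean_text(raw: str) -> str:
    text = raw
    for pattern in JUNK_PATTERNS:
        text = text.replace(pattern, " ")
    # Remove metacharacters and commas, then collapse extra whitespace
    text = re.sub(r"[\r\t\n,]", " ", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()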

Creating The Vector DB

Ingesting College information

From a folder containing CSVs of the scraped webpage content for each college, we create a large Pandas DataFrame which should include the columns header, url and cleaned_text. We also create another column, college_name, based on the file name of the CSV. Therefore, each CSV should follow the file name convention {college_name}_webpages.csv.
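
A minimal sketch of this ingestion step, assuming a hypothetical scraped_webpages/ folder and the column and file-name conventions above:

import glob
import os
import pandas as pd

frames = []
# Hypothetical folder of cleaned, scraped CSVs, one per college
for path in glob.glob("scraped_webpages/*_webpages.csv"):
    df = pd.read_csv(path)  # expects the columns header, url, cleaned_text
    # Derive college_name from the {college_name}_webpages.csv convention
    df["college_name"] = os.path.basename(path).replace("_webpages.csv", "")
    frames.append(df)

all_webpages = pd.concat(frames, ignore_index=True)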

Data Preparation

We use a chunk size of 400 characters and split the text on the delimiters present in the cleaned scraped text. These delimiters include double spaces, newlines and periods.

We add a new column to the data frame “chunked_texts” which contains all the text in the webpage split into chunks of 400 characters.

Next, we create a new dataframe that excludes all unnecessary columns and separates each chunk of text into its own row.
We save this dataframe into a CSV located at scraped_webpage_dataframe/flattened_scraped_chunks.csv, for use when concatenating adjacent chunks in the retrieval step. Note: this should become its own table in a SQL database when scaling to hundreds of colleges.
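
Below is a rough sketch of the chunking and flattening steps, continuing from the ingestion sketch above. The simple character-based splitter and the chunk column name are assumptions; a library text splitter could be used instead.

import re
import pandas as pd

CHUNK_SIZE = 400  # characters

def chunk_text(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    # Split on periods, newlines and double spaces, then re-pack into ~400-character chunks
    pieces = [p.strip() for p in re.split(r"\.\s|\n|\s{2}", str(text)) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 2 > chunk_size:
            chunks.append(current.strip())
            current = ""
        current += piece + ". "
    if current.strip():
        chunks.append(current.strip())
    return chunks

# all_webpages is the dataframe built in the ingestion sketch above
all_webpages["chunked_texts"] = all_webpages["cleaned_text"].apply(chunk_text)

# One chunk per row, keeping only the columns needed downstream
flattened = (
    all_webpages[["college_name", "header", "url", "chunked_texts"]]
    .explode("chunked_texts")
    .rename(columns={"chunked_texts": "chunk"})
    .reset_index(drop=True)
)
flattened.to_csv("scraped_webpage_dataframe/flattened_scraped_chunks.csv", index=False)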

Storing into Pinecone DB

Embeddings are numerical representations of text derived from the internal state of a language model as it processes its input. Each embedding is a 1536-dimensional vector produced by OpenAI's text-embedding-ada-002. Learn more here.

Each chunk when embedded lies somewhere in the vector space. This means that chunks that share similar meaning have vectors that are closer together.

By keeping the chunk size small, limited to 1 or 2 ideas, we are able to retrieve the webpages that are most relevant to the main keywords derived from the user’s question.

We embed the chunked text along with the webpage's header and college name as one combined string. We drop the URL from this string, as it contains many characters and the page is already described sufficiently by the college name and header.

We use Pinecone as our vector database because it is a cloud option that can scale to hundreds of colleges easily. By comparison, a local option such as FAISS exceeded 65 MB with just two colleges and would have required a more complex AWS S3 solution to integrate.

Creating a Pinecone DB is very simple and only requires the PINECONE_API_KEY and PINECONE_ENVIRONMENT found when creating an account following this guide. We use the name “colleges” for our index name. Note: When scaling to hundreds of colleges, the free tier will not be sufficient and an upgrade to a paid tier will be necessary.

Screen on app.pinecone.io to retrieve the relevant environment variables

Next, create an index from the left side panel called “colleges”.

Screen to create the colleges index

Before calling /initialize_vector_db, ensure the colour next to the index name is a green circle, not orange.

Screen showing status of the index
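
For reference, here is a minimal sketch of embedding the chunks and upserting them into the "colleges" index. It assumes the pre-1.0 openai and pinecone-client Python SDKs that were current at the time of writing; the batch size and metadata fields are illustrative.

import os
import openai
import pinecone
import pandas as pd

openai.api_key = os.environ["OPENAI_API_KEY"]
pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])
index = pinecone.Index("colleges")

chunks = pd.read_csv("scraped_webpage_dataframe/flattened_scraped_chunks.csv")

BATCH = 100
for start in range(0, len(chunks), BATCH):
    batch = chunks.iloc[start:start + BATCH]
    # Embed college name + header + chunk as one combined string (the url is dropped)
    texts = (batch["college_name"] + " " + batch["header"].fillna("") + " " + batch["chunk"]).tolist()
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    vectors = [
        (str(i), record["embedding"], {"url": row.url, "college_name": row.college_name})
        for i, (record, row) in enumerate(zip(response["data"], batch.itertuples()), start=start)
    ]
    index.upsert(vectors=vectors)

The vector ids here are the row indices of the flattened chunk dataframe, which is what lets the retrieval step look up adjacent chunks later.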

LLM Pipeline


LLM pipeline for an effective retrieval Q&A system. Conversation memory is updated every round of Q&A and feeds into the search query and the final inference.

Converting question to query

We add an intermediary LLM processing step on the user's question to improve the vector db retrieval process, especially when the user is referring to something from earlier in the conversation. Essentially, this converts the user's question into a few keywords that are likely to appear on the colleges' websites.


Prompt template for converting question to search query.

In order to formulate a pertinent query, we inject the conversation history into the prompt, so the query will contain context previously mentioned.

Here is an example:
USER QUESTION: Can you provide information about the student support services offered at Cambrian College, such as counseling, academic advising, and career services?
SEARCH QUERY: “Student support services” Cambrian College counseling academic advising career services

Next time /inference is called …
USER QUESTION: What international programs does this college offer?
SEARCH QUERY: Cambrian College international programs
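
The exact prompt wording lives in the template shown in the figure above; the sketch below is only an approximation of that conversion step, and the prompt text and model settings are assumptions.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

QUERY_PROMPT = """Given the conversation history and the user's latest question, \
write a short search query of keywords likely to appear on college websites. \
Resolve references such as "this college" using the conversation history.

Conversation history:
{conversation}

User question: {question}
Search query:"""

def question_to_query(question: str, conversation: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": QUERY_PROMPT.format(conversation=conversation, question=question),
        }],
    )
    return response["choices"][0]["message"]["content"].strip()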

Retrieve Webpage Content From VectorDB


The figure above depicts the LLM information retrieval pipeline. The search query previously generated is first embedded as a vector (using the same embedding model as used when creating the vector db). Then, a similarity score is calculated for each vector in the vectordb (each vector representing a chunk of scraped text). Next, we return the top n chunks in the vectordb, and feed those text chunks into the LLM to generate an answer for the user’s most recent question. The methodology for our choice of n is discussed in the next section.
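
A minimal sketch of this retrieval step, again assuming the pre-1.0 openai and pinecone-client SDKs:

import os
import openai
import pinecone

openai.api_key = os.environ["OPENAI_API_KEY"]
pinecone.init(api_key=os.environ["PINECONE_API_KEY"],
              environment=os.environ["PINECONE_ENVIRONMENT"])
index = pinecone.Index("colleges")

def retrieve_chunks(search_query: str, top_n: int = 3):
    # Embed the query with the same model used to build the vector db
    embedding = openai.Embedding.create(
        model="text-embedding-ada-002", input=[search_query]
    )["data"][0]["embedding"]
    # Return the ids, similarity scores and metadata of the top n chunks
    return index.query(vector=embedding, top_k=top_n, include_metadata=True)["matches"]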

However, just using these chunks as context would not provide enough information to generate a sufficient answer. We therefore concatenate the scraped text adjacent to each retrieved chunk on its webpage. This works better than providing the webpage's entire scraped text as context, since some webpages are very long and would exceed the language model's token limit.

Concatenating Adjacent Chunks

We apply thresholding on the top n chunks, to determine if the answer to the user’s question is simply conversational or requires specific information from the database of colleges. If the similarity score of a retrieved chunk is sufficiently low, we do not confuse the language model with any webpage context and allow it to answer the user’s question using its pre-trained conversational intelligence.

We set n to 3, since our chunk size is small enough that the search query regularly matches the information on the webpage very closely. As a result, the relevant webpage is typically returned in the top 1 or 2 results.


Algorithm describing adjacent chunk concatenation.

Our algorithm for concatenating the adjacent chunks is based on the token limit. Of the 3 returned results, we allocate 6000, 4000 and 2000 characters to each webpage respectively, with the webpage with the highest similarity score getting the most characters. This distribution biases the top results to carry the most content. The exact numbers are derived from the formula 3x + 2x + x = max characters, where max characters =
(token limit (4000) - conversation history token limit (1000)) * 4 characters per token
= 12000 characters
Solving 6x = 12000 gives x = 2000, hence the allocations of 6000, 4000 and 2000 characters.

The algorithm then concatenates the chunks of text above and below the retrieved chunk while staying within the bounds of the webpage. This continues until the character budget is reached or until a maximum depth of 30 chunks has been added.

Currently, we retrieve the adjacent chunks based on the index of the retrieved chunk in the dataframe. We then concatenate chunks with index +1 or index -1, based on the conditions described above. But when scaling to hundreds of colleges, a dataframe containing all the college data will be too large. Therefore, when using a table to store this data, this algorithm should be modified to still retrieve chunks based on index, where index should be an incremented primary key.
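
Below is a rough sketch of the concatenation algorithm described above. The similarity-score threshold value and column names are assumptions, and it relies on the vector ids being the row indices of the flattened chunk dataframe, as in the upsert sketch earlier.

import pandas as pd

chunks_df = pd.read_csv("scraped_webpage_dataframe/flattened_scraped_chunks.csv")

CHAR_BUDGETS = [6000, 4000, 2000]  # most similar result gets the most characters
SCORE_THRESHOLD = 0.75             # illustrative value for the conversational-question check
MAX_DEPTH = 30                     # maximum number of adjacent chunks added per result

def expand_chunk(row_index: int, budget: int) -> str:
    # Concatenate chunks above (index - 1) and below (index + 1) the retrieved chunk,
    # staying on the same webpage and within the character budget
    url = chunks_df.loc[row_index, "url"]
    lo = hi = row_index
    text = chunks_df.loc[row_index, "chunk"]
    added = 0
    while added < MAX_DEPTH:
        grew = False
        for candidate in (lo - 1, hi + 1):
            if added >= MAX_DEPTH:
                break
            if 0 <= candidate < len(chunks_df) and chunks_df.loc[candidate, "url"] == url:
                addition = chunks_df.loc[candidate, "chunk"]
                if len(text) + len(addition) + 1 > budget:
                    continue
                if candidate < lo:
                    text, lo = addition + " " + text, candidate
                else:
                    text, hi = text + " " + addition, candidate
                added += 1
                grew = True
        if not grew:
            break
    return text

def build_context(matches) -> str:
    # matches: the top-3 results from the vector db query, ordered by similarity
    if not matches or matches[0]["score"] < SCORE_THRESHOLD:
        return ""  # conversational question: answer without webpage context
    return "\n\n".join(
        expand_chunk(int(match["id"]), budget)
        for match, budget in zip(matches, CHAR_BUDGETS)
    )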

Conversation context

To maintain context for more than one query in a session, we feed the transcript of the most recent k messages into the final inference prompt. We calculate k based on the character count of the previous conversation. Out of the ~4000 tokens that can be used as input to the language model GPT-3.5, we allocate 1000 tokens to the conversation history (~4 characters per token) and 3000 tokens to the injected webpage context. We do not feed the entire transcript in order to stay within GPT-3.5's token limit. Other memory types can be implemented (more details can be found here).
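
A minimal sketch of this truncation, using the message format from the SQLite section below; the ~4000-character budget follows the 4-characters-per-token estimate.

HISTORY_CHAR_LIMIT = 4000  # ~1000 tokens of conversation history

def recent_history(messages: list[dict]) -> str:
    # Keep the most recent messages whose combined length fits the character budget
    kept: list[str] = []
    total = 0
    for message in reversed(messages):  # newest first
        line = f'{message["type"]}: {message["data"]["content"]}'
        if total + len(line) > HISTORY_CHAR_LIMIT:
            break
        kept.append(line)
        total += len(line)
    return "\n".join(reversed(kept))  # restore chronological order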

Final Inference

The final inference is a carefully constructed prompt that consists of all the retrieved webpage text and the conversation transcript.


The template starts with instructions that aim to steer the base LLM model to behave a certain way. {webpages} and {conversation} are input fields not shown in this figure because they are typically very long.
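
Since the actual template is only shown in the figure, the sketch below is an illustrative approximation of its structure; the exact instruction wording is an assumption.

FINAL_PROMPT = """You are Genisa, an assistant that helps international students \
learn about colleges and their programs. Answer the user's last question using only \
the webpage content provided below. If the content does not contain the answer, say so.

Webpage content:
{webpages}

Conversation so far:
{conversation}

Answer:"""

def build_final_prompt(webpages: str, conversation: str) -> str:
    # webpages: concatenated chunk context; conversation: the recent transcript
    return FINAL_PROMPT.format(webpages=webpages, conversation=conversation)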

SQLite DB Schema

Conversation storage with sqlite DB. There are two columns, session_id and conversation.

The conversations are stored in the backend in a SQLite database. Each session has a unique session_id to identify the conversation. Each conversation entry is formatted as a list of JSON objects, each object being a message from either the human or the AI.

Example:
[{"type": "human", "data": {"content": "Where do babies come from?"}},
{"type": "ai", "data": {"content": "Storks bring them from the sky."}}]
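
A minimal sketch of this storage layer; the database file name is an assumption.

import json
import sqlite3
import uuid

conn = sqlite3.connect("conversations.db")  # hypothetical file name
conn.execute(
    "CREATE TABLE IF NOT EXISTS conversations (session_id TEXT PRIMARY KEY, conversation TEXT)"
)

def start_session() -> str:
    session_id = str(uuid.uuid4())
    conn.execute("INSERT INTO conversations VALUES (?, ?)", (session_id, json.dumps([])))
    conn.commit()
    return session_id

def append_message(session_id: str, role: str, content: str) -> None:
    row = conn.execute(
        "SELECT conversation FROM conversations WHERE session_id = ?", (session_id,)
    ).fetchone()
    messages = json.loads(row[0])
    messages.append({"type": role, "data": {"content": content}})
    conn.execute(
        "UPDATE conversations SET conversation = ? WHERE session_id = ?",
        (json.dumps(messages), session_id),
    )
    conn.commit()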

Deployment Frontend

We use Vercel for frontend deployment. Create a repository, push the project's Next.js setup to GitHub, and connect GitHub to Vercel; it will automatically build and deploy the Next.js application for you. Alternatively, install the Vercel CLI on your local machine and deploy from there.

Deployment Backend

This document provides guidelines on how to deploy a Flask app and DynamoDB on AWS ECS Fargate.
Requirements:

  • A Flask app that you want to deploy.
  • A Docker file for building a Docker image of your Flask app.
  • A Docker registry to store your Docker image (e.g., Amazon ECR or Docker hub)
  • A DynamoDB table to store and retrieve data.
  • An AWS ECS Fargate cluster to run your containers.
  • An AWS ECS task definition that describes how to run your containers.
  • An AWS ECS service that manages your containers and automatically scales them based on demand.
  • Application Load Balancer
  • A Security Group both for the Cluster and Application Load Balancer
  • An AWS IAM role with the necessary permissions to access the AWS resources your application needs. Some of them are listed below:
Permission                              User    Cluster
AmazonECSTaskExecutionRolePolicy        False   True
AmazonDynamoDBFullAccess                True    True
AmazonEC2ContainerRegistryFullAccess    True    True
AmazonECS_FullAccess                    True    False
CloudWatchLogsFullAccess                True    True

NOTE: These permissions are all needed, but your application may require additional ones.

Here are the steps to follow to deploy a Flask app and DynamoDB on AWS ECS Fargate:

  1. Set up your Flask app.
  2. Create a Docker file.
  3. Build and test the Docker image.
  4. Push the Docker image to AWS ECR/Docker hub.
  5. Create a Security Group.
    Here we need two distinct security groups. One for the cluster to accept requests from the Application Load Balancer and the other for the Application Load Balancer to accept outside http requests from anywhere.
    • AWS Management Console
    • Search for EC2
    • Security Groups
    • Create Security Group
      • Provide name for security group.
      • Select VPC
      • Add Inbound Rules

    NB:

    ▪ For the Application Load Balancer security group, the inbound rule is to accept any HTTP request with TCP protocol on port 80 coming from any source 0.0.0.0/0.

    ▪ For the Cluster security group, the inbound rule is to accept ALL TCP requests coming only from the Application Load Balancer Security Group.

  6. Create a new ECS Task Definition.

    Here are the steps:

  • AWS Management Console.
  • Search for ECS.
  • Task definitions.
  • Create new task definitions.
    • On Task definition configuration Section fill in:
      • Task definition family.
    • On Container Section fill in:
      • Name
      • Image URI
      • Secrets Manager ARN or name
      • Port Mapping
        • Container port
      • Environment variables if any
    • Next
    • On environment section fill in
      • App environment (select AWS Fargate)
      • Operating system/Architecture
      • Task size
        • CPU and Memory
      • Task roles, network mode
        • Task role (select ecsTaskExecutionRole)
        • Task execution role (select ecsTaskExecutionRole)
    • Next
    • Create
  7. Create a new ECS Cluster.
    Here are the steps:
    • AWS Management Console.
    • Search for ECS.
    • Clusters
    • Create Cluster
      • On Cluster Configuration Section fill in:
        • Cluster name
      • On Networking Section fill in:
        • VPC (select default VPC)
        • Subnets (remove lambda related subnets and leave the rest 6 subnets)
        • Default namespace
      • Create
  8. Launch a New ECS Service.
    Here are the steps:
    • AWS Management Console.
    • Search for ECS.
    • Clusters
    • Select the previously created cluster.
    • Service Tab
    • Create
      • On Environment Section fill in:
        • Compute configuration (select Launch Type)
        • Launch Type
        • Platform Version
      • On Deployment configuration Section fill in:
        • Application type (select Service)
        • Family (select previously created task definition)
        • Service name
        • Desired tasks
      • On Networking Section fill in:
        • VPC (select default VPC)
        • Subnets (remove lambda related subnets and leave the rest 6 subnets)
        • Security group (select Use an existing security group)
        • Security group name (select the previously created security group for the Cluster) and remove the default.
      • On Load balancing Section fill in:
        • Load balancer type (select Application Load Balancer)
        • Application Load Balancer (select Create a new load balancer)
        • Load balancer name
        • Target group name
        • Health check grace period (mostly 20 sec)
      • Create
  9. Configure Load Balancer.
    Here are the steps:
    • AWS Management Console.
    • Search for EC2.
    • Load Balancing
    • Load Balancer
      • Select the previously created load balancer.
      • Security Tab
      • Edit
      • On Security groups Section:
        • Security groups (remove any other security group attached to it and give it the Application Load Balancer Security Group created previously)

  10. Test: from the load balancer page, find the DNS name, copy it, and check it in a browser.

Appendix

Python API design

After creating the vector database, the steps to generate an answer from the chatbot are as follows:

  1. /start_session
    • Creates a new row in the SQL conversation table
    • Returns a new session_id
  2. /get_conversation
    • This is only needed if a session already exists and to retrieve a previous conversation
  3. /inference
    • Produces a new conversation based on a conversation passed in
    • Expects the user to have the last message in the conversation
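
A skeleton of these endpoints is sketched below. The request and response shapes are assumptions, and the helpers are placeholders standing in for the SQLite storage and LLM pipeline described above.

import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder helpers: in the real app these call the SQLite storage and the LLM pipeline
def create_session() -> str:
    return str(uuid.uuid4())

def load_conversation(session_id: str) -> list:
    return []

def run_inference(conversation: list) -> str:
    return "placeholder answer"

@app.route("/start_session", methods=["POST"])
def start_session():
    # Creates a new row in the SQL conversation table and returns a new session_id
    return jsonify({"session_id": create_session()})

@app.route("/get_conversation", methods=["GET"])
def get_conversation():
    # Retrieves a previous conversation for an existing session
    return jsonify({"conversation": load_conversation(request.args.get("session_id"))})

@app.route("/inference", methods=["POST"])
def inference():
    # Expects the user's message to be the last entry of the passed-in conversation
    payload = request.get_json()
    return jsonify({"answer": run_inference(payload["conversation"])})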

Ingesting from PDFs

During the scraping process, I came across a few PDF booklets that some colleges have created with relevant college and program information. These are a very organized source of information. Best of all, the text data does not require any parsing like we do in web scraping. The recipe for integrating PDFs into a vector database can be found in the LangChain docs:
https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf
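
For completeness, a minimal example following that recipe, assuming the langchain and pypdf packages and a hypothetical PDF file name:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("college_program_guide.pdf")  # hypothetical file name
pages = loader.load_and_split()  # one Document per page, ready to embed and upsert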

LLM limitations

The current information retrieval approach uses an LLM to generate an answer by first fetching the most relevant chunks of text and then synthesizing an answer from those retrieved chunks. The reason we create chunks of text data is that the LLM can only inspect a few thousand words at a time. This approach produces very accurate responses for questions that can be answered directly from the content of our vector database. However, the LLM cannot summarize the entire database to answer more "holistic" questions. For example, it would fail to answer "What are the top computer science programs in Ontario?", even if we scraped all Ontario colleges and universities. Unless there is a specific webpage that answers that question directly, the app will most likely just give a list of computer science programs from random sources by matching the key entities, e.g. "computer science program Ontario". Such meta-analysis capabilities are not currently possible with our retrieval system.

Web Scraping Limitations

The information obtained from scraping is purely text. Some information on webpages is in other forms of media, such as pictures and videos, which will not be captured by scraping. Another difficult format to scrape is tables: since the HTML code is eliminated during the cleaning process, the row and column structure is sometimes lost. Scraping works very well if most of the webpages for each college follow a standard format, with text information arranged in a predictable pattern. When webpages deviate from the common template, or have lots of media and little text, the scraped content is not as useful.

Abbreviations and acronyms explained

vectordb = vector database
LLM = Large Language Model
URL = Uniform Resource Locator, the web page’s address link
CSV = comma-separated values, a type of tabular data file
Token = the smallest unit of text that OpenAI's models operate on, roughly a short group of characters. Read more here
