## Clarification

We made a clarification on the discussion forum. In case you didn't pay attention there, we are adding the clarification here.

1. When determining which document contains "Tom", we do not consider the word "Tomato" as a match; also, to simplify your task, we do not consider "Tom.", "Tom-", ".Tom", "-Tom", "_Tom", etc., as handling all of these cases would make your job much harder. In other words, the word "Tom" is found only if the character before "Tom" and the character after "Tom" are both whitespace characters.

2. However, there are two situations where the above rule does not apply:

   2.1. When constructing the snippet, the above rule does not apply. When constructing the snippet, you just find the first occurrence of that word (or that query); that is, you can simply call the **std::string::find**() function to find the first occurrence of that word (or that query) within the body section of the HTML file. Therefore your snippet may look like "I am Lady Gaga." when the search is a phrase search for "Lady Gaga": the "." after "Gaga" is okay, we do not care. This is also why, for test case 4.2, the following is shown in the snippet: "Since 1982, The Statue of Liberty-Ellis Island Foundation has partnered with the" when the search query is a phrase search for "Statue of Liberty": the "-" after "Liberty" is okay, we do not care.

   2.2. When counting the number of occurrences of each keyword (in the keyword density score calculation), the above rule does not apply either. When counting the occurrences of each keyword, you can just call the **std::string::find**() function to find each occurrence of that keyword.
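As an aside, the whitespace-delimited matching required by rule 1 can be sketched as a small helper. This is only a sketch: the names `isWholeWordAt` and `findWholeWord` are ours, not part of any required interface, and we assume here that the very start and end of the document also count as word boundaries.

```cpp
#include <cctype>
#include <string>

// True if word occurs at position pos in text with whitespace
// (or the string boundary; our assumption) immediately before and after it.
bool isWholeWordAt(const std::string& text, const std::string& word, size_t pos) {
    bool okBefore = (pos == 0) || std::isspace(static_cast<unsigned char>(text[pos - 1]));
    size_t end = pos + word.size();
    bool okAfter = (end == text.size()) || std::isspace(static_cast<unsigned char>(text[end]));
    return okBefore && okAfter;
}

// Find the first whitespace-delimited occurrence of word, or std::string::npos.
size_t findWholeWord(const std::string& text, const std::string& word) {
    for (size_t pos = text.find(word); pos != std::string::npos; pos = text.find(word, pos + 1)) {
        if (isWholeWordAt(text, word, pos)) return pos;
    }
    return std::string::npos;
}
```

With a helper like this, "Tomato soup" does not match "Tom", while "I saw Tom today" does; the snippet and occurrence-counting steps in 2.1 and 2.2, by contrast, use plain `std::string::find`.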
Therefore, when the keyword is *Gaga* and the **std::string::find**() function finds *Gaga* in the sentence "I am Lady Gaga.", that counts as a valid occurrence, even though there is a period "." after *Gaga*. You may notice that 1 and 2 are not consistent; the only reason we allow this inconsistency to exist in this assignment is to simplify your task. A fully functioning search engine would need to handle many complicated cases, and that's way beyond the scope of this course.

# Homework 7 — Design and Implementation of a Simple Google

In this assignment you will develop a simple search engine called New York Search. Your program will mimic some of the features provided by Google. Please read the entire handout before starting to code the assignment.

## Learning Objectives

- Practice writing recursive programs.
- Practice using std::map and std::set.

## Background

When talking about the Google Search Engine, what words come to your mind? Page Ranking? Inverted Indexing? Web Crawler? When developing a search engine, the first question we want to ask is: where do we start? When you type "Selena Gomez" or "Tom Brady" into the Google search box, where does Google start? Does Google start searching from one specific website? The answer is that Google does not start from one specific website; rather, it maintains a list of URLs called Seed URLs. These Seed URLs are manually chosen to represent a diverse range of high-quality, reputable websites. Search engines usually have a component called a web crawler, which crawls these URLs and then follows links from these web pages to other web pages. As the web crawler crawls those other pages, it collects links from them to still more pages, and then follows those links to crawl more web pages. This process continues; ultimately, the goal is to discover as many web pages as possible.
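The crawling loop just described can be sketched as follows. This is a rough sketch only: the tiny in-memory `links` map stands in for reading HTML files from disk and extracting their links, both of which you will implement yourself.

```cpp
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// A tiny in-memory "web" standing in for real HTML files and their links.
// Note the cycle: a.html links back to the seed page.
std::map<std::string, std::vector<std::string>> links = {
    {"index.html", {"a.html", "b.html"}},
    {"a.html", {"b.html", "index.html"}},
    {"b.html", {}},
};

// Crawl from the seed, skipping pages already visited so cycles
// cannot cause an infinite loop. Returns the set of crawled pages.
std::set<std::string> crawl(const std::string& seed) {
    std::set<std::string> visited;
    std::stack<std::string> toVisit;
    toVisit.push(seed);
    while (!toVisit.empty()) {
        std::string url = toVisit.top();
        toVisit.pop();
        if (visited.count(url)) continue;  // already crawled: skip
        visited.insert(url);
        // ... in the real program: read the file and index its content here ...
        for (const std::string& link : links[url]) {
            if (!visited.count(link)) toVisit.push(link);
        }
    }
    return visited;
}
```

The `visited` set is what keeps the crawler from looping forever when pages link back to each other.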
Once all pages are visited, the search engine will build a map, known as the inverted index, which maps terms (i.e., individual words) to web pages (also known as documents). Below is an example:

| Key (Term) | Value (List of Document References) |
|-----------|---------------------------------------|
| apple | Document1, Document3, Document5 |
| banana | Document2, Document4 |
| orange | Document1, Document2 |

**Note**: in this README, the terms web page, page, document, and HTML file all have the same meaning.

When a user enters a search query, the search engine consults its inverted index to identify the documents that match the query term. These matching documents will then be ranked based on various factors, and the ranked documents will be presented to the user. This ranking process is the so-called Page Ranking.

## Implementation

Based on the above description, you can see there are 3 steps when implementing a search engine:

1. web crawling
2. query searching
3. page ranking

Thus, in this assignment, you are recommended to write your search engine following this same order of 3 steps (this is a recommendation rather than a requirement because one mentor told us that she can produce all the results in the web crawling stage and does not need 3 separate steps). More details about each of these 3 steps are described below:

### Web Crawling

The Web Crawler's goal is to build the inverted index.

### Query Searching

The Query Searching component's goal is to identify the matching documents.

### Page Ranking

Once the matching documents are identified, you should rank these documents and present them to the user. Google uses a variety of factors in its page ranking, but in this assignment, your page ranking is required to consider the following factors:

- Keywords Density.
- Backlinks.
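Stepping back to the inverted-index table shown earlier: it maps directly onto the containers this homework practices. A minimal sketch (the names `invertedIndex` and `addTerm` are ours):

```cpp
#include <map>
#include <set>
#include <string>

// Inverted index: term -> set of documents containing that term.
// std::set keeps each document listed once, in a deterministic order.
std::map<std::string, std::set<std::string>> invertedIndex;

// Record that a term was seen in a document.
void addTerm(const std::string& term, const std::string& doc) {
    invertedIndex[term].insert(doc);
}
```

After calling `addTerm("apple", "Document1")`, `addTerm("apple", "Document3")`, and `addTerm("apple", "Document5")`, looking up `invertedIndex["apple"]` yields exactly the first row of the table above.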
For each page to be presented, we calculate a page score, and then present these pages in descending order to the user, i.e., pages whose page score is higher should be presented first. As the page score consists of two factors, we will calculate a score for each of these two factors, and we name them the *keywords density score* and the *backlinks score*, respectively. Once we have these two scores, we can get the page score using this formula:

page score = (0.5 * keywords density score + 0.5 * backlinks score); [**formula 1**]

In order to match the results used by the autograder, you should define all scores as *double*. Next we will describe how to calculate the keywords density score and the backlinks score.

#### Keywords Density Score

A search query may contain one keyword or multiple keywords. Given a set of keywords, we can calculate the keywords density score by following these two steps:

1. Calculate a density score for each keyword within the document.
2. Accumulate these individual density scores into a combined score.

For each keyword, the keyword's density score is a measure of how the keyword's frequency in a document compares to its occurrence in all documents, and we can use the following formula to calculate the density score of one keyword:

```console
Keyword Density Score = (Number of Times Keyword Appears) / (Total Content Length of this One Document * Keyword Density Across All Documents)
```

Here, we consider the content of each document as a string. Also, "Total Content Length" means the total length of the whole document, not just the length of the <body> section; and "Number of Times Keyword Appears" means the number of times the keyword appears in the whole document, not just in the <body> section. Similarly, when calculating the "Keyword Density Across All Documents", you should also consider the whole document, not just the <body> section.
Let's explain this formula with an example. Say we have 4 documents in total, and the user wants to search *Tom Cruise*. Assume the first document has 50 characters (i.e., the document length of the first document is 50), the second document has 40 characters, the third document has 100 characters, and the fourth document has 200 characters. The keyword *Tom* appears 2 times in the first document, 3 times in the second document, 4 times in the third document, and 0 times in the fourth document. Then for this keyword *Tom*, the density across all documents would be:

```console
(2 + 3 + 4 + 0) / (50 + 40 + 100 + 200) = 0.023
```

and the keyword density score for this keyword *Tom* in the first document would be:

```console
2 / (50 * 0.023) = 1.739
```

and the keyword density score for this keyword *Tom* in the second document would be:

```console
3 / (40 * 0.023) = 3.261
```

and the keyword density score for this keyword *Tom* in the third document would be:

```console
4 / (100 * 0.023) = 1.739
```

Once we get the density score for the keyword *Tom* in the first document (let's denote this score by denScore1), and the density score for the keyword *Cruise* in the first document (let's denote this score by denScore2), the keywords density score for the search query *Tom Cruise* in the first document would be *(denScore1 + denScore2)*.

#### Backlinks Score

A backlinks score for a web page is based on the importance of its incoming backlinks: pages with fewer outgoing links are considered more valuable and contribute more to the score. Say there are N web pages which have links pointing to the current page. We name these pages doc_1, doc_2, ..., doc_N, and we use doc_i->outgoingLinks to denote how many outgoing links document i has.
Then we can calculate the backlinks score of the current page as follows:

```console
backlinks score = ( 1.0 / (1 + doc_1->outgoingLinks) + 1.0 / (1 + doc_2->outgoingLinks) + ... + 1.0 / (1 + doc_N->outgoingLinks) );
```

Once you have both the keywords density score and the backlinks score, you can then use [formula 1](#formula-1) to get the overall score for a page.

## Assignment Scope

To reduce the scope of the assignment, and hence the amount of work for you, we make the following rules for this search engine.

### Rule 1. Case-sensitive Search Engine

Search engines are usually case-insensitive, but making the search engine case-insensitive would require some extra work and would likely need functions we have not learned in this course. Therefore, to simplify your task and reduce your workload, the search engine you are going to implement in this assignment is case-sensitive.

### Rule 2. Search HTML Files Only

Search engines like Google search all types of files on the Internet, but in this assignment, we assume all the files we search are HTML files. We consider an HTML file to contain the search query only if the search query can be found within the <body> section of the HTML file. The <body> section, enclosed within the <body></body> tags in an HTML document, represents the primary content area of the web page.

Based on Rule 1 and Rule 2: when the search query is *Tom Cruise*, the second page shown in this image should not be included in your search results, unless the words *Tom Cruise* appear in some other part of the <body></body> section of this web page which is not displayed here.

But wait, we see *Tom Cruise* here:

That's true, but this line is not in the <body> section of the HTML file; it is created via a meta description tag which is in the <head> section of the HTML file. We will have more details on this in [a later section](#the-description) in this README.
The same thing applies to this line:

This line is not in the <body> section of the HTML file; rather, it is created via a title tag which is in the <head> section of the HTML file. More details on this in [a later section](#the-title) in this README.

### Rule 3. Search Query: No More Than 3 Words

We also limit the user to searching no more than 3 words in each query. Based on this rule, we allow users to search *Tom*, *Tom Cruise*, or *Tom and Jerry*, but *Tom Hanks Academy Award* is not allowed, as it contains more than 3 words.

### Rule 4. Local Searching Only

The search engine you implement will not search anything on the Internet, as that requires extensive knowledge of computer networks and would need network libraries, which is way beyond the scope of this course. In this assignment, we limit our searches to a local folder, which is provided as [html_files](html_files). You are also not allowed to use file system libraries such as <filesystem> to access the HTML files; rather, you should follow the instructions given in the [other useful code](#other-useful-code) section to open HTML files and follow links within each HTML file to get to other HTML files.

## Supported Commands

Your program will be run like this:

```console
nysearch.exe html_files/index.html output.txt Tom
nysearch.exe html_files/index.html output.txt Tom Cruise
nysearch.exe html_files/index.html output.txt Tom and Jerry
nysearch.exe html_files/index.html output.txt "Tom Cruise"
```

Here:

- *nysearch.exe* is the executable file name.
- html_files/index.html is the Seed URL. While Google maintains a list of Seed URLs, in this assignment we will just use one single HTML file as the Seed page, and the path of this file is the Seed URL.
- output.txt is where to print your output.
- *Tom* is an example of a search query containing one word, *Tom Cruise* is an example of a search query containing two words, and *Tom and Jerry* is an example of a search query containing three words.
- *"Tom Cruise"* is an example of a phrase search, in which the user wants to find an exact match to this whole phrase.

### Phrase Search vs Regular Search

Your search engine should support both phrase search and regular search.

1. When searching multiple words with double quotes, it is called a phrase search. In a phrase search, the whole phrase must exist somewhere in the searched document. In other words, the search engine will search for the exact phrase, word for word, in the specified order.
2. When searching multiple words without double quotes, it is called a regular search. In this assignment, we define the term *regular search* as follows: the search engine should look for documents which contain every word of the search query, but these words do not need to appear together, and they can appear in any order within the document.

Based on the above definition, a document which only contains the following two lines (in the body section of the HTML file) is a valid document when the user searches *Tom Cruise*:

```console
Tom and Jerry show
Have Fun And Save Now With Great Deals When You Cruise With Carnival. Book Online Today.
```

because we can find both the word *Tom* and the word *Cruise*. But it is not a valid document if the user does a phrase search, *"Tom Cruise"*, as no exact match can be found in this document.

## Input Files

All the input files are HTML files, and they are provided under the [html_files](html_files) directory. Among these HTML files, only one will be provided via the command line; this file will be considered the Seed file, and the path of this file (i.e., html_files/index.html) will therefore be used as the Seed URL. Your web crawler should search this HTML file, find the links it contains, follow these links to crawl other HTML files, and repeat this process until you cannot reach any more files.
Keep in mind that links which take you to an HTML file you have already crawled should be skipped; otherwise you will get into an infinite loop.

## Output File Format

The output of your program should go to the output file.

- If no matches can be found for a search query, your search engine should print the following message to the output file:

```console
Your search - dsdwoddjojdjeokdddfjwoewojo - did not match any documents.
```

Replace *dsdwoddjojdjeokdddfjwoewojo* with the search query. This behavior matches what Google does.

- If matches are found, you should print the ranked results in a format similar to what Google does, as shown in the following image:

More specifically, for each document, print:

1. the title
2. the url
3. the description
4. a snippet

### The Title

In all the HTML files we provide, the <head> section of the HTML has a "title" element. It is used to define the title of the web page or document. In the following example, the text "ESPN" within the <title> tags represents the title of the web page, which is typically displayed in the browser's title bar or tab, and it is often used by search engines to display the title of the page in search results.

```html