From 95499cee2f9ba47bad3f3d1069d349b5e9b6f0fd Mon Sep 17 00:00:00 2001
From: Jidong Xiao
Date: Thu, 14 Mar 2024 19:41:17 -0400
Subject: [PATCH] clarify on the input files

---
 old_hws/07_search_engine/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/old_hws/07_search_engine/README.md b/old_hws/07_search_engine/README.md
index c227149..367d2fc 100644
--- a/old_hws/07_search_engine/README.md
+++ b/old_hws/07_search_engine/README.md
@@ -225,7 +225,9 @@ The function takes one single character as its sole argument. It return a non-ze
 
 ## Input Files
 
-All the input files are HTML files, and they are provided under the [html_files](html_files) directory. Among these HTML files, there is only one HTML file which will be provided via the command line, and this file will be considered as the Seed file, and the path of this file (i.e. html_files/index.html) therefore will be used as the Seed URL. Your web crawler should search this HTML file and find links contained in this HTML file, and then follow these links to crawl other HTML files, and repeat this process until you can not reach any more files. Keep in mind that links which take you to an HTML file which you have already crawled, should be skipped, otherwise you will get into an infinite loop situation.
+Your program takes two types of input files: the HTML files and the input.txt file, which contains all the search query terms.
+
+All the HTML files are provided under the [html_files](html_files) directory. Among these HTML files, there is only one HTML file which will be provided via the command line, and this file will be considered as the Seed file, and the path of this file (i.e. html_files/index.html) therefore will be used as the Seed URL. Your web crawler should search this HTML file and find links contained in this HTML file, and then follow these links to crawl other HTML files, and repeat this process until you can not reach any more files. Keep in mind that links which take you to an HTML file which you have already crawled, should be skipped, otherwise you will get into an infinite loop situation.
 
 ## Output File Format