# Homework 9 — TikTok Trends

In this assignment you will develop a program that displays a trends page like TikTok's. Let's call this program New York Trends. Please read the entire handout before starting to code the assignment.

## Learning Objectives

- Practice using std::priority_queue.
- Practice using std::unordered_map, std::unordered_set.
- Practice using C++ exceptions.

## Background

### TikTok Discover

According to [TikTok support](https://support.tiktok.com/en/using-tiktok/exploring-videos/discover-and-search):

Discover is a page on TikTok that allows you to search and explore the wide variety of content in the TikTok community. In this feed you'll find trending videos, hashtags, creators, and sponsored content.

To access the Discover page via the mobile app, users just tap Discover, located at the bottom of phone screen. To access the Discover page via your web browser, just go to [https://www.tiktok.com/discover](https://www.tiktok.com/discover).

![alt text](images/tiktok_discover.png "tiktok discover")

As can be seen from the above screenshot (taken on November 19th, 2023), the Discover page displays two lists of videos: trending hashtags (on the left) and trending sounds (on the right). Displaying these two lists of videos is the main task of this assignment.

## Supported Commands

Your program will be run like this:

```console
nytrends.exe input.json output.txt hashtag
nytrends.exe input.json output.txt sound
```

Here:

- *nytrends.exe* is the executable file name.
- input.json contains data collected from TikTok. In this README we will refer to this file as **the json file**.
- output.txt is where to print your output to. In this README we will refer to this file as **the output file**.
- the last argument is either *hashtag* or *sound*. When this field is *hashtag*, your program should print the top 10 trending hashtags to the output file.
When this field is *sound*, your program should print the top 10 trending sounds to the output file.

To summarize: your program reads data from **the json file**, analyzes the data to find the top 10 trending hashtags or the top 10 trending sounds, and prints them to the output file.

## Format of input.json

input.json represents the json file. It stores posts we collected from TikTok. Each line of the json file represents one post, and each line has the same format. Below is an example, which describes a post by Taylor Swift. (You can view her post [here](https://www.tiktok.com/@taylorswift/video/7216853341702278446).)

```console
{"id": "7216853341702278446", "text": "That\u2019s my whole world \ud83d\udc95 #tstheerastour #swifttok ", "createTime": 1680304615, "createTimeISO": "2023-03-31T23:16:55.000Z", "authorMeta": {"id": "6881290705605477381", "name": "taylorswift", "nickName": "Taylor Swift", "verified": true, "signature": "This is pretty much just a cat account", "bioLink": "taylorswift.com", "avatar": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/13f2a0d585f3cd8578da0d18c36a18c4~c5_720x720.jpeg?x-expires=1700456400&x-signature=jkLwlnqFUpLwoYe6TvlGXZs%2FhP8%3D", "privateAccount": false, "region": "US", "following": 0, "fans": 22900000, "heart": 200400000, "video": 61, "digg": 2161}, "musicMeta": {"musicName": "So it goes x Miss Americana", "musicAuthor": "\ud83e\udea9", "musicOriginal": false, "playUrl": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-v-27dcd7-tx/3b1da6666aed49658c9f51e43d08ea46/?a=1988&ch=0&cr=0&dr=0&er=0&lr=default&cd=0%7C0%7C0%7C0&br=250&bt=125&bti=ODszNWYuMDE6&ft=tlc-I-Inz7TfiVYZiyq8Z&mime_type=audio_mpeg&qs=6&rc=ZmY0aTtlOjY0ZjxlaDNlOUBpM212eGU6ZnVsZjMzZzU8NEBfNTE1NjAuNjAxY18tNTYtYSNxcjZtcjQwNGhgLS1kMS9zcw%3D%3D&btag=e00008000&expire=1700307910&l=202311180544290984F2C815B65729734D&ply_type=3&policy=3&signature=00588d20de31148a1b020adebf99713b&tk=0", "coverMediumUrl": "https://p16-sign.tiktokcdn-us.com/tos-useast5-avt-0068-tx/0049bec51b5b8fcacf4339562209fd19~c5_720x720.jpeg?x-expires=1700456400&x-signature=6NwY7jHmDO1xGlE4ULhwCOEA%2F6o%3D", "musicId": "7145281770450078507"}, "webVideoUrl": "https://www.tiktok.com/@taylorswift/video/7216853341702278446", "videoMeta": {"height": 1088, "width": 576, "duration": 7, "coverUrl": "https://p16-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/673c6a9a5a13481f9b1ad0c4fd1bac57?x-expires=1700456400&x-signature=knRr2wspgekIz60TWQ80WwON3%2Bw%3D", "definition": "540p", "format": "mp4", "downloadAddr": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-pve-0068-tx/71aa3cd7b7b043f484a10b6f836747cc/?a=1988&ch=0&cr=3&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C3&cv=1&br=3358&bt=1679&bti=ODszNWYuMDE6&cs=0&ds=3&ft=_rKBMBnZq8Zmoc_CKQ_vjFy.VAhLrus&mime_type=video_mp4&qs=0&rc=Zjw6ODY5aTdmOTg0NjM0ZkBpM2o2bjc6ZjlwajMzZzczNEAvMTRiNl9gNTUxLWA0XmFfYSMwYDJncjRfZmdgLS1kMS9zcw%3D%3D&btag=e00008000&expire=1700307877&l=202311180544290984F2C815B65729734D&ply_type=2&policy=2&signature=b2a0bf53c132df575cfec2b39c2dcfc7&tk=tt_chain_token"}, "diggCount": 3700000, "shareCount": 33600, "playCount": 29300000, "commentCount": 47000, "mentions": []}
```

The line is enclosed in a pair of curly braces, and every line has these same fields:

- *id*: TikTok assigns each post an id.
- *text*: each post has its text content and its video/audio content. The text content is stored here. Keep in mind that on TikTok a post can't include only text; it must contain a video. Therefore, in the remainder of this section, when we say **the video** or **this video**, we mean the video which comes with this post. When users use hashtags, the hashtags appear in the text content. In the example above, Taylor Swift used two hashtags: *#tstheerastour* and *#swifttok*.
- *createTime*: a timestamp indicating when this post was created. This is the timestamp in Unix epoch format.
It represents the number of seconds that have passed since January 1, 1970 (the Unix epoch) until the specified date and time.
- *createTimeISO*: the same creation timestamp, presented in the ISO 8601 date and time format, which is more human friendly. Here, *"T"* is a separator indicating the beginning of the time portion, and *"Z"* indicates that the time is in Coordinated Universal Time (UTC).
- *authorMeta*: the author's information, which includes multiple items.
- *musicMeta*: information about the music used in the video. This also includes multiple items.
- *webVideoUrl*: the URL of this post. To satisfy your curiosity, open this specific [webVideoUrl](https://www.tiktok.com/@taylorswift/video/7216853341702278446) in your browser, and you will see which video we are talking about right now.
- *videoMeta*: information about the video. This also includes multiple items.
- *diggCount*: how many likes this video has received.
- *shareCount*: how many times this video has been shared.
- *playCount*: how many times this video has been viewed.
- *commentCount*: how many comments users have made on this video.
- *mentions*: whom the author of this post has mentioned in the post. This could include multiple items if multiple users are mentioned.

Each field is a key-value pair. As mentioned above, four fields can include multiple items: *authorMeta*, *musicMeta*, *videoMeta*, and *mentions*. We will describe each of these four fields next.

### Author Meta

The word *meta* means metadata. Let's extract the *authorMeta* field from this same Taylor Swift post and take a closer look.
```console
"authorMeta": {"id": "6881290705605477381", "name": "taylorswift", "nickName": "Taylor Swift", "verified": true, "signature": "This is pretty much just a cat account", "bioLink": "taylorswift.com", "avatar": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/13f2a0d585f3cd8578da0d18c36a18c4~c5_720x720.jpeg?x-expires=1700456400&x-signature=jkLwlnqFUpLwoYe6TvlGXZs%2FhP8%3D", "privateAccount": false, "region": "US", "following": 0, "fans": 22900000, "heart": 200400000, "video": 61, "digg": 2161}
```

TikTok uses the following sub-fields to describe each author (i.e., user):

- *id*: TikTok assigns each author an id.
- *name*: the user name. Not necessarily the real name, but of course celebrities would use their real name for their official account.
- *nickName*: each user can also have a nickname.
- *verified*: official accounts are usually verified.
- *signature*: users can put a few words here introducing the account.
- *bioLink*: users can put a link in their bio section.
- *avatar*: link to the account's profile picture.
- *privateAccount*: is this a private account? Private accounts are only visible to users who have permission from the account owner.
- *region*: where this user is located.
- *following*: how many accounts this user is following. Taylor Swift does not follow anyone, hence her *following* is 0.
- *fans*: how many followers this account has.
- *heart*: how many likes (in total) this account has received.
- *video*: how many videos this account has posted.
- *digg*: how many likes this user has given.

Some of these sub-fields (such as name, nickName, verified, signature, bioLink, avatar, following, fans, heart) are directly visible on Taylor Swift's TikTok profile page, as shown in the following screenshot, taken on November 19th, 2023.

![alt text](images/taylor_swift.png "taylor swift profile")

### Music Meta

Let's extract the *musicMeta* field from this same Taylor Swift post and take a closer look.
```console
"musicMeta": {"musicName": "So it goes x Miss Americana", "musicAuthor": "\ud83e\udea9", "musicOriginal": false, "playUrl": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-v-27dcd7-tx/3b1da6666aed49658c9f51e43d08ea46/?a=1988&ch=0&cr=0&dr=0&er=0&lr=default&cd=0%7C0%7C0%7C0&br=250&bt=125&bti=ODszNWYuMDE6&ft=tlc-I-Inz7TfiVYZiyq8Z&mime_type=audio_mpeg&qs=6&rc=ZmY0aTtlOjY0ZjxlaDNlOUBpM212eGU6ZnVsZjMzZzU8NEBfNTE1NjAuNjAxY18tNTYtYSNxcjZtcjQwNGhgLS1kMS9zcw%3D%3D&btag=e00008000&expire=1700307910&l=202311180544290984F2C815B65729734D&ply_type=3&policy=3&signature=00588d20de31148a1b020adebf99713b&tk=0", "coverMediumUrl": "https://p16-sign.tiktokcdn-us.com/tos-useast5-avt-0068-tx/0049bec51b5b8fcacf4339562209fd19~c5_720x720.jpeg?x-expires=1700456400&x-signature=6NwY7jHmDO1xGlE4ULhwCOEA%2F6o%3D", "musicId": "7145281770450078507"}
```

TikTok uses the following sub-fields to describe each piece of music:

- *musicName*: the name of this music.
- *musicAuthor*: the author of this music.
- *musicOriginal*: is this original music?
- *playUrl*: this URL takes you to the audio content of this music.
- *coverMediumUrl*: this URL takes you to the cover image of this music.
- *musicId*: TikTok assigns each piece of music an id. Keep in mind that two songs can have the same name, but the musicId is unique.
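
Because *musicId* is unique while *musicName* is not, it is a natural key for grouping posts by sound. The following is only a sketch of one way to pull *musicId* and *playCount* out of a single post line with std::string::find and std::string::substr. The helper names `extractStringField`, `extractCountField`, and `addPost` are made up for illustration and are not part of the handout, and the code assumes the `"key": value` spacing shown in the example line above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: returns the value of a string-valued field such as
// "musicId": "7145281770450078507". Returns "" if the key is not found.
// Assumption: the value contains no escaped double quotes.
std::string extractStringField(const std::string& line, const std::string& key) {
    const std::string pattern = "\"" + key + "\": \"";
    std::size_t start = line.find(pattern);
    if (start == std::string::npos) return "";
    start += pattern.size();
    const std::size_t end = line.find('"', start);
    if (end == std::string::npos) return "";
    return line.substr(start, end - start);
}

// Hypothetical helper: returns the value of a numeric field such as
// "playCount": 29300000. Returns 0 if the key is not found.
unsigned long long extractCountField(const std::string& line, const std::string& key) {
    const std::string pattern = "\"" + key + "\": ";
    const std::size_t start = line.find(pattern);
    if (start == std::string::npos) return 0;
    // std::stoull stops at the first character that is not part of the number.
    return std::stoull(line.substr(start + pattern.size()));
}

// Adds one post's view count to the running total of its sound, keyed by musicId.
void addPost(std::unordered_map<std::string, unsigned long long>& viewsBySound,
             const std::string& line) {
    const std::string musicId = extractStringField(line, "musicId");
    if (musicId.empty()) return;  // skip lines without a musicId
    viewsBySound[musicId] += extractCountField(line, "playCount");
}
```

Keying the map by *musicId* rather than *musicName* keeps two different songs that share a name from being merged into one entry.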
### Video Meta

```console
"videoMeta": {"height": 1088, "width": 576, "duration": 7, "coverUrl": "https://p16-sign.tiktokcdn-us.com/obj/tos-useast5-p-0068-tx/673c6a9a5a13481f9b1ad0c4fd1bac57?x-expires=1700456400&x-signature=knRr2wspgekIz60TWQ80WwON3%2Bw%3D", "definition": "540p", "format": "mp4", "downloadAddr": "https://v16-webapp-prime.us.tiktok.com/video/tos/useast5/tos-useast5-pve-0068-tx/71aa3cd7b7b043f484a10b6f836747cc/?a=1988&ch=0&cr=3&dr=0&lr=tiktok_m&cd=0%7C0%7C1%7C3&cv=1&br=3358&bt=1679&bti=ODszNWYuMDE6&cs=0&ds=3&ft=_rKBMBnZq8Zmoc_CKQ_vjFy.VAhLrus&mime_type=video_mp4&qs=0&rc=Zjw6ODY5aTdmOTg0NjM0ZkBpM2o2bjc6ZjlwajMzZzczNEAvMTRiNl9gNTUxLWA0XmFfYSMwYDJncjRfZmdgLS1kMS9zcw%3D%3D&btag=e00008000&expire=1700307877&l=202311180544290984F2C815B65729734D&ply_type=2&policy=2&signature=b2a0bf53c132df575cfec2b39c2dcfc7&tk=tt_chain_token"}
```

TikTok uses the following sub-fields to describe each video:

- *height*: the height at which this video will be displayed.
- *width*: the width at which this video will be displayed.
- *duration*: the duration of this video, in seconds.
- *coverUrl*: this URL takes you to the thumbnail image of this video.
- *definition*: the definition of this video.
- *format*: the format of this video.
- *downloadAddr*: the URL where you can download this video.

### Mentions

Unlike *authorMeta*, *musicMeta*, and *videoMeta*, which include multiple sub-fields, *mentions* is an array that stores objects of the same type. If multiple users are mentioned, they all appear in this *mentions* array; if no account is mentioned, as in this Taylor Swift post, the *mentions* field is stored as an empty array:

```console
"mentions": []
```

## Output File Format

1. When users run this command:

```console
nytrends.exe input.json output.txt hashtag
```

your program should produce an output similar to what TikTok does (of course we will not print the pictures):

![alt text](images/hashtags.png "Hashtags")

These are the trending hashtags, each associated with some videos. In your output, these videos should be sorted in descending order by the number of views each video has received. More specifically, you should print the top 10 trending hashtags, and then for each hashtag, print 3 videos which use this hashtag in their post text. If a hashtag is used in 100 videos, select the 3 (out of these 100) most viewed videos. Print the most viewed video first, then the next most viewed video, and then the third most viewed video.

Definition of the top 10 trending hashtags: the ranking is based on the usage of the hashtag, that is, how many times in total each hashtag is used. When two hashtags are used the same number of times, break the tie by the total view count of the videos associated with each hashtag. If there is still a tie, break it by the hashtag's name: the smaller name is the winner.

Example 1: hashtag A is used 100 times, hashtag B is used 20 times. Then hashtag A is the clear winner.

Example 2: hashtag A is used 1000 times and hashtag B is used 1000 times, but all the posts which use hashtag A are (in total) viewed 50000 times, while all the posts which use hashtag B are (in total) viewed 2000 times. Then A is the clear winner.

Example 3: hashtag A and hashtag B are both used 100 times, and their associated videos are both viewed 1000 times in total. Hashtag A is "#tstheerastour" and hashtag B is "#swifttok"; both are std::string objects. Then "#swifttok" is the winner, because "#swifttok" is smaller than "#tstheerastour".

2. When users run this command:

```console
nytrends.exe input.json output.txt sound
```

your program should produce an output similar to what TikTok does (of course we will not print the pictures):

![alt text](images/sounds.png "sounds")

These are the trending sounds, each associated with some videos. In your output, these videos should be sorted in descending order by the number of views each video has received. More specifically, you should print the top 10 trending sounds, and then for each sound, print 3 videos which use this sound. If a sound is used in 100 videos, select the 3 (out of these 100) most viewed videos. Print the most viewed video first, then the next most viewed video, and then the third most viewed video.

Definition of the top 10 trending sounds: the ranking is based on the total view count of the videos which use this sound. If there is a tie, break it by the music id: the sound with the smaller music id is displayed first.

Example 1: sound A is used in 100 videos, and these 100 videos have been viewed (in total) 10000 times; sound B is used in 1 video, but this video has been viewed 1000000 times. Then sound B is the clear winner.

Example 2: sound A is used in 1000 videos, and these 1000 videos have been viewed (in total) 10000 times; sound B is used in 5000 videos, and these 5000 videos have also been viewed 10000 times in total. This is a tie on view count. Let's say sound A's music id is 123 and sound B's music id is 456; the smaller music id wins, so A is the winner.

## Useful Code

### getline

**Note**: this next paragraph is the same as the corresponding paragraph in homework 8, and you are once again recommended to read the whole file into a large string; but if you want to beat Jidong on the leaderboard, whether or not this is the most efficient way to read the file is a question for you to think about.
Unlike previous assignments, where the input files only contained fields separated by spaces, the fields in this assignment are not separated by spaces, so you may need a different way to read the input files. This is where the function *getline* comes into play. To read the json file and store its whole content in a std::string, you can use the following lines of code:

```cpp
// requires <fstream>, <iostream>, <string>, and <cstdlib> (for exit).
// assume inputFile is a std::string, containing the file name of the input file.
std::ifstream jsonFile(inputFile);
if (!jsonFile.is_open()) {
    std::cerr << "Failed to open the JSON file." << std::endl;
    exit(1);
}
std::string json_content;
std::string line;
while (std::getline(jsonFile, line)) {
    json_content += line;
}
// don't need this json file anymore, as the content is read into json_content.
jsonFile.close();
```

After these lines, the whole content of the json file will be stored as a string in the std::string variable *json_content*, and you can then parse it to get each individual post. To parse *json_content*, you will once again find std::string functions such as *std::string::find*() and *std::string::substr*() very useful.

### Extract Hashtags from the Post Text

Assuming you store the post text content in a std::string variable called *text*, the following code block will extract all hashtags from this text string.

```cpp
// the text of the post is given as a std::string, extract hashtags from the text.

// define a regular expression to match hashtags with emojis
std::regex hashtagRegex("#([\\w\\u0080-\\uFFFF]+)");

// create an iterator for matching
std::sregex_iterator hashtagIterator(text.begin(), text.end(), hashtagRegex);
std::sregex_iterator endIterator;

// iterate over the matches and extract the hashtags
while (hashtagIterator != endIterator) {
    std::smatch match = *hashtagIterator;
    std::string hashtag = match.str(1);  // extract the first capturing group
    // this line will print each hash tag
    // if you want to do more with each hash tag, do it here. for example, store all hash tags in your container.
    std::cout << "Hashtag: " << hashtag << std::endl;

    ++hashtagIterator;
}
```

In order to use the above code block, you need to include the regular expression library like this:

```cpp
#include <regex>
```

## Program Requirements & Submission Details

In this assignment, you are required to use std::priority_queue. You can also use any other data structures we have already learned, such as std::string, std::vector, std::list, std::map, std::set, std::pair, std::unordered_map, std::unordered_set, std::stack, std::queue. It is okay if you decide not to use std::unordered_map or std::unordered_set, although they fit very well with this assignment.

**You must use try/throw/catch to handle exceptions in your code**. You do not need to do so everywhere in your code. You will only lose points if you do not use it at all.

Use good coding style when you design and implement your program. Organize your program into functions: don't put all the code in main! Be sure to read the [Homework Policies](https://www.cs.rpi.edu/academics/courses/fall23/csci1200/homework_policies.php) as you put the finishing touches on your solution. Be sure to make up new test cases to fully debug your program and don't forget to comment your code! Use the provided template [README.txt](./README.txt) file for notes you want the grader to read.
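
The std::priority_queue and try/throw/catch requirements in this section can be met in many ways. As a minimal sketch only (not the instructor's solution), the code below ranks hashtags with std::priority_queue using the tie-breaking rules from the Output File Format section, and uses try/throw/catch to reject an unknown last command-line argument. The names `HashtagStats`, `RanksLower`, `topK`, `checkMode`, and `validateArguments` are hypothetical.

```cpp
#include <cstddef>
#include <iostream>
#include <queue>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical per-hashtag summary; the field names are assumptions.
struct HashtagStats {
    std::string name;
    unsigned long usageCount;
    unsigned long long totalViews;
};

// Returns true when a ranks LOWER than b, so that std::priority_queue
// (a max-heap) keeps the highest-ranked hashtag on top.
struct RanksLower {
    bool operator()(const HashtagStats& a, const HashtagStats& b) const {
        if (a.usageCount != b.usageCount) return a.usageCount < b.usageCount;
        if (a.totalViews != b.totalViews) return a.totalViews < b.totalViews;
        return a.name > b.name;  // the smaller name wins, so the larger name ranks lower
    }
};

// Returns up to k hashtags, best first.
std::vector<HashtagStats> topK(const std::vector<HashtagStats>& all, std::size_t k) {
    std::priority_queue<HashtagStats, std::vector<HashtagStats>, RanksLower> pq;
    for (const HashtagStats& h : all) pq.push(h);
    std::vector<HashtagStats> result;
    while (!pq.empty() && result.size() < k) {
        result.push_back(pq.top());
        pq.pop();
    }
    return result;
}

// Throws if the last command-line argument is neither "hashtag" nor "sound".
void checkMode(const std::string& mode) {
    if (mode != "hashtag" && mode != "sound") {
        throw std::invalid_argument("unknown mode: " + mode);
    }
}

// Returns 0 if the mode is accepted and 1 if it is rejected.
int validateArguments(const std::string& mode) {
    try {
        checkMode(mode);
    } catch (const std::invalid_argument& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```

A comparator for sounds would follow the same pattern, comparing total views first and then the music id.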
You must do this assignment on your own, as described in the [Collaboration Policy & Academic Integrity](https://www.cs.rpi.edu/academics/courses/fall23/csci1200/academic_integrity.php) page. If you did discuss the problem or error messages, etc. with anyone, please list their names in your README.txt file.

**Due Date**: 11/30/2023, Thursday, 11:59 pm.

## Instructor's Code

To be added (not a promise).

## Rubric

17 pts

- README.txt Completed (2 pts)
  - One of name, collaborators, or hours not filled in. (-1)
  - Two or more of name, collaborators, or hours not filled in. (-2)
- IMPLEMENTATION AND CODING STYLE (Good class design, split into a .h and .cpp file. Functions > 1 line are in .cpp file. Organized class implementation and reasonable comments throughout. Correct use of const/const& and of class method const.) (7 pts)
  - No credit (significantly incomplete implementation) (-7)
  - Putting almost everything in the main function. It's better to create separate functions for different tasks. (-2)
  - Function bodies containing more than one statement are placed in the .h file. (okay for templated classes) (-2)
  - Missing include guards in the .h file. (Or does not declare them correctly) (-1)
  - Functions are not well documented or are poorly commented, in either the .h or the .cpp file. (-1)
  - Improper uses or omissions of const and reference. (-1)
  - Overly cramped, excessive whitespace, or poor indentation. (-1)
  - Poor file organization: Puts more than one class in a file (okay for very small helper classes) (-1)
  - Poor variable names. (-1)
  - Uses global variables. (-1)
  - Contains useless comments like commented-out code, terminal commands, or silly notes. (-1)
- DATA REPRESENTATION (6 pts)
  - Does not use std::priority_queue at all. (-6)
  - Member variables are public. (-2)
- Exceptions (2 pts)
  - Does not use try/throw/catch anywhere in the code. (-2)