Compare commits

2 commits: 1a0f9f7bf1 ... 3b991a8135

| Author | SHA1 | Date |
|---|---|---|
| | 3b991a8135 | |
| | 320e5c296b | |
.gitignore (vendored, 5 lines changed)

```diff
@@ -1,4 +1,5 @@
 *.lock
 public/
-resources/
+resources/_gen/
 isableFastRander/
+.hugo_build.lock
```
```diff
@@ -640,6 +640,8 @@ defaultContentLanguageInSubdir = true
 [params.page.heading.number]
   # whether to enable auto heading numbering
   enable = false
+  # FixIt 0.3.3 | NEW only enable in main section pages (default is posts)
+  onlyMainSection = true
 [params.page.heading.number.format]
   h1 = "{title}"
   h2 = "{h2} {title}"
```
|||||||
BIN
content/en/posts/csci-1100/hw-6/HW6.zip
Normal file
BIN
content/en/posts/csci-1100/hw-6/HW6.zip
Normal file
Binary file not shown.
439
content/en/posts/csci-1100/hw-6/index.md
Normal file
439
content/en/posts/csci-1100/hw-6/index.md
Normal file
@@ -0,0 +1,439 @@
---
title: CSCI 1100 - Homework 6 - Files, Sets and Document Analysis
subtitle:
date: 2024-04-13T15:36:47-04:00
slug: csci-1100-hw-6
draft: false
author:
  name: James
  link: https://www.jamesflare.com
  email:
  avatar: /site-logo.avif
description: This blog post introduces a Python programming assignment for analyzing and comparing text documents using natural language processing techniques, such as calculating word length, distinct word ratios, and Jaccard similarity between word sets and pairs.
keywords: ["Python", "natural language processing", "text analysis", "document comparison"]
license:
comment: true
weight: 0
tags:
  - CSCI 1100
  - Homework
  - RPI
  - Python
  - Programming
categories:
  - Programming
collections:
  - CSCI 1100
hiddenFromHomePage: false
hiddenFromSearch: false
hiddenFromRss: false
hiddenFromRelated: false
summary: This blog post introduces a Python programming assignment for analyzing and comparing text documents using natural language processing techniques, such as calculating word length, distinct word ratios, and Jaccard similarity between word sets and pairs.
resources:
  - name: featured-image
    src: featured-image.jpg
  - name: featured-image-preview
    src: featured-image-preview.jpg
toc: true
math: true
lightgallery: false
password:
message:
repost:
  enable: true
  url:

# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
---

<!--more-->
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This homework is worth 100 points total toward your overall homework grade. It is due Thursday, March 21, 2024 at 11:59:59 pm. As usual, there will be a mix of autograded points, instructor test case points, and TA graded points. There is just one "part" to this homework.
|
||||||
|
|
||||||
|
See the handout for Submission Guidelines and Collaboration Policy for a discussion on grading and on what is considered excessive collaboration. These rules will be in force for the rest of the semester.
|
||||||
|
|
||||||
|
You will need the data files we provide in `hw6_files.zip`, so be sure to download this file from the Course Materials section of Submitty and unzip it into your directory for HW 6. The zip file contains data files and example input / output for your program.
|
||||||
|
|
||||||
|
## Problem Introduction
|
||||||
|
|
||||||
|
There are many software systems for analyzing the style and sophistication of written text and even deciding if two documents were authored by the same individual. The systems analyze documents based on the sophistication of word usage, frequently used words, and words that appear closely together. In this assignment you will write a Python program that reads two files containing the text of two different documents, analyzes each document, and compares the documents. The methods we use are simple versions of much more sophisticated methods that are used in practice in the field known as natural language processing (NLP).
|
||||||
|
|
||||||
|
## Files and Parameters
|
||||||
|
|
||||||
|
Your program must work with three files and an integer parameter.
|
||||||
|
|
||||||
|
The name of the first file will be `stop.txt` for every run of your program, so you don't need to ask the user for it. The file contains what we will refer to as "stop words" — words that should be ignored. You must ensure that the file `stop.txt` is in the same folder as your `hw6_sol.py` python file. We will provide one example of it, but may use others in testing your code.
|
||||||
|
|
||||||
|
You must request the names of two documents to analyze and compare and an integer "maximum separation" parameter, which will be referred to as `max_sep` here. The requests should look like:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Enter the first file to analyze and compare ==> doc1.txt
|
||||||
|
doc1.txt
|
||||||
|
Enter the second file to analyze and compare ==> doc2.txt
|
||||||
|
doc2.txt
|
||||||
|
Enter the maximum separation between words in a pair ==> 2
|
||||||
|
2
|
||||||
|
```
|
||||||
|
|
||||||
|
## Parsing
|
||||||
|
|
||||||
|
The job of parsing for this homework is to break a file of text into a single list of consecutive words. To do this, the contents from a file should first be split up into a list of strings, where each string contains consecutive non-white-space characters. Then each string should have all non-letters removed and all letters converted to lower case. For example, if the contents of a file (e.g., `doc1.txt`) are read to form the string (note the end-of-line and tab characters)
|
||||||
|
|
||||||
|
```python
|
||||||
|
s = " 01-34 can't 42weather67 puPPy, \r \t and123\n Ch73%allenge 10ho32use,.\n"
|
||||||
|
```
|
||||||
|
|
||||||
|
then the splitting should produce the list of strings
|
||||||
|
|
||||||
|
```python
|
||||||
|
['01-34', "can't", '42weather67', 'puPPy,', 'and123', 'Ch73%allenge', '10ho32use,.']
|
||||||
|
```
|
||||||
|
|
||||||
|
and, after removing the non-letters and lowercasing, this should be reduced to the list of (non-empty) strings
|
||||||
|
|
||||||
|
```python
|
||||||
|
['cant', 'weather', 'puppy', 'and', 'challenge', 'house']
|
||||||
|
```
|
||||||
|
|
||||||
|
Note that the first string, `'01-34'` is completely removed because it has no letters. All three files — `stop.txt` and the two document files called `doc1.txt` and `doc2.txt` above — should be parsed this way.
|
||||||
|
|
||||||
|
Once this parsing is done, the list resulting from parsing the file `stop.txt` should be converted to a set. This set contains what are referred to in NLP as "stop words" — words that appear so frequently in text that they should be ignored.
|
||||||
|
|
||||||
|
The files `doc1.txt` and `doc2.txt` contain the text of the two documents to compare. For each, the list returned from parsing should be further modified by removing any stop words. Continuing with our example, if `'cant'` and `'and'` are stop words, then the word list should be reduced to
|
||||||
|
|
||||||
|
```python
|
||||||
|
['weather', 'puppy', 'challenge', 'house']
|
||||||
|
```
|
||||||
|
|
||||||
|
Words like "and" are almost always in stop lists, while "cant" (really, the contraction "can't") is in some. Note that the word lists built from `doc1.txt` and `doc2.txt` should be kept as lists because the word ordering is important.
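
To make these steps concrete, here is a minimal sketch of the parsing pipeline applied to the example string. The hard-coded stop-word set stands in for the parsed contents of `stop.txt` and is for illustration only, not the required program structure:

```python
# Stand-in for the parsed contents of stop.txt; for illustration only.
stop_words = {"cant", "and"}

s = " 01-34 can't 42weather67 puPPy, \r \t and123\n Ch73%allenge 10ho32use,.\n"

tokens = s.split()  # split on runs of whitespace (spaces, tabs, newlines)
cleaned = []
for tok in tokens:
    word = "".join(ch.lower() for ch in tok if ch.isalpha())
    if word:  # '01-34' has no letters, so it is dropped here
        cleaned.append(word)

word_list = [w for w in cleaned if w not in stop_words]
print(word_list)  # ['weather', 'puppy', 'challenge', 'house']
```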
|
||||||
|
|
||||||
|
### Analyze Each Document's Word List
|
||||||
|
Once you have produced the word list with stop words removed, you are ready to analyze the word list. There are many ways to do this, but here are the ones required for this assignment:
|
||||||
|
|
||||||
|
1. Calculate and output the average word length, accurate to two decimal places. The idea here is that word length is a rough indicator of sophistication.
|
||||||
|
|
||||||
|
2. Calculate and output, accurate to three decimal places, the ratio between the number of distinct words and the total number of words. This is a measure of the variety of language used (although it must be remembered that some authors use words and phrases repeatedly to strengthen their message.)
|
||||||
|
|
||||||
|
3. For each word length starting at 1, find the set of words having that length. Print the length, the number of different words having that length, and at most six of these words. If, for a certain length, there are six or fewer words, print all of them; if there are more than six, print the first three and the last three in alphabetical order. For example, suppose our simple text example above were expanded to the list
|
||||||
|
|
||||||
|
```python
|
||||||
|
['weather', 'puppy', 'challenge', 'house', 'whistle', 'nation', 'vest',
|
||||||
|
'safety', 'house', 'puppy', 'card', 'weather', 'card', 'bike',
|
||||||
|
'equality', 'justice', 'pride', 'orange', 'track', 'truck',
|
||||||
|
'basket', 'bakery', 'apples', 'bike', 'truck', 'horse', 'house',
|
||||||
|
'scratch', 'matter', 'trash']
|
||||||
|
```
|
||||||
|
|
||||||
|
Then the output should be
|
||||||
|
|
||||||
|
```text
|
||||||
|
1: 0:
|
||||||
|
2: 0:
|
||||||
|
3: 0:
|
||||||
|
4: 3: bike card vest
|
||||||
|
5: 7: horse house pride ... track trash truck
|
||||||
|
6: 7: apples bakery basket ... nation orange safety
|
||||||
|
7: 4: justice scratch weather whistle
|
||||||
|
8: 1: equality
|
||||||
|
9: 1: challenge
|
||||||
|
```
|
||||||
|
|
||||||
|
4. Find the distinct word pairs for this document. A word pair is a two-tuple of words that appear `max_sep` or fewer positions apart in the document list. For example, if the user input resulted in `max_sep == 2`, then the first six word pairs generated will be:
|
||||||
|
|
||||||
|
```python
|
||||||
|
('puppy', 'weather'), ('challenge', 'weather'),
|
||||||
|
('challenge', 'puppy'), ('house', 'puppy'),
|
||||||
|
('challenge', 'house'), ('challenge', 'whistle')
|
||||||
|
```
|
||||||
|
|
||||||
|
Your program should output the total number of distinct word pairs. (Note that `('puppy', 'weather')` and `('weather', 'puppy')` should be considered the same word pair.) It should also output the first 5 word pairs in alphabetical order (as opposed to the order they are formed, which is what is written above) and the last 5 word pairs. You may assume, without checking, that there are enough words to generate these pairs. Here is the output for the longer example above (assuming that the name of the file they are read from is `ex2.txt`):
|
||||||
|
|
||||||
|
```text
|
||||||
|
Word pairs for document ex2.txt
|
||||||
|
54 distinct pairs
|
||||||
|
apples bakery
|
||||||
|
apples basket
|
||||||
|
apples bike
|
||||||
|
apples truck
|
||||||
|
bakery basket
|
||||||
|
...
|
||||||
|
puppy weather
|
||||||
|
safety vest
|
||||||
|
scratch trash
|
||||||
|
track truck
|
||||||
|
vest whistle
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Finally, as a measure of how distinct the word pairs are, calculate and output, accurate to three decimal places, the ratio of the number of distinct word pairs to the total number of word pairs.
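
As an illustration of items 4 and 5 (a sketch only, not the required program structure), generating the pairs and the distinct-pair ratio can look like this:

```python
def word_pairs(words, max_sep):
    """All pairs of words at most max_sep positions apart; order within a pair is ignored."""
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_sep + 1, len(words))):
            pairs.append(tuple(sorted((words[i], words[j]))))
    return pairs

pairs = word_pairs(['weather', 'puppy', 'challenge', 'house'], 2)
distinct = sorted(set(pairs))
print(len(distinct), "distinct of", len(pairs), "total,",
      "ratio {:.3f}".format(len(distinct) / len(pairs)))
```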
|
||||||
|
|
||||||
|
#### Compare Documents
|
||||||
|
The last step is to compare the documents for complexity and similarity. There are many possible measures, so we will implement just a few.
|
||||||
|
|
||||||
|
Before we do this we need to define a measure of similarity between two sets. A very common one, and the one we use here, is called Jaccard Similarity. This is a sophisticated-sounding name for a very simple concept (something that happens a lot in computer science and other STEM disciplines). If A and B are two sets, then the Jaccard similarity is just
|
||||||
|
|
||||||
|
$$
|
||||||
|
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
|
||||||
|
$$
|
||||||
|
|
||||||
|
In plain English it is the size of the intersection between two sets divided by the size of their union. As examples, if $A$ and $B$ are equal, $J(A, B)$ = 1, and if A and B are disjoint, $J(A, B)$ = 0. As a special case, if one or both of the sets is empty the measure is 0. The Jaccard measure is quite easy to calculate using Python set operations.
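
For instance, a direct translation of the definition into Python set operations could look like this sketch (the empty-union guard handles the special case just mentioned):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when the union is empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

print(jaccard({"cat", "hat", "bat"}, {"cat", "dog"}))  # 1 shared / 4 total -> 0.25
print(jaccard(set(), set()))                           # special case -> 0.0
```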
|
||||||
|
|
||||||
|
Here are the comparison measures between documents:
|
||||||
|
|
||||||
|
1. Decide which has a greater average word length. This is a rough measure of which uses more sophisticated language.
|
||||||
|
|
||||||
|
2. Calculate the Jaccard similarity in the overall word use in the two documents. This should be accurate to three decimal places.
|
||||||
|
|
||||||
|
3. Calculate the Jaccard similarity of word use for each word length. Each output should also be accurate to three decimal places.
|
||||||
|
|
||||||
|
4. Calculate the Jaccard similarity between the word pair sets. The output should be accurate to four decimal places. The documents we study here will not have substantial similarity of pairs, but in other cases this is a useful comparison measure.
|
||||||
|
|
||||||
|
See the example outputs for details.
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- An important part of this assignment is practicing with sets. The most complicated instance occurs when calculating the word sets for each word length: this requires you to build a list of sets, where the set at entry k of the list contains the words of length k (see the sketch after these notes).
|
||||||
|
|
||||||
|
- Sorting a list or a set of two-tuples of strings is straightforward. (Note that when you sort a set, the result is a list.) The ordering produced is alphabetical by the first element of the tuple and then, for ties, alphabetical by the second. For example,
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> v = [('elephant', 'kenya'), ('lion', 'kenya'), ('elephant', 'tanzania'), \
|
||||||
|
('bear', 'russia'), ('bear', 'canada')]
|
||||||
|
>>> sorted(v)
|
||||||
|
[('bear', 'canada'), ('bear', 'russia'), ('elephant', 'kenya'), \
|
||||||
|
('elephant', 'tanzania'), ('lion', 'kenya')]
|
||||||
|
```
|
||||||
|
|
||||||
|
- Submit just a single Python file, `hw6_sol.py`.
|
||||||
|
|
||||||
|
- A component missing from our analysis is the frequency with which each word appears. This is easy to keep track of using a dictionary, but we will not do that for this assignment. As you learn about dictionaries think about how they might be used to enhance the analysis we do here.
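
Returning to the first note above, here is one way (a sketch only, not a required interface) to build the list of word sets indexed by length:

```python
def words_by_length(words):
    """Entry k of the returned list is the set of words of length k."""
    longest = max(len(w) for w in words)
    sets_by_len = [set() for _ in range(longest + 1)]
    for w in words:
        sets_by_len[len(w)].add(w)
    return sets_by_len

table = words_by_length(['weather', 'puppy', 'challenge', 'house'])
print(table[5])  # {'puppy', 'house'} (set display order may vary)
```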
|
||||||
|
|
||||||
|
## Document Files
|
||||||
|
|
||||||
|
We have provided the example described above, and we will also test your code against several other documents, a few of which are:
|
||||||
|
|
||||||
|
- Elizabeth Alexander's poem Praise Song for the Day.
|
||||||
|
- Maya Angelou's poem On the Pulse of the Morning.
|
||||||
|
- A scene from William Shakespeare's Hamlet.
|
||||||
|
- Dr. Seuss's The Cat in the Hat
|
||||||
|
- Walt Whitman's When Lilacs Last in the Dooryard Bloom'd (not all of it!)
|
||||||
|
|
||||||
|
All of these are available in full text online. See poetryfoundation.org and learn about some of the history of these poets, playwrights, and authors.
|
||||||
|
|
||||||
|
## Supporting Files
|
||||||
|
|
||||||
|
{{< link href="HW6.zip" content="HW6.zip" title="Download HW6.zip" download="HW6.zip" card=true >}}
|
||||||
|
|
||||||
|
## Solution
|
||||||
|
|
||||||
|
### hw6_sol.py
|
||||||
|
|
||||||
|
```python
"""
This is an implementation of the homework 6 solution for CSCI-1100
"""

#work_dir = "/mnt/c/Users/james/OneDrive/RPI/Spring 2024/CSCI-1100/Homeworks/HW6/hw6_files/"
work_dir = ""
stop_word = "stop.txt"


def get_stopwords():
    stopwords = []
    stoptxt = open(work_dir + stop_word, "r")
    stop_words = stoptxt.read().split("\n")
    stoptxt.close()
    stop_words = [x.strip() for x in stop_words if x.strip() != ""]
    for i in stop_words:
        text = ""
        for j in i:
            if j.isalpha():
                text += j.lower()
        if text != "":
            stopwords.append(text)
    #print("Debug - Stop words:", stopwords)
    return set(stopwords)


def parse(raw):
    parsed = []
    parsing = raw.replace("\n"," ").replace("\t"," ").replace("\r"," ").split(" ")
    #print("Debug - Parsing step 1:", parsing)
    parsing = [x.strip() for x in parsing if x.strip() != ""]
    #print("Debug - Parsing step 2:", parsing)
    for i in parsing:
        text = ""
        for j in i:
            if j.isalpha():
                text += j.lower()
        if text != "":
            parsed.append(text)
    #print("Debug - Parsing step 3:", parsed)
    stopwords = get_stopwords()  # read the stop-word file once instead of once per word
    parsed = [x for x in parsed if x not in stopwords]
    #print("Debug - Parsing step 4:", parsed)
    return parsed


def get_avg_word_len(file):
    #print("Debug - File:", file)
    filetxt = open(work_dir + file, "r")
    raw = filetxt.read()
    filetxt.close()
    parsed = parse(raw)
    #print("Debug - Parsed:", parsed)
    avg = sum([len(x) for x in parsed]) / len(parsed)
    #print("Debug - Average:", avg)
    return avg


def get_ratio_distinct(file):
    filetxt = open(work_dir + file, "r").read()
    distinct = list(set(parse(filetxt)))
    total = len(parse(filetxt))
    ratio = len(distinct) / total
    #print("Debug - Distinct:", ratio)
    return ratio


def word_length_ranking(file):
    filetxt = open(work_dir + file, "r").read()
    parsed = parse(filetxt)
    max_length = max([len(x) for x in parsed])
    #print("Debug - Max length:", max_length)
    ranking = [[] for i in range(max_length + 1)]
    for i in parsed:
        if i not in ranking[len(i)]:
            ranking[len(i)].append(i)
            #print("Debug - Adding", i, "to", len(i))
    for i in range(len(ranking)):
        ranking[i] = sorted(ranking[i])
    #print("Debug - Ranking:", ranking)
    return ranking


def get_word_set_table(file):
    str1 = ""
    data = word_length_ranking(file)
    for i in range(1, len(data)):
        cache = ""
        if len(data[i]) <= 6:
            cache = " ".join(data[i])
        else:
            cache = " ".join(data[i][:3]) + " ... "
            cache += " ".join(data[i][-3:])
        if cache != "":
            str1 += "{:4d}:{:4d}: {}\n".format(i, len(data[i]), cache)
        else:
            str1 += "{:4d}:{:4d}:\n".format(i, len(data[i]))
    return str1.rstrip()


def get_word_pairs(file, maxsep):
    filetxt = open(work_dir + file, "r").read()
    parsed = parse(filetxt)
    pairs = []
    for i in range(len(parsed)):
        for j in range(i+1, len(parsed)):
            if j - i <= maxsep:
                pairs.append((parsed[i], parsed[j]))
    return pairs


def get_distinct_pairs(file, maxsep):
    total_pairs = get_word_pairs(file, maxsep)
    pairs = []
    for i in total_pairs:
        cache = sorted([i[0], i[1]])
        pairs.append((cache[0], cache[1]))
    return sorted(list(set(pairs)))


def get_word_pair_table(file, maxsep):
    pairs = get_distinct_pairs(file, maxsep)
    #print("Debug - Pairs:", pairs)
    str1 = " "
    str1 += str(len(pairs)) + " distinct pairs" + "\n"
    if len(pairs) <= 10:
        for i in pairs:
            str1 += " {} {}\n".format(i[0], i[1])
    else:
        for i in pairs[:5]:
            str1 += " {} {}\n".format(i[0], i[1])
        str1 += " ...\n"
        for i in pairs[-5:]:
            str1 += " {} {}\n".format(i[0], i[1])
    return str1.rstrip()


def get_jaccard_similarity(list1, list2):
    setA = set(list1)
    setB = set(list2)
    intersection = len(setA & setB)
    union = len(setA | setB)
    if union == 0:
        return 0.0
    else:
        return intersection / union


def get_word_similarity(file1, file2):
    file1txt = open(work_dir + file1, "r").read()
    file2txt = open(work_dir + file2, "r").read()
    parsed1 = parse(file1txt)
    parsed2 = parse(file2txt)
    return get_jaccard_similarity(parsed1, parsed2)


def get_word_similarity_by_length(file1, file2):
    word_by_length_1 = word_length_ranking(file1)
    word_by_length_2 = word_length_ranking(file2)
    similarity = []
    for i in range(1, max(len(word_by_length_1), len(word_by_length_2))):
        if i < len(word_by_length_1) and i < len(word_by_length_2):
            similarity.append(get_jaccard_similarity(word_by_length_1[i], word_by_length_2[i]))
        else:
            similarity.append(0.0)
    return similarity


def get_word_similarity_by_length_table(file1, file2):
    similarity = get_word_similarity_by_length(file1, file2)
    str1 = ""
    for i in range(len(similarity)):
        str1 += "{:4d}: {:.4f}\n".format(i+1, similarity[i])
    return str1.rstrip()


def get_word_pairs_similarity(file1, file2, maxsep):
    pairs1 = get_distinct_pairs(file1, maxsep)
    pairs2 = get_distinct_pairs(file2, maxsep)
    return get_jaccard_similarity(pairs1, pairs2)


if __name__ == "__main__":
    # Debugging
    #file1st = "cat_in_the_hat.txt"
    #file2rd = "pulse_morning.txt"
    #maxsep = 2

    #s = " 01-34 can't 42weather67 puPPy, \r \t and123\n Ch73%allenge 10ho32use,.\n"
    #print(parse(s))
    #get_avg_word_len(file1st)
    #get_ratio_distinct(file1st)
    #print(word_length_ranking(file1st)[10])
    #print(get_word_set_table(file1st))

    # Get user input
    file1st = input("Enter the first file to analyze and compare ==> ").strip()
    print(file1st)
    file2rd = input("Enter the second file to analyze and compare ==> ").strip()
    print(file2rd)
    maxsep = int(input("Enter the maximum separation between words in a pair ==> ").strip())
    print(maxsep)

    files = [file1st, file2rd]
    for i in files:
        print("\nEvaluating document", i)
        print("1. Average word length: {:.2f}".format(get_avg_word_len(i)))
        print("2. Ratio of distinct words to total words: {:.3f}".format(get_ratio_distinct(i)))
        print("3. Word sets for document {}:\n{}".format(i, get_word_set_table(i)))
        print("4. Word pairs for document {}\n{}".format(i, get_word_pair_table(i, maxsep)))
        print("5. Ratio of distinct word pairs to total: {:.3f}".format(len(get_distinct_pairs(i, maxsep)) / len(get_word_pairs(i, maxsep))))

    print("\nSummary comparison")
    avg_word_length_ranking = []
    for i in files:
        length = get_avg_word_len(i)
        avg_word_length_ranking.append((i, length))
    avg_word_length_ranking = sorted(avg_word_length_ranking, key=lambda x: x[1], reverse=True)
    print("1. {} on average uses longer words than {}".format(avg_word_length_ranking[0][0], avg_word_length_ranking[1][0]))
    print("2. Overall word use similarity: {:.3f}".format(get_word_similarity(file1st, file2rd)))
    print("3. Word use similarity by length:\n{}".format(get_word_similarity_by_length_table(file1st, file2rd)))
    print("4. Word pair similarity: {:.4f}".format(get_word_pairs_similarity(file1st, file2rd, maxsep)))
```
|
||||||
content/en/posts/wordpress/cc-attack-on-index-php/index.md (new file, 204 lines)

@@ -0,0 +1,204 @@
|
|||||||
|
---
|
||||||
|
title: Principles and Discussion of DDoS Attacks Targeting WordPress Features
|
||||||
|
subtitle:
|
||||||
|
date: 2024-04-13T13:12:44-04:00
|
||||||
|
slug: cc-attack-on-index-php
|
||||||
|
draft: false
|
||||||
|
author:
|
||||||
|
name: James
|
||||||
|
link: https://www.jamesflare.com
|
||||||
|
email:
|
||||||
|
avatar: /site-logo.avif
|
||||||
|
description: This blog post explores the principles and challenges of a specific DDoS attack targeting WordPress instances by requesting non-existent paths to bypass caching mechanisms, and discusses possible defense and offense strategies from the perspectives of both blue and red teams.
|
||||||
|
keywords: ["DDoS", "WordPress", "Nginx", "CloudFlare", "IPv6"]
|
||||||
|
license:
|
||||||
|
comment: true
|
||||||
|
weight: 0
|
||||||
|
tags:
|
||||||
|
- WordPress
|
||||||
|
- Nginx
|
||||||
|
- WAF
|
||||||
|
categories:
|
||||||
|
- Security
|
||||||
|
- Discussion
|
||||||
|
hiddenFromHomePage: false
|
||||||
|
hiddenFromSearch: false
|
||||||
|
hiddenFromRss: false
|
||||||
|
hiddenFromRelated: false
|
||||||
|
summary: This blog post explores the principles and challenges of a specific DDoS attack targeting WordPress instances by requesting non-existent paths to bypass caching mechanisms, and discusses possible defense and offense strategies from the perspectives of both blue and red teams.
|
||||||
|
resources:
|
||||||
|
- name: featured-image
|
||||||
|
src: featured-image.jpg
|
||||||
|
- name: featured-image-preview
|
||||||
|
src: featured-image-preview.jpg
|
||||||
|
toc: true
|
||||||
|
math: true
|
||||||
|
lightgallery: false
|
||||||
|
password:
|
||||||
|
message:
|
||||||
|
repost:
|
||||||
|
enable: true
|
||||||
|
url:
|
||||||
|
|
||||||
|
# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--more-->
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
KEIJILION mentioned in [Website Defense Upgrade: fail2ban Integrates with CloudFlare to Intelligently Block Malicious Attacks](https://blog.kejilion.pro/fail2ban-cloudflare/) that there is a DDoS attack method targeting WordPress instances, which bypasses the caching mechanism by requesting a non-existent path to attack the server. However, he did not clearly explain the principle of this attack method, so I will attempt to explain it here.
|
||||||
|
|
||||||
|
First, Layer 7 DDoS attacks require the target server to run a program or code in order to consume its resources. In other words, our requests cannot be intercepted before reaching WordPress, because the overhead of WAFs like Nginx intercepting requests is very small, and it would be very costly to exhaust them with a massive number of requests, which is beyond the capability of most people.
|
||||||
|
|
||||||
|
Secondly, the requests cannot hit the cache, because if they are cached by CDN or other means, the WordPress program is not involved in serving them, and our goal of exhausting the target server's resources cannot be achieved. Finally, avoid being banned based on IP, UA, or other content, for the same reason as why requests cannot be intercepted before reaching WordPress.
|
||||||
|
|
||||||
|
## Challenge
|
||||||
|
|
||||||
|
So why does the "404" attack mentioned by KEIJILION result in cache penetration and resource exhaustion?
|
||||||
|
|
||||||
|
### Behavior
|
||||||
|
|
||||||
|
First, let's talk about what the "404" attack he refers to is. It's actually generating random URL paths. For example:
|
||||||
|
|
||||||
|
```text
|
||||||
|
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7
|
||||||
|
https://www.jamesflare.com/en/mvQ3oX3NJRCfy8LBdWdL
|
||||||
|
https://www.jamesflare.com/en/AK3VdReDX4AKmAYanV9j
|
||||||
|
https://www.jamesflare.com/en/2Msmu2zDGwA4Fd4hDroF
|
||||||
|
https://www.jamesflare.com/en/crq8KXvMaFphdYhGNaFA
|
||||||
|
```
|
||||||
|
|
||||||
|
This is done to penetrate the cache and let the requests reach the WordPress instance.
|
||||||
|
|
||||||
|
### Nginx Rewrite Rule
|
||||||
|
|
||||||
|
You may ask, doesn't that just return a 404? What's the problem with that? First, we need to understand that returning a 404 also has an overhead, and this overhead varies across different applications. If it's a static website served by Nginx, then when Nginx can't find the requested file locally, it will return a 404, and the overhead is very low.
|
||||||
|
|
||||||
|
But in WordPress, this overhead is not very low, which has to do with its logic. WordPress pages need to be handled by `index.php`, and we've probably written similar Rewrite Rules when configuring it, right?
|
||||||
|
|
||||||
|
```nginx
|
||||||
|
# enforce NO www
|
||||||
|
if ($host ~* ^www\.(.*))
|
||||||
|
{
|
||||||
|
set $host_without_www $1;
|
||||||
|
rewrite ^/(.*)$ $scheme://$host_without_www/$1 permanent;
|
||||||
|
}
|
||||||
|
|
||||||
|
# unless the request is for a valid file, send to bootstrap
|
||||||
|
if (!-e $request_filename)
|
||||||
|
{
|
||||||
|
rewrite ^(.+)$ /index.php?q=$1 last;
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The idea is to let `index.php` handle the request:
|
||||||
|
|
||||||
|
```text
|
||||||
|
# Requested by the browser
|
||||||
|
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7
|
||||||
|
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7.jpg
|
||||||
|
# What WordPress sees
|
||||||
|
https://www.jamesflare.com/index.php/en/XXbwQMzBFL27zizGAeh7
|
||||||
|
https://www.jamesflare.com/index.php?q=XXbwQMzBFL27zizGAeh7.jpg
|
||||||
|
```
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
flowchart TD
|
||||||
|
Browser --https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7--> Nginx --https://www.jamesflare.com/index.php/en/XXbwQMzBFL27zizGAeh7--> WordPress
|
||||||
|
```
|
||||||
|
|
||||||
|
### Performance
|
||||||
|
|
||||||
|
Now the pressure is on WordPress. It first matches the database to see if there is such an article, and after not finding it, returns a 404, then starts weaving the HTML content of the 404 page like knitting a sweater. If your 404 page is more stylish and uses more resources, the overhead is even greater. It's almost equivalent to directly accessing a dynamic article (but the overhead should still be less than a real article).
|
||||||
|
|
||||||
|
Also, people may overestimate the performance of their VPS. The vast majority of VPS have poor performance, and 3-4 junk cores may not even match half a core of your laptop. Without caching, a few dozen RPS may be enough to crash the webpage.
|
||||||
|
|
||||||
|
{{< link href="../netcup-arm-review" content="netcup vServer (ARM64) Benchmark and Review" title="netcup vServer (ARM64) Benchmark and Review" card=true >}}
|
||||||
|
|
||||||
|
This 18-core VPS only reaches the level of an AMD Ryzen 7 7840U, and that's conservative, because my laptop also has an AMD Ryzen 7 7840U, and its performance is about 40% higher than the data in Geekbench 6. The multi-core score in the database is 8718, while my actual test is 12127, so the 18-core VPS is roughly 8650.
|
||||||
|
|
||||||
|
## Blue Team
|
||||||
|
|
||||||
|
I speculate that KEIJILION's solution to this `index.php` feature is to scan the Nginx logs for IPs that generate abnormal codes like 404, and add them to the CloudFlare blacklist via API, releasing them after an hour. The logs look something like this:
|
||||||
|
|
||||||
|
```text
|
||||||
|
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain1.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
|
||||||
|
```
|
||||||
|
|
||||||
|
This greatly alleviates this vulnerability (actually I think it's more appropriate to call it a feature).
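
My guess at what such a patch might look like, as a rough sketch rather than KEIJILION's actual implementation: tally 404s per IP from the access log and push the offenders to a CloudFlare IP Access Rule. The log path, zone ID, token, and threshold below are placeholders, and you should check the current CloudFlare API documentation before relying on this endpoint.

```python
import re
from collections import Counter

import requests

# Placeholders: adjust for your own setup.
LOG_PATH = "/var/log/nginx/access.log"
ZONE_ID = "your_zone_id"
API_TOKEN = "your_api_token"
THRESHOLD = 50  # 404s per scan before an IP is blocked

# Combined-log-format line: client IP first, status code right after the quoted request.
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = line_re.match(line)
        if m and m.group(2) == "404":
            counts[m.group(1)] += 1

for ip, hits in counts.items():
    if hits >= THRESHOLD:
        # Create an IP Access Rule that blocks the offender for this zone.
        requests.post(
            f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/access_rules/rules",
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            json={
                "mode": "block",
                "configuration": {"target": "ip", "value": ip},
                "notes": "too many 404s",
            },
            timeout=10,
        )
```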
|
||||||
|
|
||||||
|
## Red Team
|
||||||
|
|
||||||
|
So, what are the loopholes in this model? First, let's sort out the ideas. This specialized approach is based on:
|
||||||
|
|
||||||
|
1. Detect abnormal HTTP codes
|
||||||
|
2. Block IP
|
||||||
|
|
||||||
|
In addition to this extra patch, general security measures include:
|
||||||
|
|
||||||
|
1. Rate limiting
|
||||||
|
2. CloudFlare security rules
|
||||||
|
1. IP address risk
|
||||||
|
2. Browser fingerprint
|
||||||
|
3. UA
|
||||||
|
4. Human verification
|
||||||
|
3. Origin blocks non-CloudFlare IPs
|
||||||
|
|
||||||
|
### Objective
|
||||||
|
|
||||||
|
So, our approach is also clear, which is to find a way to hit the site's dynamic resources with our requests. Let's break down the tasks:
|
||||||
|
|
||||||
|
1. Find dynamic resources
|
||||||
|
2. Bypass CloudFlare's security measures
|
||||||
|
3. Send requests at a not-so-aggressive rate
|
||||||
|
|
||||||
|
First, we need to understand that the performance of most people's VPS is not good, and WordPress is not as efficient as imagined. I think handling a few dozen to a hundred RPS without caching is already very high, and it's not impossible to crash with just a dozen RPS. Those who aggressively send packets, easily reaching thousands or tens of thousands of RPS, and don't spread the traffic across many IPs, who else would they ban if not you?
|
||||||
|
|
||||||
|
### IP Wave Tactics
|
||||||
|
|
||||||
|
Here I'll just throw out some ideas; playing the Red Team, let me offer a few dirty tricks. The most brute-force one is to keep the earlier "404" attack but spread it across a huge pool of IP addresses, in the extreme using each IP for only a single request. You might ask how that is even possible: wouldn't it take tens or hundreds of thousands of IP addresses in an hour? Even the DDoS attacks recorded in history only involved tens of thousands of addresses, and renting IPs must cost at least $1 apiece, right?
|
||||||
|
|
||||||
|
You're right, but also not quite right. This is the case for IPv4 addresses, but what about IPv6? Many VPS come with a /48 subnet when you buy them, and even if they're stingy, it's a /64 subnet, which is still incomparably huge.
|
||||||
|
|
||||||
|
|Prefix Length|Example Address|Address Range|
|-|-|-|
|32|2001:db8::/32|2001:0db8:0000:0000:0000:0000:0000:0000 – 2001:0db8:ffff:ffff:ffff:ffff:ffff:ffff|
|40|2001:db8:ab00::/40|2001:0db8:ab00:0000:0000:0000:0000:0000 – 2001:0db8:abff:ffff:ffff:ffff:ffff:ffff|
|48|2001:db8\:abcd::/48|2001:0db8\:abcd:0000:0000:0000:0000:0000 – 2001:0db8\:abcd:ffff:ffff:ffff:ffff:ffff|
|56|2001:db8\:abcd:1200::/56|2001:0db8\:abcd:1200:0000:0000:0000:0000 – 2001:0db8\:abcd:12ff:ffff:ffff:ffff:ffff|
|64|2001:db8\:abcd\:1234::/64|2001:0db8\:abcd\:1234:0000:0000:0000:0000 – 2001:0db8\:abcd\:1234:ffff:ffff:ffff:ffff|
|
||||||
|
|
||||||
|
To make it easy, let's not count reserved addresses. A /64 prefix means there are still 64 bits of address space behind it, which is 2^64 IP addresses. This number is astronomical, and I don't know if there are that many grains of sand on Earth. If the blocking strategy is not changed, it is absolutely impossible to block them one address at a time.
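
To get a feel for how cheap that address space is to use, here is a small sketch that draws random host addresses from a /64. The prefix below is the IPv6 documentation range used as a stand-in, and actually answering on such addresses requires the subnet to be routed to your machine.

```python
import ipaddress
import random

# Documentation prefix as a stand-in for a routed /64.
net = ipaddress.IPv6Network("2001:db8:abcd:1234::/64")

def random_address(network: ipaddress.IPv6Network) -> ipaddress.IPv6Address:
    # Uniformly random host address inside the prefix.
    offset = random.getrandbits(128 - network.prefixlen)
    return network.network_address + offset

for _ in range(5):
    print(random_address(net))
```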
|
||||||
|
|
||||||
|
You might say this tactic is outrageous and relies on the other side not reacting in time: wouldn't it be unfortunate if they simply started blocking subnet by subnet? Fair enough, but you can mix subnets of different sizes, and blocking whole subnets means blocking a huge number of addresses at once, which is a very bad choice for the defender. Besides, it's not impossible to rent a slice of a /32; if you are on good terms with a vendor and they route you something only slightly smaller than a /32, how is the defender supposed to cope with that?
|
||||||
|
|
||||||
|
### Chromedriver
|
||||||
|
|
||||||
|
Okay, here is another dirty trick. Since it doesn't take a very large request volume to crash WordPress, and the performance gap is so wide, it is not out of the question to use heavier tooling and drive real browsers to get past CloudFlare's challenges.
|
||||||
|
|
||||||
|
[ultrafunkamsterdam/undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)
|
||||||
|
|
||||||
|
We can use this modified Selenium Chromedriver to bypass CloudFlare's verification code, UA, browser fingerprint, and other detection methods.
|
||||||
|
|
||||||
|
Then find a more dynamic entry point, such as typing random content into the search box. Combined with the IPv6 human-wave tactics above, a few dozen RPS is enough to put the target into a performance crisis. Running that many Selenium Chromedrivers does cost some resources, but it is not hard to do on your own laptop. From the Blue Team's perspective, though, it is a headache: they see an extremely normal-looking pattern, with each IP address visiting once every half hour, hour, or even several hours, and some IPs never returning at all. Would you suspect your website had gone viral somewhere, rather than that it was under attack?
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
KEIJILION mentioned a DDoS attack method targeting WordPress instances, which bypasses the caching mechanism by requesting a non-existent path to consume server resources. This attack is effective because:
|
||||||
|
|
||||||
|
1. WordPress page requests all need to be handled by index.php, and even 404 pages consume some resources. Many VPS have limited performance, and WordPress without caching may only be able to handle a few dozen RPS.
|
||||||
|
|
||||||
|
2. Attackers bypass CDN, WAF, and other defenses by requesting random URL paths, allowing requests to penetrate the cache and directly reach the WordPress backend.
|
||||||
|
|
||||||
|
3. Attackers will avoid being banned based on IP, UA, etc. to evade detection.
|
||||||
|
|
||||||
|
KEIJILION may alleviate this problem by scanning Nginx logs for abnormal status codes and blocking the corresponding IPs. But this solution still has some loopholes:
|
||||||
|
|
||||||
|
1. Attackers can leverage the huge address space of IPv6, making it nearly impossible to block one by one.
|
||||||
|
|
||||||
|
2. By simulating real browser requests, CloudFlare's verification code, fingerprint, and other detections can be bypassed, making the attack look like normal access.
|
||||||
|
|
||||||
|
3. Due to WordPress's performance bottleneck, attackers only need low-rate attacks of a few dozen RPS to cause harm, which is difficult to notice.
|
||||||
|
|
||||||
|
In summary, this type of DDoS attack exploits the characteristics of the WordPress architecture and is difficult to completely prevent. In addition to improving WordPress performance, website administrators also need more comprehensive monitoring and defense measures.
|
||||||
BIN content/zh-cn/posts/csci-1100/hw-6/HW6.zip (new file; binary file not shown)

content/zh-cn/posts/csci-1100/hw-6/index.md (new file, 439 lines)

@@ -0,0 +1,439 @@
|
|||||||
|
---
|
||||||
|
title: CSCI 1100 - 作业 6 - 文件、集合和文档分析
|
||||||
|
subtitle:
|
||||||
|
date: 2024-04-13T15:36:47-04:00
|
||||||
|
slug: csci-1100-hw-6
|
||||||
|
draft: false
|
||||||
|
author:
|
||||||
|
name: James
|
||||||
|
link: https://www.jamesflare.com
|
||||||
|
email:
|
||||||
|
avatar: /site-logo.avif
|
||||||
|
description: 这篇博文介绍了一个 Python 编程作业,使用自然语言处理技术分析和比较文本文档,例如计算单词长度、不同单词比率以及单词集和对之间的 Jaccard 相似度。
|
||||||
|
keywords: ["Python", "自然语言处理", "文本分析", "文档比较"]
|
||||||
|
license:
|
||||||
|
comment: true
|
||||||
|
weight: 0
|
||||||
|
tags:
|
||||||
|
- CSCI 1100
|
||||||
|
- 作业
|
||||||
|
- RPI
|
||||||
|
- Python
|
||||||
|
- 编程
|
||||||
|
categories:
|
||||||
|
- 编程语言
|
||||||
|
collections:
|
||||||
|
- CSCI 1100
|
||||||
|
hiddenFromHomePage: false
|
||||||
|
hiddenFromSearch: false
|
||||||
|
hiddenFromRss: false
|
||||||
|
hiddenFromRelated: false
|
||||||
|
summary: 这篇博文介绍了一个 Python 编程作业,使用自然语言处理技术分析和比较文本文档,例如计算单词长度、不同单词比率以及单词集和对之间的 Jaccard 相似度。
|
||||||
|
resources:
|
||||||
|
- name: featured-image
|
||||||
|
src: featured-image.jpg
|
||||||
|
- name: featured-image-preview
|
||||||
|
src: featured-image-preview.jpg
|
||||||
|
toc: true
|
||||||
|
math: true
|
||||||
|
lightgallery: false
|
||||||
|
password:
|
||||||
|
message:
|
||||||
|
repost:
|
||||||
|
enable: true
|
||||||
|
url:
|
||||||
|
|
||||||
|
# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--more-->
|
||||||
|
|
||||||
|
## 概述
|
||||||
|
|
||||||
|
这个作业在你的总作业成绩中占 100 分。截止日期为 2024 年 3 月 21 日星期四晚上 11:59:59。像往常一样,会有自动评分分数、教师测试用例分数和助教评分分数的混合。这个作业只有一个"部分"。
|
||||||
|
|
||||||
|
请参阅提交指南和协作政策手册,了解关于评分和过度协作的讨论。这些规则将在本学期剩余时间内生效。
|
||||||
|
|
||||||
|
你将需要我们在 `hw6_files.zip` 中提供的数据文件,所以请务必从 Submitty 的课程材料部分下载此文件,并将其解压缩到你的 HW 6 目录中。该 zip 文件包含数据文件以及程序的示例输入/输出。
|
||||||
|
|
||||||
|
## 问题介绍
|
||||||
|
|
||||||
|
有许多软件系统可以分析书面文本的风格和复杂程度,甚至可以判断两个文档是否由同一个人撰写。这些系统根据词汇使用的复杂程度、常用词以及紧密出现在一起的词来分析文档。在这个作业中,你将编写一个 Python 程序,读取包含两个不同文档文本的两个文件,分析每个文档,并比较这些文档。我们使用的方法是在自然语言处理 (NLP) 领域实际使用的更复杂方法的简化版本。
|
||||||
|
|
||||||
|
## 文件和参数
|
||||||
|
|
||||||
|
你的程序必须使用三个文件和一个整数参数。
|
||||||
|
|
||||||
|
第一个文件的名称对于你程序的每次运行都将是 `stop.txt`,所以你不需要询问用户。该文件包含我们将称为"停用词"的内容——应该忽略的词。你必须确保 `stop.txt` 文件与你的 `hw6_sol.py` Python 文件在同一文件夹中。我们将提供一个示例,但可能在测试你的代码时使用其他示例。
|
||||||
|
|
||||||
|
你必须请求要分析和比较的两个文档的名称以及一个整数"最大分隔"参数,这里将称为 `max_sep`。请求应如下所示:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Enter the first file to analyze and compare ==> doc1.txt
|
||||||
|
doc1.txt
|
||||||
|
Enter the second file to analyze and compare ==> doc2.txt
|
||||||
|
doc2.txt
|
||||||
|
Enter the maximum separation between words in a pair ==> 2
|
||||||
|
2
|
||||||
|
```
|
||||||
|
|
||||||
|
## 解析
|
||||||
|
|
||||||
|
这个作业的解析工作是将文本文件分解为一个连续单词的列表。为此,应首先将文件的内容拆分为字符串列表,其中每个字符串包含连续的非空白字符。然后,每个字符串应删除所有非字母并将所有字母转换为小写。例如,如果文件的内容(例如 `doc1.txt`)被读取以形成字符串(注意行尾和制表符)
|
||||||
|
|
||||||
|
```python
|
||||||
|
s = " 01-34 can't 42weather67 puPPy, \r \t and123\n Ch73%allenge 10ho32use,.\n"
|
||||||
|
```
|
||||||
|
|
||||||
|
然后拆分应产生字符串列表
|
||||||
|
|
||||||
|
```python
|
||||||
|
['01-34', "can't", '42weather67', 'puPPy,', 'and123', 'Ch73%allenge', '10ho32use,.']
|
||||||
|
```
|
||||||
|
|
||||||
|
并且这应该被拆分为(非空)字符串列表
|
||||||
|
|
||||||
|
```python
|
||||||
|
['cant', 'weather', 'puppy', 'and', 'challenge', 'house']
|
||||||
|
```
|
||||||
|
|
||||||
|
请注意,第一个字符串 `'01-34'` 被完全删除,因为它没有字母。所有三个文件——`stop.txt` 和上面称为 `doc1.txt` 和 `doc2.txt` 的两个文档文件——都应以这种方式解析。
|
||||||
|
|
||||||
|
完成此解析后,解析 `stop.txt` 文件产生的列表应转换为集合。此集合包含在 NLP 中被称为"停用词"的内容——出现频率如此之高以至于应该忽略的词。
|
||||||
|
|
||||||
|
`doc1.txt` 和 `doc2.txt` 文件包含要比较的两个文档的文本。对于每个文件,从解析返回的列表应通过删除任何停用词来进一步修改。继续我们的示例,如果 `'cant'` 和 `'and'` 是停用词,那么单词列表应减少为
|
||||||
|
|
||||||
|
```python
|
||||||
|
['weather', 'puppy', 'challenge', 'house']
|
||||||
|
```
|
||||||
|
|
||||||
|
像"and"这样的词几乎总是在停用词列表中,而"cant"(实际上是缩写"can't")在某些列表中。请注意,从 `doc1.txt` 和 `doc2.txt` 构建的单词列表应保留为列表,因为单词顺序很重要。
|
||||||
|
|
||||||
|
### 分析每个文档的单词列表
|
||||||
|
一旦你生成了删除停用词的单词列表,你就可以分析单词列表了。有很多方法可以做到这一点,但以下是此作业所需的方法:
|
||||||
|
|
||||||
|
1. 计算并输出平均单词长度,精确到小数点后两位。这里的想法是单词长度是复杂程度的粗略指标。
|
||||||
|
|
||||||
|
2. 计算并输出不同单词数与总单词数之比,精确到小数点后三位。这是衡量所使用语言多样性的一种方法(尽管必须记住,一些作者重复使用单词和短语以加强他们的信息。)
|
||||||
|
|
||||||
|
3. 对于从 1 开始的每个单词长度,找到具有该长度的单词集。打印长度、具有该长度的不同单词数以及最多六个这些单词。如果对于某个长度,有六个或更少的单词,则打印所有六个,但如果有超过六个,则按字母顺序打印前三个和后三个。例如,假设我们上面的简单文本示例扩展为列表
|
||||||
|
|
||||||
|
```python
|
||||||
|
['weather', 'puppy', 'challenge', 'house', 'whistle', 'nation', 'vest',
|
||||||
|
'safety', 'house', 'puppy', 'card', 'weather', 'card', 'bike',
|
||||||
|
'equality', 'justice', 'pride', 'orange', 'track', 'truck',
|
||||||
|
'basket', 'bakery', 'apples', 'bike', 'truck', 'horse', 'house',
|
||||||
|
'scratch', 'matter', 'trash']
|
||||||
|
```
|
||||||
|
|
||||||
|
那么输出应该是
|
||||||
|
|
||||||
|
```text
|
||||||
|
1: 0:
|
||||||
|
2: 0:
|
||||||
|
3: 0:
|
||||||
|
4: 3: bike card vest
|
||||||
|
5: 7: horse house pride ... track trash truck
|
||||||
|
6: 7: apples bakery basket ... nation orange safety
|
||||||
|
7: 4: justice scratch weather whistle
|
||||||
|
8: 1: equality
|
||||||
|
9: 1: challenge
|
||||||
|
```
|
||||||
|
|
||||||
|
4. 找到此文档的不同单词对。单词对是文档列表中相隔 `max_sep` 个或更少位置出现的两个单词的二元组。例如,如果用户输入导致 `max_sep == 2`,那么生成的前六个单词对将是:
|
||||||
|
|
||||||
|
```python
|
||||||
|
('puppy', 'weather'), ('challenge', 'weather'),
|
||||||
|
('challenge', 'puppy'), ('house', 'puppy'),
|
||||||
|
('challenge', 'house'), ('challenge', 'whistle')
|
||||||
|
```
|
||||||
|
|
||||||
|
你的程序应输出不同单词对的总数。(请注意,`('puppy', 'weather')` 和 `('weather', 'puppy')` 应视为相同的单词对。)它还应按字母顺序输出前 5 个单词对(而不是它们形成的顺序,上面写的就是这样)和最后 5 个单词对。你可以假设,无需检查,有足够的单词来生成这些对。以下是上面较长示例的输出(假设读取它们的文件名为 `ex2.txt`):
|
||||||
|
|
||||||
|
```text
|
||||||
|
Word pairs for document ex2.txt
|
||||||
|
54 distinct pairs
|
||||||
|
apples bakery
|
||||||
|
apples basket
|
||||||
|
apples bike
|
||||||
|
apples truck
|
||||||
|
bakery basket
|
||||||
|
...
|
||||||
|
puppy weather
|
||||||
|
safety vest
|
||||||
|
scratch trash
|
||||||
|
track truck
|
||||||
|
vest whistle
|
||||||
|
```
|
||||||
|
|
||||||
|
5. 最后,作为单词对的独特性的度量,计算并输出不同单词对的数量与单词对总数之比,精确到小数点后三位。
|
||||||
|
|
||||||
|
#### 比较文档
|
||||||
|
最后一步是比较文档的复杂性和相似性。有许多可能的度量方法,所以我们将只实现其中的一些。
|
||||||
|
|
||||||
|
在我们这样做之前,我们需要定义两个集合之间的相似性度量。一个非常常见的,也是我们在这里使用的,称为 Jaccard 相似度。这是一个听起来很复杂的名称,但概念非常简单(在计算机科学和其他 STEM 学科中经常发生这种情况)。如果 A 和 B 是两个集合,那么 Jaccard 相似度就是
|
||||||
|
|
||||||
|
$$
|
||||||
|
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
|
||||||
|
$$
|
||||||
|
|
||||||
|
用通俗的英语来说,它就是两个集合的交集大小除以它们的并集大小。举例来说,如果 $A$ 和 $B$ 相等,$J(A, B)$ = 1,如果 A 和 B 不相交,$J(A, B)$ = 0。作为特殊情况,如果一个或两个集合为空,则度量为 0。使用 Python 集合操作可以非常容易地计算 Jaccard 度量。
|
||||||
|
|
||||||
|
以下是文档之间的比较度量:
|
||||||
|
|
||||||
|
1. 决定哪个文档的平均单词长度更大。这是衡量哪个文档使用更复杂语言的粗略度量。
|
||||||
|
|
||||||
|
2. 计算两个文档中总体单词使用的 Jaccard 相似度。这应精确到小数点后三位。
|
||||||
|
|
||||||
|
3. 计算每个单词长度的单词使用的 Jaccard 相似度。每个输出也应精确到小数点后三位。
|
||||||
|
|
||||||
|
4. 计算单词对集之间的 Jaccard 相似度。输出应精确到小数点后四位。我们在这里研究的文档不会有实质性的对相似性,但在其他情况下,这是一个有用的比较度量。
|
||||||
|
|
||||||
|
有关详细信息,请参阅示例输出。
|
||||||
|
|
||||||
|
## 注意事项
|
||||||
|
|
||||||
|
- 本作业的一个重要部分是练习使用集合。最复杂的情况发生在处理每个单词长度的单词集的计算时。这需要你形成一个集合列表。与列表中的条目 k 相关联的集合应该是长度为 k 的单词。
|
||||||
|
|
||||||
|
- 对字符串的二元组列表或集合进行排序很简单。(请注意,当你对一个集合进行排序时,结果是一个列表。)产生的顺序是按元组的第一个元素按字母顺序排列,然后对于相同的元素,按第二个元素按字母顺序排列。例如,
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> v = [('elephant', 'kenya'), ('lion', 'kenya'), ('elephant', 'tanzania'), \
|
||||||
|
('bear', 'russia'), ('bear', 'canada')]
|
||||||
|
>>> sorted(v)
|
||||||
|
[('bear', 'canada'), ('bear', 'russia'), ('elephant', 'kenya'), \
|
||||||
|
('elephant', 'tanzania'), ('lion', 'kenya')]
|
||||||
|
```
|
||||||
|
|
||||||
|
- 只提交一个 Python 文件 `hw6_sol.py`。
|
||||||
|
|
||||||
|
- 我们分析中缺少的一个组成部分是每个单词出现的频率。使用字典可以很容易地跟踪这一点,但我们不会在这个作业中这样做。当你学习字典时,思考一下它们如何用于增强我们在这里所做的分析。
|
||||||
|
|
||||||
|
## 文档文件
|
||||||
|
|
||||||
|
我们提供了上面描述的示例,我们将使用其他几个文档测试你的代码(其中一些是):
|
||||||
|
|
||||||
|
- Elizabeth Alexander 的诗《Praise Song for the Day》。
|
||||||
|
- Maya Angelou 的诗《On the Pulse of the Morning》。
|
||||||
|
- William Shakespeare 的《Hamlet》中的一个场景。
|
||||||
|
- Dr. Seuss 的《The Cat in the Hat》
|
||||||
|
- Walt Whitman 的《When Lilacs Last in the Dooryard Bloom'd》(不是全部!)
|
||||||
|
|
||||||
|
所有这些都可以在网上全文阅读。请访问poetryfoundation.org,了解这些诗人、剧作家和作者的一些历史。
|
||||||
|
|
||||||
|
## 支持文件
|
||||||
|
|
||||||
|
{{< link href="HW6.zip" content="HW6.zip" title="Download HW6.zip" download="HW6.zip" card=true >}}
|
||||||
|
|
||||||
|
## 参考答案
|
||||||
|
|
||||||
|
### hw6_sol.py
|
||||||
|
|
||||||
|
```python
|
||||||
|
"""
|
||||||
|
This is a implement of the homework 6 solution for CSCI-1100
|
||||||
|
"""
|
||||||
|
|
||||||
|
#work_dir = "/mnt/c/Users/james/OneDrive/RPI/Spring 2024/CSCI-1100/Homeworks/HW6/hw6_files/"
|
||||||
|
work_dir = ""
|
||||||
|
stop_word = "stop.txt"
|
||||||
|
|
||||||
|
def get_stopwords():
|
||||||
|
stopwords = []
|
||||||
|
stoptxt = open(work_dir + stop_word, "r")
|
||||||
|
stop_words = stoptxt.read().split("\n")
|
||||||
|
stoptxt.close()
|
||||||
|
stop_words = [x.strip() for x in stop_words if x.strip() != ""]
|
||||||
|
for i in stop_words:
|
||||||
|
text = ""
|
||||||
|
for j in i:
|
||||||
|
if j.isalpha():
|
||||||
|
text += j.lower()
|
||||||
|
if text != "":
|
||||||
|
stopwords.append(text)
|
||||||
|
#print("Debug - Stop words:", stopwords)
|
||||||
|
return set(stopwords)
|
||||||
|
|
||||||
|
def parse(raw):
|
||||||
|
parsed = []
|
||||||
|
parsing = raw.replace("\n"," ").replace("\t"," ").replace("\r"," ").split(" ")
|
||||||
|
#print("Debug - Parssing step 1:", parsing)
|
||||||
|
parsing = [x.strip() for x in parsing if x.strip() != ""]
|
||||||
|
#print("Debug - Parssing step 2:", parsing)
|
||||||
|
for i in parsing:
|
||||||
|
text = ""
|
||||||
|
for j in i:
|
||||||
|
if j.isalpha():
|
||||||
|
text += j.lower()
|
||||||
|
if text != "":
|
||||||
|
parsed.append(text)
|
||||||
|
#print("Debug - Parssing step 3:", parsed)
|
||||||
|
parsed = [x for x in parsed if x not in get_stopwords()]
|
||||||
|
#print("Debug - Parssing step 4:", parsed)
|
||||||
|
return parsed
|
||||||
|
|
||||||
|
def get_avg_word_len(file):
|
||||||
|
#print("Debug - File:", file)
|
||||||
|
filetxt = open(work_dir + file, "r")
|
||||||
|
raw = filetxt.read()
|
||||||
|
filetxt.close()
|
||||||
|
parsed = parse(raw)
|
||||||
|
#print("Debug - Parsed:", parsed)
|
||||||
|
avg = sum([len(x) for x in parsed]) / len(parsed)
|
||||||
|
#print("Debug - Average:", avg)
|
||||||
|
return avg
|
||||||
|
|
||||||
|
def get_ratio_distinct(file):
|
||||||
|
filetxt = open(work_dir + file, "r").read()
|
||||||
|
distinct = list(set(parse(filetxt)))
|
||||||
|
total = len(parse(filetxt))
|
||||||
|
ratio = len(distinct) / total
|
||||||
|
#print("Debug - Distinct:", ratio)
|
||||||
|
return ratio
|
||||||
|
|
||||||
|
def word_length_ranking(file):
|
||||||
|
filetxt = open(work_dir + file, "r").read()
|
||||||
|
parsed = parse(filetxt)
|
||||||
|
max_length = max([len(x) for x in parsed])
|
||||||
|
#print("Debug - Max length:", max_length)
|
||||||
|
ranking = [[] for i in range(max_length + 1)]
|
||||||
|
for i in parsed:
|
||||||
|
if i not in ranking[len(i)]:
|
||||||
|
ranking[len(i)].append(i)
|
||||||
|
#print("Debug - Adding", i, "to", len(i))
|
||||||
|
for i in range(len(ranking)):
|
||||||
|
ranking[i] = sorted(ranking[i])
|
||||||
|
#print("Debug - Ranking:", ranking)
|
||||||
|
return ranking
|
||||||
|
|
||||||
|
def get_word_set_table(file):
|
||||||
|
str1 = ""
|
||||||
|
data = word_length_ranking(file)
|
||||||
|
for i in range(1, len(data)):
|
||||||
|
cache = ""
|
||||||
|
if len(data[i]) <= 6:
|
||||||
|
cache = " ".join(data[i])
|
||||||
|
else:
|
||||||
|
cache = " ".join(data[i][:3]) + " ... "
|
||||||
|
cache += " ".join(data[i][-3:])
|
||||||
|
if cache != "":
|
||||||
|
str1 += "{:4d}:{:4d}: {}\n".format(i, len(data[i]), cache)
|
||||||
|
else:
|
||||||
|
str1 += "{:4d}:{:4d}:\n".format(i, len(data[i]))
|
||||||
|
return str1.rstrip()
|
||||||
|
|
||||||
|
def get_word_pairs(file, maxsep):
|
||||||
|
filetxt = open(work_dir + file, "r").read()
|
||||||
|
parsed = parse(filetxt)
|
||||||
|
pairs = []
|
||||||
|
for i in range(len(parsed)):
|
||||||
|
for j in range(i+1, len(parsed)):
|
||||||
|
if j - i <= maxsep:
|
||||||
|
pairs.append((parsed[i], parsed[j]))
|
||||||
|
return pairs
|
||||||
|
|
||||||
|
def get_distinct_pairs(file, maxsep):
|
||||||
|
total_pairs = get_word_pairs(file, maxsep)
|
||||||
|
pairs = []
|
||||||
|
for i in total_pairs:
|
||||||
|
cache = sorted([i[0], i[1]])
|
||||||
|
pairs.append((cache[0], cache[1]))
|
||||||
|
return sorted(list(set(pairs)))
|
||||||
|
|
||||||
|
def get_word_pair_table(file, maxsep):
|
||||||
|
pairs = get_distinct_pairs(file, maxsep)
|
||||||
|
#print("Debug - Pairs:", pairs)
|
||||||
|
str1 = " "
|
||||||
|
str1 += str(len(pairs)) + " distinct pairs" + "\n"
|
||||||
|
if len(pairs) <= 10:
|
||||||
|
for i in pairs:
|
||||||
|
str1 += " {} {}\n".format(i[0], i[1])
|
||||||
|
else:
|
||||||
|
for i in pairs[:5]:
|
||||||
|
str1 += " {} {}\n".format(i[0], i[1])
|
||||||
|
str1 += " ...\n"
|
||||||
|
for i in pairs[-5:]:
|
||||||
|
str1 += " {} {}\n".format(i[0], i[1])
|
||||||
|
return str1.rstrip()
|
||||||
|
|
||||||
|
def get_jaccard_similarity(list1, list2):
|
||||||
|
setA = set(list1)
|
||||||
|
setB = set(list2)
|
||||||
|
intersection = len(setA & setB)
|
||||||
|
union = len(setA | setB)
|
||||||
|
if union == 0:
|
||||||
|
return 0.0
|
||||||
|
else:
|
||||||
|
return intersection / union
|
||||||
|
|
||||||
|
def get_word_similarity(file1, file2):
|
||||||
|
file1txt = open(work_dir + file1, "r").read()
|
||||||
|
    file2txt = open(work_dir + file2, "r").read()
    parsed1 = parse(file1txt)
    parsed2 = parse(file2txt)
    return get_jaccard_similarity(parsed1, parsed2)


def get_word_similarity_by_length(file1, file2):
    # Jaccard similarity of the two documents' word sets, one value per word length.
    word_by_length_1 = word_length_ranking(file1)
    word_by_length_2 = word_length_ranking(file2)
    similarity = []
    for i in range(1, max(len(word_by_length_1), len(word_by_length_2))):
        if i < len(word_by_length_1) and i < len(word_by_length_2):
            similarity.append(get_jaccard_similarity(word_by_length_1[i], word_by_length_2[i]))
        else:
            similarity.append(0.0)
    return similarity


def get_word_similarity_by_length_table(file1, file2):
    # Format the per-length similarities as "length: value" lines.
    similarity = get_word_similarity_by_length(file1, file2)
    str1 = ""
    for i in range(len(similarity)):
        str1 += "{:4d}: {:.4f}\n".format(i + 1, similarity[i])
    return str1.rstrip()


def get_word_pairs_similarity(file1, file2, maxsep):
    # Jaccard similarity between the distinct word-pair sets of the two documents.
    pairs1 = get_distinct_pairs(file1, maxsep)
    pairs2 = get_distinct_pairs(file2, maxsep)
    return get_jaccard_similarity(pairs1, pairs2)


if __name__ == "__main__":
    # Debugging
    # file1st = "cat_in_the_hat.txt"
    # file2rd = "pulse_morning.txt"
    # maxsep = 2

    # s = " 01-34 can't 42weather67 puPPy, \r \t and123\n Ch73%allenge 10ho32use,.\n"
    # print(parse(s))
    # get_avg_word_len(file1st)
    # get_ratio_distinct(file1st)
    # print(word_length_ranking(file1st)[10])
    # print(get_word_set_table(file1st))

    # Get user input
    file1st = input("Enter the first file to analyze and compare ==> ").strip()
    print(file1st)
    file2rd = input("Enter the second file to analyze and compare ==> ").strip()
    print(file2rd)
    maxsep = int(input("Enter the maximum separation between words in a pair ==> ").strip())
    print(maxsep)

    files = [file1st, file2rd]
    for i in files:
        print("\nEvaluating document", i)
        print("1. Average word length: {:.2f}".format(get_avg_word_len(i)))
        print("2. Ratio of distinct words to total words: {:.3f}".format(get_ratio_distinct(i)))
        print("3. Word sets for document {}:\n{}".format(i, get_word_set_table(i)))
        print("4. Word pairs for document {}\n{}".format(i, get_word_pair_table(i, maxsep)))
        print("5. Ratio of distinct word pairs to total: {:.3f}".format(len(get_distinct_pairs(i, maxsep)) / len(get_word_pairs(i, maxsep))))

    print("\nSummary comparison")
    avg_word_length_ranking = []
    for i in files:
        length = get_avg_word_len(i)
        avg_word_length_ranking.append((i, length))
    avg_word_length_ranking = sorted(avg_word_length_ranking, key=lambda x: x[1], reverse=True)
    print("1. {} on average uses longer words than {}".format(avg_word_length_ranking[0][0], avg_word_length_ranking[1][0]))
    print("2. Overall word use similarity: {:.3f}".format(get_word_similarity(file1st, file2rd)))
    print("3. Word use similarity by length:\n{}".format(get_word_similarity_by_length_table(file1st, file2rd)))
    print("4. Word pair similarity: {:.4f}".format(get_word_pairs_similarity(file1st, file2rd, maxsep)))
```
204
content/zh-cn/posts/wordpress/cc-attack-on-index-php/index.md
Normal file
@@ -0,0 +1,204 @@
---
title: DDoS Attacks Targeting WordPress Behavior: Principles and Discussion
subtitle:
date: 2024-04-13T13:12:44-04:00
slug: cc-attack-on-index-php
draft: false
author:
  name: James
  link: https://www.jamesflare.com
  email:
  avatar: /site-logo.avif
description: This blog post explores the principles and challenges of a specific DDoS attack against WordPress instances, which bypasses caching by requesting non-existent paths, and discusses possible defensive and offensive strategies from the blue-team and red-team perspectives.
keywords: ["DDoS", "WordPress", "Nginx", "CloudFlare", "IPv6"]
license:
comment: true
weight: 0
tags:
  - WordPress
  - Nginx
  - WAF
categories:
  - Security
  - Discussion
hiddenFromHomePage: false
hiddenFromSearch: false
hiddenFromRss: false
hiddenFromRelated: false
summary: This blog post explores the principles and challenges of a specific DDoS attack against WordPress instances, which bypasses caching by requesting non-existent paths, and discusses possible defensive and offensive strategies from the blue-team and red-team perspectives.
resources:
  - name: featured-image
    src: featured-image.jpg
  - name: featured-image-preview
    src: featured-image-preview.jpg
toc: true
math: true
lightgallery: false
password:
message:
repost:
  enable: true
  url:

# See details front matter: https://fixit.lruihao.cn/documentation/content-management/introduction/#front-matter
---

<!--more-->
## Background

In [Upgrading site defenses again: fail2ban with Cloudflare for smart blocking of malicious attacks](https://blog.kejilion.pro/fail2ban-cloudflare/), KEIJILION mentions a DDoS technique aimed at WordPress instances: requesting a path that does not exist so that the cache is bypassed and the server itself takes the hit. He never really explains why this works, so I will try to lay out the principle here.

First, a Layer 7 DDoS attack needs the target server to actually run some program or code; that is what consumes its resources. In other words, our requests must not be intercepted before they reach WordPress, because a WAF layer such as Nginx can reject a request very cheaply; trying to exhaust those layers with sheer request volume would be enormously expensive and well beyond what an ordinary attacker can manage.

Second, the requests must not hit a cache: if a CDN or similar cache answers them, WordPress never takes part in serving the request and the goal of draining the origin's resources cannot be met. Finally, the attacker has to avoid getting their IP, User-Agent, and so on banned, for the same reason the requests must not be intercepted before reaching WordPress.
## The Challenge

So why does the "404" attack KEIJILION mentions end up punching through the cache and exhausting server resources?

### Behavior

First, what exactly is this "404" attack? It simply generates random URL paths, for example:

```text
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7
https://www.jamesflare.com/en/mvQ3oX3NJRCfy8LBdWdL
https://www.jamesflare.com/en/AK3VdReDX4AKmAYanV9j
https://www.jamesflare.com/en/2Msmu2zDGwA4Fd4hDroF
https://www.jamesflare.com/en/crq8KXvMaFphdYhGNaFA
```

The point is to punch through the cache so that every request reaches the WordPress instance.
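
To make the behavior concrete, here is a minimal sketch of what such a cache-busting request generator could look like. The target URL, path length, and request count are placeholder assumptions of mine, not anything from KEIJILION's post, and the snippet exists purely to illustrate the mechanism, not to be pointed at other people's sites.

```python
# Sketch only: generate random, never-cached paths against a hypothetical target.
# Do not run this against servers you do not own or have permission to test.
import random
import string

import requests  # third-party; pip install requests

TARGET = "https://example.com/en/"  # placeholder target, not a real victim


def random_path(length=20):
    """Return a random alphanumeric path segment, e.g. 'XXbwQMzBFL27zizGAeh7'."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))


for _ in range(10):  # a handful of requests, for illustration only
    url = TARGET + random_path()
    r = requests.get(url, timeout=10)
    print(r.status_code, url)  # typically 404, yet each one is built by index.php
```

Every path is new, so no CDN or page cache can have it, and each request forces the origin to do real work.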
### Nginx Rewrite Rule

You might ask: doesn't the server simply return a 404, so what's the problem? First, understand that even returning a 404 has a cost, and that cost differs from application to application. For a static site served by Nginx, the 404 is produced as soon as Nginx fails to find the requested file on disk, which is very cheap.

In WordPress, however, it is not so cheap, and that comes down to how WordPress routes requests. Every WordPress page has to be handled by `index.php`; if you have ever set it up, you have probably written a rewrite rule much like this:

```nginx
# enforce NO www
if ($host ~* ^www\.(.*))
{
    set $host_without_www $1;
    rewrite ^/(.*)$ $scheme://$host_without_www/$1 permanent;
}

# unless the request is for a valid file, send to bootstrap
if (!-e $request_filename)
{
    rewrite ^(.+)$ /index.php?q=$1 last;
}
```

The upshot is that the request is handed over to `index.php`:
```text
# What the browser requests
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7
https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7.jpg
# What WordPress sees
https://www.jamesflare.com/index.php/en/XXbwQMzBFL27zizGAeh7
https://www.jamesflare.com/index.php?q=XXbwQMzBFL27zizGAeh7.jpg
```

```mermaid
flowchart TD
    Browser --https://www.jamesflare.com/en/XXbwQMzBFL27zizGAeh7--> Nginx --https://www.jamesflare.com/index.php/en/XXbwQMzBFL27zizGAeh7--> WordPress
```
### Performance

Now the pressure lands on WordPress. It first checks the database for a post at that path, and only after finding nothing does it return a 404, at which point it knits together the HTML of the 404 page like someone knitting a sweater. If your 404 page is on the fancy side and pulls in a lot of resources, the cost is even higher. It amounts to roughly one dynamic page view of a real article (though probably still a bit cheaper).

People also tend to overestimate their VPS. The vast majority of VPSes perform poorly; three or four recycled server cores may not match half a core of your laptop, and without caching the site may fall over at a few dozen rqs.

{{< link href="../netcup-arm-review" content="netcup vServer (ARM64) Benchmarks and Review" title="netcup vServer (ARM64) Benchmarks and Review" card=true >}}

Even that 18-core VPS only reaches the level of an AMD Ryzen 7 7840U, and that is a conservative comparison: my laptop also runs a Ryzen 7 7840U and scores about 40% higher than the Geekbench 6 database figure. The database multi-core score is 8718 while I measured 12127, so the 18-core VPS lands at roughly 8650.
## Blue Team

My guess is that KEIJILION's answer to this `index.php` behavior is to scan the Nginx logs for IPs that trigger 404s and other abnormal status codes, push them onto a CloudFlare blocklist through the API, and release them an hour later. The log looks roughly like this:

```text
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain1.com/?p=1" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
```

That goes a long way toward mitigating this vulnerability (which I honestly think is better described as a feature).
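
As a rough illustration of that blue-team idea, a small script could tally 404s per client IP from the access log and hand repeat offenders to a ban step. The log path, threshold, and the `ban()` stub below are my own assumptions; the setup described above pairs fail2ban with the CloudFlare API rather than this exact code.

```python
# Sketch: count 404 responses per IP in an Nginx access log and flag heavy offenders.
# Log path, threshold, and the ban() stub are illustrative assumptions.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumed default location
THRESHOLD = 50                          # 404s before we consider banning

# Matches the combined log format shown above: client IP first, status after the request.
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

counts = Counter()
with open(LOG_PATH) as f:
    for line in f:
        m = line_re.match(line)
        if m and m.group(2) == "404":
            counts[m.group(1)] += 1


def ban(ip):
    # Placeholder: in practice this step would go through fail2ban or the
    # CloudFlare IP Access Rules API to block the address for an hour.
    print("would ban", ip)


for ip, n in counts.items():
    if n >= THRESHOLD:
        ban(ip)
```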
## Red Team

So where are the gaps in this scheme? Let's lay out the logic first. The specialized part of the defense is to

1. Spot abnormal HTTP status codes
2. Ban the offending IP

Beyond this extra patch, the usual security measures are

1. Rate limiting
2. CloudFlare security rules
    1. IP address risk scoring
    2. Browser fingerprinting
    3. User-Agent checks
    4. Human verification challenges
3. The origin blocking any IP that is not CloudFlare's
### Objective

Our own line of attack is now clear: find some way to make our requests land on the site's dynamic resources. Breaking the task down:

1. Find a dynamic resource
2. Get past CloudFlare's security measures
3. Send requests at a rate that is not overly aggressive

First, keep in mind that the average VPS is not powerful and WordPress is not as efficient as people imagine; without caching, handling a few dozen to a hundred rqs is already a lot, and collapsing at a dozen or so rqs is entirely possible. The people blasting thousands or tens of thousands of rqs, with only a handful of IPs to spread the traffic over, are exactly the ones who get banned first.
### Human-wave tactics

Let me toss out a few ideas to get the discussion going; with the red-team hat on, here is a first helping of mischief. The crudest approach is to bring a mountain of IP addresses and keep running the same "404" attack, switching to a fresh IP for every request if need be. You might object: that's impossible, an hour of this would need tens or even hundreds of thousands of addresses, and the DDoS attacks that made history used barely ten thousand. Even renting IPs costs at least a dollar apiece.

You're right, and also not right. That is how IPv4 works, but what about IPv6? Many VPS plans hand you a /48 subnet at purchase, and even a stingy /64 is unimaginably large.

|Prefix length|Example prefix|Address range (first – last)|
|-|-|-|
|32|2001:db8::/32|2001:0db8:0000:0000:0000:0000:0000:0000 – 2001:0db8:ffff:ffff:ffff:ffff:ffff:ffff|
|40|2001:db8:ab00::/40|2001:0db8:ab00:0000:0000:0000:0000:0000 – 2001:0db8:abff:ffff:ffff:ffff:ffff:ffff|
|48|2001:db8:abcd::/48|2001:0db8:abcd:0000:0000:0000:0000:0000 – 2001:0db8:abcd:ffff:ffff:ffff:ffff:ffff|
|56|2001:db8:abcd:1200::/56|2001:0db8:abcd:1200:0000:0000:0000:0000 – 2001:0db8:abcd:12ff:ffff:ffff:ffff:ffff|
|64|2001:db8:abcd:1234::/64|2001:0db8:abcd:1234:0000:0000:0000:0000 – 2001:0db8:abcd:1234:ffff:ffff:ffff:ffff|

Conveniently ignoring reserved addresses, a /64 prefix leaves another 64 bits of address space, i.e. 2^64 IP addresses. That is an astronomical number; I am not sure there are that many grains of sand on Earth. If the banning strategy stays one-address-at-a-time, blocking them all is simply impossible.
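
To make that scale concrete, here is a tiny sketch that samples random host addresses out of a single /64. The prefix is the 2001:db8::/64 documentation range, used purely as an example.

```python
# Sketch: sample random host addresses from one /64 prefix.
# 2001:db8::/64 is the IPv6 documentation prefix, used here only as an example.
import ipaddress
import random

net = ipaddress.IPv6Network("2001:db8::/64")
print(net.num_addresses)  # 18446744073709551616, i.e. 2**64

for _ in range(5):
    # Pick a random 64-bit host part and attach it to the network address.
    host = random.getrandbits(64)
    print(ipaddress.IPv6Address(int(net.network_address) + host))
```

Each draw is a perfectly valid, never-before-seen source address inside the same subnet, which is exactly what makes per-address banning hopeless.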
You might say this trick is outrageous and only works because the defender has not caught on; once they start banning subnet by subnet, you are in trouble. True, but you can mix subnets of different sizes, and banning whole subnets locks out an enormous number of addresses, which is a very poor choice for the defender. Besides, renting a slice of someone's /32 is not out of the question either: if a friendly provider hands you a subnet sitting just below their /32, the defender suddenly has a much harder decision to make.
### Micromanagement

Alright, time for a different flavor of mischief. Since it does not take much request volume to overwhelm WordPress, and the performance gap is huge, we can reach for fancier tools: directly simulating a browser to get past CloudFlare's challenges is not out of reach.

[ultrafunkamsterdam/undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)

We can use this patched Selenium Chromedriver to slip past CloudFlare's CAPTCHA, User-Agent, and browser-fingerprint checks.

Then pick a dynamic endpoint, for example typing random strings into the search box. Combined with the IPv6 human-wave tactic, a few dozen rqs is enough to push the site into a performance crisis. Running that many Selenium Chromedriver instances does burn some resources, but doing it from a single laptop is not particularly hard. For the blue team, though, it is a nightmare: everything they see looks perfectly normal, with each IP address visiting once every half hour, hour, or even several hours, and some IPs never coming back at all. You would sooner wonder whether your site had gone viral somewhere than suspect an attack.
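
A bare-bones sketch of that idea with undetected-chromedriver might look like the following. The target URL and the use of WordPress's `?s=` search parameter are illustrative assumptions on my part; this only shows the shape of the technique, again not something to aim at sites you do not own.

```python
# Sketch: drive a real Chrome via undetected-chromedriver and hit the
# (uncacheable) WordPress search endpoint with random queries.
# pip install undetected-chromedriver
import random
import string
import time

import undetected_chromedriver as uc

TARGET = "https://example.com"  # placeholder, not a real victim

driver = uc.Chrome()  # launches a patched Chrome that evades common bot checks
try:
    for _ in range(3):  # deliberately tiny request count
        query = "".join(random.choices(string.ascii_lowercase, k=12))
        # WordPress routes ?s=... searches through index.php and the database.
        driver.get(f"{TARGET}/?s={query}")
        time.sleep(random.uniform(30, 120))  # look like a human, not a flood
finally:
    driver.quit()
```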
## Summary

KEIJILION described a DDoS technique aimed at WordPress instances: request a path that does not exist, bypass the cache, and burn server resources. The attack works because:

1. Every WordPress page request is handled by index.php, so even a 404 page costs real resources. Many VPSes have limited performance, and an uncached WordPress may only sustain a few dozen rqs.

2. By requesting random URL paths, the attacker punches through the cache and reaches the WordPress backend directly, sidestepping defenses such as the CDN and WAF.

3. The attacker avoids getting their IP, User-Agent, and the like banned, in order to evade detection.

KEIJILION likely mitigates this by scanning the Nginx logs for abnormal status codes and banning the corresponding IPs. That scheme still has gaps:

1. An attacker can exploit IPv6's enormous address space, which is practically impossible to ban address by address.

2. Requests issued from a simulated real browser can bypass CloudFlare's CAPTCHA, fingerprinting, and similar checks, making the attack look like normal traffic.

3. Because of WordPress's performance bottleneck, a low-rate attack of only a few dozen rqs is enough to do damage, and it rarely draws attention.

In short, this DDoS technique leverages characteristics of the WordPress architecture and is hard to defend against completely. Beyond improving WordPress performance, site operators need more comprehensive monitoring and defenses.
@@ -297,7 +297,7 @@
{{- with $giscus -}}
{{- $commentConfig = .lightTheme | default "light" | dict "lightTheme" | dict "giscus" | merge $commentConfig -}}
{{- $commentConfig = .darkTheme | default "dark" | dict "darkTheme" | dict "giscus" | merge $commentConfig -}}
<div id="giscus">
<div id="giscus" class="comment">
<script
src="https://giscus.app/client.js"
data-repo="{{ .Repo }}"
@@ -325,7 +325,7 @@
{{- end -}}
</div>
{{- /* lightgallery for Artalk and Twikoo */ -}}
{{- $params := .Scratch.Get "params" -}}
{{- $params := partial "function/params.html" -}}
{{- if not $params.lightgallery | and (($artalk.enable | and $artalk.lightgallery) | or ($twikoo.enable | and $twikoo.lightgallery)) -}}
{{- $source := $cdn.lightgalleryCSS | default "lib/lightgallery/css/lightgallery-bundle.min.css" -}}
{{- dict "Source" $source "Fingerprint" $fingerprint | dict "Scratch" .Scratch "Data" | partial "scratch/style.html" -}}
Submodule themes/FixIt updated: dc0b96210a...ff8e5fc7bf