Ranking Web Pages

recursive: defined in terms of itself

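We build a crawler. The crawler follows links. The result of crawling is an index: looking up a keyword gives an entry that holds the keyword and a list of URLs. The first URL in that list should be the most popular one.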
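
A minimal sketch of the index shape described above, assuming it is a Python dictionary that maps each keyword to a list of URLs. The helper names `add_to_index` and `lookup` and the example.com URLs are illustrative and do not come from these notes.

```python
# Sketch: the index maps keyword -> list of URLs that contain the keyword.
def add_to_index(index, keyword, url):
    """Record that `url` contains `keyword`, skipping duplicate URLs."""
    urls = index.setdefault(keyword, [])
    if url not in urls:
        urls.append(url)


def lookup(index, keyword):
    """Return the list of URLs for `keyword`, or an empty list if unknown."""
    return index.get(keyword, [])


index = {}
add_to_index(index, 'nickel', 'http://example.com/a.html')
add_to_index(index, 'nickel', 'http://example.com/b.html')
add_to_index(index, 'nickel', 'http://example.com/a.html')  # duplicate, ignored

# URLs come back in the order they were added; ranking is what should
# eventually decide which one belongs first.
print(lookup(index, 'nickel'))
```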

popularity

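Think of a group of friends. Some are popular and some are not. A considering B a friend does not mean that B considers A a friend, so friendship is one-directional. Simply counting friends is not a good measure of popularity, because it ignores how popular those friends are themselves. (Still fuzzy on this.)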

# popularity(person) => number of people who are friends with person
# bob --> alice <-- charlie
# popularity(alice) # => 2, but is this a good measure?
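
The counting definition in the comments above can be sketched as follows. The dictionary `names_as_friend` and the function name `naive_popularity` are made up for illustration; each person maps to the people they name as friends.

```python
# Hypothetical data: bob and charlie each name alice as a friend.
names_as_friend = {
    'bob': ['alice'],
    'charlie': ['alice'],
    'alice': [],
}


def naive_popularity(person):
    """Count how many people name `person` as a friend (one-directional)."""
    return sum(1 for named in names_as_friend.values() if person in named)


print(naive_popularity('alice'))  # 2
print(naive_popularity('bob'))    # 0
```

Every incoming friendship counts the same here, no matter how popular the person offering it is, which is the weakness the recursive version tries to address.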

def popularity(step, person):
    if step == 0:
        return 1
    else:
        score = 0
        for f in friends(person):
            score = score + popularity(step-1, f)
        return score
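
The recursive definition above can be tried on a small example. This is a sketch that assumes `friends(person)` returns the people who name `person` as a friend; that helper and the sample data are hypothetical, since the notes do not define them.

```python
# Hypothetical data: each person maps to the people they name as friends.
# Note the cycle: bob names alice, and alice names bob.
names_as_friend = {
    'bob': ['alice'],
    'charlie': ['alice'],
    'alice': ['bob'],
}


def friends(person):
    """People who name `person` as a friend (assumed meaning of friends)."""
    return [p for p, named in names_as_friend.items() if person in named]


def popularity(step, person):
    """Popularity after `step` rounds; everyone scores 1 at step 0."""
    if step == 0:
        return 1
    score = 0
    for f in friends(person):
        score = score + popularity(step - 1, f)
    return score


for step in range(3):
    print(step, popularity(step, 'alice'), popularity(step, 'bob'))
```

The `step` argument is what lets the recursion stop even though the friendships form a cycle: each call moves one step closer to the base case.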

graph = {
    'A':['B','C'],
    'B':[],
    'C':['B']
}

Quiz: implementing Urank

Modify the crawl_web procedure so that instead of returning just the index, it returns an index and a graph. The graph should be a dictionary whose entries have the form url: [url, url, ...].

def crawl_web(seed): # returns index, graph of outlinks
    tocrawl = [seed]
    crawled = []
    graph = {}  # <url>:[list of pages it links to]
    index = {} 
    while tocrawl: 
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)  # fetch the HTML of page (a URL)
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)  # collect every link in the HTML into a list

            graph[page] = outlinks

            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links
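
The helpers that crawl_web relies on (get_page, get_next_target, union, add_page_to_index) are not shown in these notes. The sketch below fills them in with simple stand-ins so the crawler can run offline; the `FAKE_WEB` dictionary, the example.com URLs, and the helper bodies are all assumptions made for illustration.

```python
# Hypothetical in-memory "web" so the crawler runs without network access.
FAKE_WEB = {
    'http://example.com/index.html': (
        '<a href="http://example.com/a.html">a</a> '
        '<a href="http://example.com/b.html">b</a>'
    ),
    'http://example.com/a.html': 'nickel <a href="http://example.com/b.html">b</a>',
    'http://example.com/b.html': 'zinc nickel',
}


def get_page(url):
    """Stand-in for fetching a page: look the URL up in FAKE_WEB."""
    return FAKE_WEB.get(url, '')


def get_next_target(page):
    """Return (url, end position) of the next link, or (None, 0) if none."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links


def union(a, b):
    """Append to list a every element of b that a does not already hold."""
    for e in b:
        if e not in a:
            a.append(e)


def add_page_to_index(index, url, content):
    """Index every whitespace-separated token of content under url."""
    for word in content.split():
        urls = index.setdefault(word, [])
        if url not in urls:
            urls.append(url)


def crawl_web(seed):  # returns index, graph of outlinks
    tocrawl = [seed]
    crawled = []
    graph = {}  # <url>: [list of pages it links to]
    index = {}
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph


index, graph = crawl_web('http://example.com/index.html')
print(graph)
```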

...


index, graph = crawl_web('http://udacity.com/cs101x/urank/index.html')

print(graph)
"""
{
    'http://udacity.com/cs101x/urank/kathleen.html': [],
    'http://udacity.com/cs101x/urank/zinc.html': [
        'http://udacity.com/cs101x/urank/nickel.html',
        'http://udacity.com/cs101x/urank/arsenic.html'
    ],
    'http://udacity.com/cs101x/urank/hummus.html': [], 
    'http://udacity.com/cs101x/urank/arsenic.html': [
        'http://udacity.com/cs101x/urank/nickel.html'
    ], 
    'http://udacity.com/cs101x/urank/index.html': [
        'http://udacity.com/cs101x/urank/hummus.html', 
        'http://udacity.com/cs101x/urank/arsenic.html', 
        'http://udacity.com/cs101x/urank/kathleen.html', 
        'http://udacity.com/cs101x/urank/nickel.html', 
        'http://udacity.com/cs101x/urank/zinc.html'
    ], 
    'http://udacity.com/cs101x/urank/nickel.html': [
        'http://udacity.com/cs101x/urank/kathleen.html'
    ]
}
"""
print(graph['http://udacity.com/cs101x/urank/nickel.html'])
#['http://udacity.com/cs101x/urank/kathleen.html']

Computing Page Rank

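Scores are assigned in a highly relative way: a page's rank depends on the ranks of the pages that link to it. I don't yet understand how this is actually computed.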
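
The relative scoring can be written as a recurrence. The form below is the usual simplified PageRank formulation and is an assumption here, since these notes do not state it: d is the damping factor, N is the number of pages, inlinks(u) is the set of pages that link to u, and |outlinks(p)| is the number of links on page p.

```latex
\begin{align*}
\mathrm{rank}_0(u) &= \frac{1}{N} \\
\mathrm{rank}_t(u) &= \frac{1-d}{N}
  + d \sum_{p \in \mathrm{inlinks}(u)}
    \frac{\mathrm{rank}_{t-1}(p)}{\lvert \mathrm{outlinks}(p) \rvert}
\end{align*}
```

Each round uses only the ranks from the previous round, which is why the computation is repeated a fixed number of times rather than solved in one pass.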

compute_ranks(graph) # graph is the one returned by crawl_web

lookupBest(keyword, index, ranks) # => best page

index: holds, for every keyword, all the pages that contain it.
keyword: the term to search for.
ranks: the rank of each page.
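
A possible shape for the lookup described above, assuming index is a dictionary of keyword -> list of URLs and ranks is a dictionary of URL -> score. The snake_case name `lookup_best` and the sample data are illustrative; the notes only give the signature.

```python
def lookup_best(keyword, index, ranks):
    """Return the highest-ranked page that contains keyword, or None."""
    pages = index.get(keyword, [])
    if not pages:
        return None
    # Pages missing from ranks are treated as having rank 0.
    return max(pages, key=lambda url: ranks.get(url, 0.0))


# Hypothetical sample data.
index = {'nickel': ['http://example.com/a.html', 'http://example.com/b.html']}
ranks = {'http://example.com/a.html': 0.1, 'http://example.com/b.html': 0.3}

print(lookup_best('nickel', index, ranks))  # http://example.com/b.html
print(lookup_best('zinc', index, ranks))    # None
```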

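def compute_ranks(graph):
    d = 0.8  # damping factor: share of rank passed along links (not yet clear to me why it is needed)
    numloops = 10  # number of passes; each pass recomputes every rank from the previous pass

    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages  # initial value (plain assignment, not +=): every page starts equal

    for i in range(0, numloops):  # until the missing step below is written, every pass gives the same ranks
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            #
            # Quiz: the missing step goes here (add the contributions of pages that link to this page).
            #

            newranks[page] = newrank
        ranks = newranks
    return ranks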
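
One way to fill in the missing step in compute_ranks, sketched under the assumption that each page passes a d-weighted share of its previous rank equally to every page it links to. This is the usual simplified PageRank update, not necessarily the quiz's official answer. It is run on the small A/B/C graph from earlier in these notes.

```python
def compute_ranks(graph, d=0.8, numloops=10):
    """Iteratively compute a rank for every page in graph (url -> outlinks)."""
    npages = len(graph)
    # Every page starts with an equal share.
    ranks = {page: 1.0 / npages for page in graph}

    for _ in range(numloops):
        newranks = {}
        for page in graph:
            # Base share that every page receives regardless of links.
            newrank = (1 - d) / npages
            # Add a contribution from each node that links to this page.
            for node in graph:
                if page in graph[node]:
                    newrank += d * ranks[node] / len(graph[node])
            newranks[page] = newrank
        ranks = newranks
    return ranks


graph = {
    'A': ['B', 'C'],
    'B': [],
    'C': ['B'],
}

ranks = compute_ranks(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

On this graph B ends up ranked highest, then C, then A. The values do not sum to 1 because B has no outgoing links, so the rank it holds is never passed on.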
