How to Manage Data

Structured Data

String: a sequence of characters

s = 'yabba'
s[0] # -> 'y'
s[2:4] # -> 'bb'

List: a sequence of anything

p = ['y','a','b','b','a']
p[0] # -> 'y'
p[2:4] # -> ['b','b']

<list> -> [<expression>, <expression>, ...]

Nested Lists

mixed_up = ['apple',3,'banana',27, [1,2,['alpha','beta']]]

beatles = [['John', 1940], ['Paul',1942],
           ['George',1943], ['Ringo', 1940]]

print(beatles[0][0]) # -> 'John'
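Nested indexing works one level at a time, from the outside in. A small sketch using the `mixed_up` list from above; the names `inner` and `innermost` are only there for illustration:

```python
mixed_up = ['apple', 3, 'banana', 27, [1, 2, ['alpha', 'beta']]]

inner = mixed_up[4]        # [1, 2, ['alpha', 'beta']]
innermost = inner[2]       # ['alpha', 'beta']
print(innermost[1])        # beta
print(mixed_up[4][2][1])   # beta, the same lookup written as one expression
print(len(mixed_up))       # 5, the nested list counts as a single element
```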

Mutation

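Strings are immutable; lists are mutable.

s = 'Yello'
# s[0] = 'H' # raises TypeError: 'str' object does not support item assignment
print(s) # => 'Yello'
# A string is sequence-like, so it can be indexed like a list,
# but its contents cannot be changed in place (immutable).

s = s + 'w'
# Reassignment is allowed: s now names a new string.
# 'Yellow'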

p = ['H','e','l','l','o']
q = p # q refers to the same list as p.
p[0] = 'Y'
# An element can be changed in place through its index.

print(p)
# => ['Y','e','l','l','o']

print(q)
# p and q name the same list, so the change made through p is visible through q.
# => ['Y','e','l','l','o']

q.append('w')
# q and p refer to the same list, so modifying it through q also changes what p shows.
# q is just another name for the list, and so is p: both are names for one list.

print(p)
# => ['Y','e','l','l','o','w']
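The aliasing described above can be checked with `is`, and avoided by copying. A minimal sketch; the names `original`, `alias`, and `snapshot` are made up for this example:

```python
original = ['H', 'e', 'l', 'l', 'o']
alias = original            # a second name for the same list object
snapshot = list(original)   # a new list holding the same elements

original[0] = 'Y'

print(alias)                 # ['Y', 'e', 'l', 'l', 'o']
print(snapshot)              # ['H', 'e', 'l', 'l', 'o']
print(alias is original)     # True
print(snapshot is original)  # False
```

If an independent list is needed, copy it first; plain assignment never copies.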

List Operations

<list>.append(<element>)

stooges = ['Moe','Larry','Curly']
stooges.append('Shemp')
print(stooges) # => ['Moe','Larry','Curly','Shemp']

names = ["Minwoo", "Kuma"]
stooges.append(names)
print(stooges) # => ['Moe','Larry','Curly','Shemp', ['Minwoo', 'Kuma']]

<list> + <list>

[0,1]+[2,3] # -> [0,1,2,3]
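The difference between `+` and `append` is easy to mix up, so here is a short comparison. The variable names `base` and `combined` are assumptions for this sketch:

```python
base = [0, 1]
combined = base + [2, 3]   # + builds a new list; base is left unchanged
print(base)                # [0, 1]
print(combined)            # [0, 1, 2, 3]

base.append([2, 3])        # append adds its argument as ONE element, in place
print(base)                # [0, 1, [2, 3]]
print(len(base))           # 3
```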

len(<list>)

#list
len([0,1]) # => 2
len(['a',['b',['c']]]) # => 2
r = range(1,5) # [1,2,3,4]
len(r) # => 4

# string
# Works the same way because a string is also a sequence.
len('Udacity') # => 7

List Loop

while

def find_elements(arr):
    i = 0
    while i < len(arr): # == (i <= len(arr)-1)
        name = arr[i]
        if name == "minwoo": break
        i = i + 1

    return name

names = ["tom", "david", "hue", "minwoo"]
print(find_elements(names)) # => minwoo

for

def print_all_elements(arr):
    for e in arr:
        print(e)

names = ["tom", "david", "hue", "minwoo"]
print_all_elements(names)
# >>> tom
# >>> david
# >>> hue
# >>> minwoo

<list>.index(<value>)

stooges = ['Moe','Larry','Curly']
stooges.index('Moe') # -> 0
'Moe' in stooges # -> True
'Moe' not in stooges # -> False
# Same as: not 'Moe' in stooges

<list>.pop() / <list>.pop(<index>)

arr = ['apple', 'banana', 'cherry', 'dragon fruit']
arr.pop() # -> 'dragon fruit' / arr -> ['apple', 'banana', 'cherry']
arr.pop(0) # -> 'apple' / arr -> ['banana', 'cherry']
pop_element = arr.pop() # -> 'cherry' / arr -> ['banana']
arr.append(pop_element) # arr -> ['banana', 'cherry']
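`append` and `pop` together let a list act as a stack, which is the pattern the crawler later in these notes relies on. A rough sketch with made-up page names; `pending` and `visited` are hypothetical names:

```python
pending = ['a.html', 'b.html', 'c.html']
visited = []

while pending:                     # an empty list is falsy, so the loop ends by itself
    visited.append(pending.pop())  # pop() removes and returns the last element

print(visited)  # ['c.html', 'b.html', 'a.html']
print(pending)  # []
```

Calling `pop()` on an empty list raises `IndexError`, which is why the loop tests `pending` first.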

Quiz: Find Element

Define a procedure, find_element, that takes as its inputs a list and a value of any type, and returns the index of the first element in the input list that matches the value. If there is no matching element, return -1.

# Minwoo's code
def find_element(arr, target):
    i = 0
    for el in arr: # loop over the elements of the list
        if target == el: return i
        i = i + 1
    return -1

# while
def find_element(arr, target):
    i = 0
    while i < len(arr): # bounding the loop by len(arr) keeps i a valid index
        if arr[i] == target: return i
        i = i + 1
    return -1

# index()
def find_element(arr, target):
    if target in arr:
        return arr.index(target)
    else:
        return -1

def test():
    assert find_element([1,2,3],3) == 2
    assert find_element(['alpha','beta'],'gamma') == -1
    print("Test Finish")

test() # => "Test Finish"
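One more way to write the same procedure, not from the lecture: `enumerate` hands out the index and the element together, so no manual counter is needed. The name `find_element_enumerate` is hypothetical, chosen to avoid clashing with the versions above:

```python
def find_element_enumerate(items, target):
    # enumerate yields (index, element) pairs, starting at index 0
    for index, item in enumerate(items):
        if item == target:
            return index   # first match wins
    return -1              # no element matched

print(find_element_enumerate([1, 2, 3], 3))                  # 2
print(find_element_enumerate(['alpha', 'beta'], 'gamma'))    # -1
print(find_element_enumerate(['a', 'b', 'a'], 'a'))          # 0
```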

Quiz: union

Define a procedure, union(), that takes as inputs 2 lists. It should modify the first input list to be the set union of the two lists.

# Minwoo's code
def union(a,b):
    for el in b:
        if el in a:
            continue # used on purpose
        else:
            a.append(el)

a = [1,2,3]
b = [3,4,5]
union(a,b)

def test():
    assert a == [1,2,3,4,5]
    print("Test Finish")

test()
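Note that this kind of `union` works by mutation: it changes its first argument and returns nothing. A short sketch of that behavior, using a condensed version of the procedure above and the assumed names `first`, `second`, and `result`:

```python
def union(a, b):
    # Append each element of b that a does not already contain.
    for el in b:
        if el not in a:
            a.append(el)

first = [1, 2, 3]
second = [3, 4, 5]
result = union(first, second)

print(first)   # [1, 2, 3, 4, 5]
print(second)  # [3, 4, 5]
print(result)  # None
```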

How Computers Store Data

This time we learn how a computer actually stores data. You do not need this knowledge to use lists, but it will certainly² give you an appreciation¹ of what goes on inside the computer.

YouTube lecture link

Units of Data

1 byte equals 8 bits. Think of 1 bit as the light switch shown in the video: it has two states (0 or 1, true or false).

2 GB of RAM holds (2 ** 30) * 2 * 8 bits, comparable³ to about 17 billion light switches.

print(2 ** 10)
# => 1024 (kilobyte)

print(2 ** 20)
# => 1048576 (megabyte)

print(2 ** 30)
# => 1073741824 (gigabyte)

print(2 ** 40)
# => 1099511627776 (terabyte)
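The bit counts quoted in this section follow directly from these powers of two. A quick check, with constant names that are my own choice:

```python
BITS_PER_BYTE = 8
GIGABYTE = 2 ** 30   # bytes
TERABYTE = 2 ** 40   # bytes

ram_bits = 2 * GIGABYTE * BITS_PER_BYTE    # 2 GB of RAM
disk_bits = 1 * TERABYTE * BITS_PER_BYTE   # 1 TB hard drive

print(ram_bits)    # 17179869184, about 17 billion
print(disk_bits)   # 8796093022208, about 8.8 trillion
```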

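Types of Memory

DRAM

Volatile memory: the data disappears when the power is turned off. Each bit is held in a capacitor⁴.

Hard Drives

A storage medium made of several stacked metal platters coated with magnetic material. It is slower than DRAM, but offers large capacity at a low price.

1 TB is 8 bits * 2 ** 40, about 8.8 trillion bits; latency 7 milliseconds, cost $100.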

Device	Cost ($ per bit)	Latency	Latency as distance (at light speed)
Plain light bulb	$ 0.50	1 second	300,000km
CPU Register	$ 0.001	< 0.4ns	0.12m
DRAM	$ 10 for 2GB	12ns	3.6m
Hard Drive	n$ 0.01	7 ms	2100km
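The distance column is just the latency multiplied by the speed of light. A sketch of that arithmetic, assuming the rounded value of 300,000 km/s implied by the light bulb row; the dictionary names are hypothetical:

```python
SPEED_OF_LIGHT_KM_PER_S = 300000.0   # rounded, as in the light bulb row

latency_s = {
    'Light bulb': 1.0,
    'CPU register': 0.4e-9,   # 0.4 ns
    'DRAM': 12e-9,            # 12 ns
    'Hard drive': 7e-3,       # 7 ms
}

distance_m = {}
for name in ['Light bulb', 'CPU register', 'DRAM', 'Hard drive']:
    # km/s * 1000 gives m/s; multiplying by seconds gives meters
    distance_m[name] = SPEED_OF_LIGHT_KM_PER_S * 1000 * latency_s[name]
    print('%s: %.2f m' % (name, distance_m[name]))

# Light bulb: 300000000.00 m
# CPU register: 0.12 m
# DRAM: 3.60 m
# Hard drive: 2100000.00 m
```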

Crawling

Crawling process

Start with a seed: tocrawl = [seed], crawled = []

seed = 'http://www.udacity.com/cs101/index.html'
while there_are_more_pages tocrawl:
    pick_a_page_from tocrawl
    add_that_page_to crawled
return crawled

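In theory this never stops, because on a typical website the pages link to one another, so pages that were already visited keep being added back to tocrawl.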

Step	tocrawl	crawled
0	index.html	
1	crawling.html, walking.html, fly.html	index.html
2	walking.html, fly.html	index.html, crawling.html
2-0	kick.html (in crawling.html)	index.html, crawling.html
		index.html, crawling.html, kick.html
...	...	...
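The loop does end once already-crawled pages are skipped. Below is a sketch on a made-up link graph: the page names come from the table, but which page links to which is my assumption, as are the names `LINKS` and `crawl`:

```python
# Hypothetical link graph: page -> pages it links to (note the cycles).
LINKS = {
    'index.html': ['crawling.html', 'walking.html', 'fly.html'],
    'crawling.html': ['kick.html', 'index.html'],
    'walking.html': ['index.html'],
    'fly.html': [],
    'kick.html': ['crawling.html'],
}

def crawl(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page in crawled:
            continue   # already visited: skipping it is what lets the loop end
        crawled.append(page)
        tocrawl.extend(LINKS.get(page, []))
    return crawled

print(crawl('index.html'))
# ['index.html', 'fly.html', 'walking.html', 'crawling.html', 'kick.html']
```

The visiting order differs from the table because `pop()` takes the last entry of `tocrawl` rather than the first; the set of crawled pages is the same.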

Collecting Links

page = ' .......<a href="https://www.naver.com"> ......<a href="https://www.daum.net">...'
get_all_links(page) # -> ['https://www.naver.com', 'https://www.daum.net']

Get All Links

The print_all_links function was changed into get_all_links, which returns a list instead of printing.

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0

    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

# Adapted from print_all_links.
def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        # Returns None, 0 when no further link is found (or there was none to begin with).
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break # stop when there is no url (None)
    return links
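To try the two procedures together, here is a self-contained run on a small sample page. The HTML string `sample` is made up for this example; the URLs are the ones used earlier in these notes:

```python
def get_next_target(page):
    # Find the next '<a href=' and return the quoted URL plus the closing quote position.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if not url:
            break
        links.append(url)
        page = page[endpos:]   # continue searching after the link just found
    return links

sample = ('<p><a href="https://www.naver.com">Naver</a> and '
          '<a href="https://www.daum.net">Daum</a></p>')
print(get_all_links(sample))            # ['https://www.naver.com', 'https://www.daum.net']
print(get_all_links('no links here'))   # []
```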

Quiz: Crawl Web

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop() # take the last link out of tocrawl and store it in page
        if page not in crawled: # check whether page has already been crawled
            union(tocrawl, get_all_links(get_page(page))) # plain + would also work, but union avoids queueing duplicate links
            crawled.append(page)
    return crawled

# get_page function provided by Udacity
def get_page(url):
    # This is a simulated get_page procedure so that you can test your
    # code on two pages "http://xkcd.com/353" and "http://xkcd.com/554".
    # A procedure which actually grabs a page from the web will be 
    # introduced in unit 4.
    try:
        if url == "http://xkcd.com/353":
            return  ' ...some html source '
        elif url == "http://xkcd.com/554":
            return  ' ...some html source '
    except:
        return ""
    return ""
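Putting the pieces together, the whole crawler can be exercised without network access. This is only a sketch: the `PAGES` dictionary, its example.com URLs, and the simplified `get_page` are stand-ins I made up, not the Udacity versions:

```python
# Hypothetical two-page "web" whose pages link to each other.
PAGES = {
    'http://example.com/a': '<a href="http://example.com/b">b</a>',
    'http://example.com/b': ('<a href="http://example.com/a">a</a> '
                             '<a href="http://example.com/b">self</a>'),
}

def get_page(url):
    return PAGES.get(url, '')   # unknown URLs yield an empty page

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def get_all_links(page):
    links = []
    url, endpos = get_next_target(page)
    while url:
        links.append(url)
        page = page[endpos:]
        url, endpos = get_next_target(page)
    return links

def union(a, b):
    for el in b:
        if el not in a:
            a.append(el)

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

print(crawl_web('http://example.com/a'))
# ['http://example.com/a', 'http://example.com/b']
```

Even though the two pages link back to each other, the `page not in crawled` check means each one is processed only once.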

Conclusion

Links

¹. appreciation: recognition of something's true worth; also gratitude

². certainly: definitely, without doubt

³. comparable: similar

⁴. capacitor: a device that stores electric potential energy in an electric circuit.
