A Guide to Scraping Google Maps Data with Python

12.12.2024

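Contents:

Setting up the environment
Step-by-step guide to scraping data from Google Maps

Step 1. Define the target URL
Step 2. Define headers and proxies
Step 3. Fetch the page content
Step 4. Parse the HTML content
Step 5. Extract the data
Step 6. Save the data to CSV

Complete code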

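Scraping data from Google Maps with Python lets you collect information about locations, businesses, and services, which is useful for market analysis, choosing sites for new venues, keeping directories up to date, competitor analysis, and gauging how popular a place is. This guide walks through extracting that information with the Python libraries requests and lxml. It covers sending requests, handling responses, parsing structured data, and exporting the results to a CSV file.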

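Setting up the environment

Make sure the following Python libraries are available:

requests;
lxml;
csv (part of the standard library, no installation needed).

If necessary, install the third-party libraries with pip: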


pip install requests
pip install lxml
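
Before moving on, it can help to confirm that the libraries are importable from the interpreter you plan to run the scraper with. The snippet below is a minimal sketch; the helper name missing_modules is illustrative and not part of any library.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# csv ships with Python; requests and lxml must be installed separately.
missing = missing_modules(["requests", "lxml", "csv"])
if missing:
    print("Install before continuing:", ", ".join(missing))
```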

The scraping process is explained step by step below, with examples.

Step-by-step guide to scraping data from Google Maps

The sections below go through each step in turn, with an example for each.

Step 1. Define the target URL

Specify the URL you want to scrape data from.


url = "https link"
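
If you prefer to build the URL from a search phrase rather than paste it from the browser, a sketch like the following may help. It assumes that the q, tbm=lcl, and num parameters seen in the URL of the complete code are sufficient on their own; the other tokens in that URL (for example sxsrf and ved) are session-specific and are left out here. The function name build_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(query, results=10):
    """Return a Google local-search URL for the given query (illustrative)."""
    params = {
        "q": query,      # search phrase
        "tbm": "lcl",    # local-results mode, as in the URL of the complete code
        "num": results,  # number of results requested
    }
    # urlencode escapes spaces and special characters in the query.
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("restaurants near me")
```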

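Step 2. Define headers and proxies

Setting appropriate headers is essential: they imitate the activity of a real user and considerably reduce the chance that the scraper is flagged as a bot. Routing requests through proxy servers also helps keep scraping running without interruption, because it avoids the blocks that come from exceeding the request limits tied to a single IP address.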


headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version-list': '"Not)A;Brand";v="99.0.0.0", "Google Chrome";v="127.0.6533.72", "Chromium";v="127.0.6533.72"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Linux"',
    'sec-ch-ua-platform-version': '"6.5.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}

proxies = {
    "http": "http://username:password@your_proxy_ip:port",
    "https": "https://username:password@your_proxy_ip:port",
}
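
Proxy credentials often contain characters such as @ or : that break the proxy URL if they are pasted in unescaped. The sketch below builds the same kind of proxies mapping with percent-encoded credentials. The helper name build_proxies and the sample values are assumptions, as is the choice to reach the proxy over plain HTTP for both schemes; check your provider's documentation for the scheme it expects.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Build a requests-style proxies mapping with URL-safe credentials."""
    # Percent-encode credentials so '@' or ':' cannot be mistaken for URL syntax.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Assumption: the proxy is reached over HTTP and tunnels HTTPS traffic.
    proxy_url = f"http://{auth}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values; 203.0.113.10 is a documentation-only address.
proxies = build_proxies("user", "p@ss:word", "203.0.113.10", 8080)
```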

Step 3. Fetch the page content

Send a request to the Google Maps URL and retrieve the page content:


import requests

response = requests.get(url, headers=headers, proxies=proxies)
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
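
A single failed request does not have to end the run. The following is a minimal retry sketch, not part of the original guide: fetch_with_retries is a hypothetical helper that accepts any zero-argument callable returning an object with status_code and content attributes, which is what requests.get returns.

```python
import time


def fetch_with_retries(send, attempts=3, delay=1.0):
    """Call send() until it returns a 200 response or the attempts run out."""
    last_status = None
    for attempt in range(1, attempts + 1):
        response = send()
        if response.status_code == 200:
            return response.content
        last_status = response.status_code
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise RuntimeError(
        f"Giving up after {attempts} attempts; last status: {last_status}"
    )
```

With requests, it could be called as page_content = fetch_with_retries(lambda: requests.get(url, headers=headers, proxies=proxies, timeout=10)). The timeout value is an assumption; without one, a stalled connection can hang indefinitely.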

Step 4. Parse the HTML content

Parse the HTML content with lxml:


from lxml import html

parser = html.fromstring(page_content)

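Identifying the data XPaths

To extract data correctly, you need to understand the structure of the HTML document and identify XPath expressions for the elements you want to scrape. Here is how:

Inspect the web page: open the Google Maps page in a browser and examine the HTML structure with the browser's developer tools (right-click, then Inspect).
Find the relevant elements: locate the HTML elements that contain the data you want to scrape (for example, the restaurant name and address).
Write the XPaths: write XPath expressions that extract the data based on the HTML structure. The XPaths used in this guide are:

Restaurant name: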


//div[@jscontroller="AtSb"]/div/div/div/a/div/div/div/span[@class="OSrXXb"]/text()

Address:


//div[@jscontroller="AtSb"]/div/div/div/a/div/div/div[2]/text()

Options:


//div[@jscontroller="AtSb"]/div/div/div/a/div/div/div[4]/div/span/span[1]//text()

Latitude:


//div[@jscontroller="AtSb"]/div/@data-lat

Longitude:


//div[@jscontroller="AtSb"]/div/@data-lat/../@data-lng
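
XPath expressions like these are easy to get subtly wrong, so it is worth trying them on a small piece of markup before running them against a live page. The sketch below uses hand-written sample HTML that only imitates the nesting the expressions expect; the real page is much larger, and its class names and attribute values may change at any time.

```python
from lxml import html

# Hypothetical markup shaped to match the XPaths in this guide.
SAMPLE = """
<html><body>
<div jscontroller="AtSb">
  <div data-lat="12.97" data-lng="77.59">
    <div><div><a><div><div>
      <div><span class="OSrXXb">Sample Diner</span></div>
      <div>1 Example Street</div>
    </div></div></a></div></div>
  </div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
base = '//div[@jscontroller="AtSb"]'

names = tree.xpath(base + '/div/div/div/a/div/div/div/span[@class="OSrXXb"]/text()')
addresses = tree.xpath(base + '/div/div/div/a/div/div/div[2]/text()')
latitudes = tree.xpath(base + '/div/@data-lat')
longitudes = tree.xpath(base + '/div/@data-lng')

print(names, addresses, latitudes, longitudes)
```

If an expression returns an empty list on the real page, the page structure has probably changed and the XPath needs to be rewritten from a fresh inspection.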

Step 5. Extract the data

Extract the data using the XPaths identified above:


results = parser.xpath('//div[@jscontroller="AtSb"]')
data = []

for result in results:
    restaurant_name = result.xpath('.//div/div/div/a/div/div/div/span[@class="OSrXXb"]/text()')[0]
    address = result.xpath('.//div/div/div/a/div/div/div[2]/text()')[0]
    options = ', '.join(result.xpath('.//div/div/div/a/div/div/div[4]/div/span/span[1]//text()'))
    geo_latitude = result.xpath('.//div/@data-lat')[0]
    geo_longitude = result.xpath('.//div/@data-lng')[0]

    # Append the record to the data list
    data.append({
        "restaurant_name": restaurant_name,
        "address": address,
        "options": options,
        "geo_latitude": geo_latitude,
        "geo_longitude": geo_longitude
    })
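
Indexing an XPath result with [0] raises IndexError whenever a listing lacks that element, which stops the whole loop. One way to make the extraction more tolerant is a small helper that falls back to a default value. The name first is an assumption, not something provided by lxml.

```python
def first(values, default=""):
    """Return the first match from an XPath result list, stripped, or a default."""
    return values[0].strip() if values else default


# Inside the loop, each lookup could then be written as, for example:
#     address = first(result.xpath('.//div/div/div/a/div/div/div[2]/text()'))
```

Listings with missing fields then produce empty cells in the CSV instead of aborting the run.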

Step 6. Save the data to CSV

Save the extracted data to a CSV file:


import csv

with open("google_maps_data.csv", "w", newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["restaurant_name", "address", "options", "geo_latitude", "geo_longitude"])
    writer.writeheader()
    for entry in data:
        writer.writerow(entry)
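
To check the CSV output without touching the file system, the same DictWriter logic can be pointed at an in-memory buffer and read back. This is a sketch with a made-up sample record; rows_to_csv and FIELDS are illustrative names.

```python
import csv
import io

FIELDS = ["restaurant_name", "address", "options", "geo_latitude", "geo_longitude"]


def rows_to_csv(rows):
    """Serialize a list of record dicts to CSV text using the guide's columns."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)  # writes every record in one call
    return buffer.getvalue()


# Made-up record; note the comma inside "options", which csv quotes automatically.
sample = [{
    "restaurant_name": "Sample Diner",
    "address": "1 Example Street",
    "options": "Dine-in, Takeaway",
    "geo_latitude": "12.97",
    "geo_longitude": "77.59",
}]

csv_text = rows_to_csv(sample)
read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```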

Complete code

Here is the complete code for scraping Google Maps data:


import requests
from lxml import html
import csv

# Define the target URL, headers, and proxies
url = "https://www.google.com/search?sca_esv=04f11db33f1535fb&sca_upv=1&tbs=lf:1,lf_ui:4&tbm=lcl&sxsrf=ADLYWIIFVlh6WQCV6I2gi1yj8ZyvZgLiRA:1722843868819&q=google+map+restaurants+near+me&rflfq=1&num=10&sa=X&ved=2ahUKEwjSs7fGrd2HAxWh1DgGHbLODasQjGp6BAgsEAE&biw=1920&bih=919&dpr=1"
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
    'sec-ch-ua-arch': '"x86"',
    'sec-ch-ua-bitness': '"64"',
    'sec-ch-ua-full-version-list': '"Not)A;Brand";v="99.0.0.0", "Google Chrome";v="127.0.6533.72", "Chromium";v="127.0.6533.72"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Linux"',
    'sec-ch-ua-platform-version': '"6.5.0"',
    'sec-ch-ua-wow64': '?0',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}
proxies = {
    "http": "http://username:password@your_proxy_ip:port",
    "https": "https://username:password@your_proxy_ip:port",
}

# Fetch the page content
response = requests.get(url, headers=headers, proxies=proxies)
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    raise SystemExit(1)  # stop with a non-zero exit code; there is nothing to parse

# Parse the HTML content
parser = html.fromstring(page_content)

# Extract the data using XPath
results = parser.xpath('//div[@jscontroller="AtSb"]')
data = []

for result in results:
    restaurant_name = result.xpath('.//div/div/div/a/div/div/div/span[@class="OSrXXb"]/text()')[0]
    address = result.xpath('.//div/div/div/a/div/div/div[2]/text()')[0]
    options = ', '.join(result.xpath('.//div/div/div/a/div/div/div[4]/div/span/span[1]//text()'))
    geo_latitude = result.xpath('.//div/@data-lat')[0]
    geo_longitude = result.xpath('.//div/@data-lng')[0]

    # Append the record to the data list
    data.append({
        "restaurant_name": restaurant_name,
        "address": address,
        "options": options,
        "geo_latitude": geo_latitude,
        "geo_longitude": geo_longitude
    })

# Save the data to CSV
with open("google_maps_data.csv", "w", newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["restaurant_name", "address", "options", "geo_latitude", "geo_longitude"])
    writer.writeheader()
    for entry in data:
        writer.writerow(entry)

print("Data has been successfully scraped and saved to google_maps_data.csv.")

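Effective web scraping depends on using the right request headers and proxies. Datacenter or ISP proxies are the best choice for speed, since they offer high throughput and low latency. However, they are static, so you need to implement IP rotation to avoid blocks effectively. A more user-friendly alternative is residential proxies: these dynamic proxies simplify rotation and have a higher trust factor, which makes them more effective at avoiding blocks.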
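
For static proxies, rotation can be as simple as cycling through a pool and using the next address for each request. The sketch below shows one possible approach; the pool entries are placeholders on documentation-only IP addresses, and next_proxies is a hypothetical helper.

```python
from itertools import cycle

# Placeholder proxy URLs; replace them with the addresses from your provider.
PROXY_POOL = [
    "http://username:password@203.0.113.10:8080",
    "http://username:password@203.0.113.11:8080",
]

_rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies mapping for the next proxy in the pool."""
    proxy_url = next(_rotation)
    return {"http": proxy_url, "https": proxy_url}
```

Each request would then pass proxies=next_proxies() instead of a fixed mapping, so consecutive requests leave from different IP addresses.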
