Overview of the web scraping tool Octoparse

Octoparse is an automated web scraping and data extraction tool designed to crawl websites and gather large volumes of information. It efficiently transfers data into spreadsheets and databases for further analysis. This tool is especially valuable for analysts, directors, traders, marketers, and anyone involved in strategic planning, competitive analysis, and targeting within the e-commerce sector.

1.png

Octoparse features

Octoparse is used across various sectors to collect data and automate routine tasks. According to its developers, it can extract information from 98% of websites, including interactive, complex, and dynamic web resources. The tool mimics human browsing behavior and offers a robust suite of features:

  • Built-in browser: allows users to log into accounts, perform searches, navigate through pages, and operate on endlessly scrolling pages;
  • CAPTCHA bypass: built-in functionality for getting past CAPTCHAs encountered during scraping;
  • Data extraction: capable of extracting text, both internal and external HTML links, attributes, and selecting values for deeper data collection. It can also retrieve URLs of files and images;
  • Ad blocking: blocks advertisements to reduce traffic usage and accelerate the parsing process;
  • Proxy support: enables the setup and rotation of proxy servers to ensure continuous operation and circumvent site blocks;
  • Scheduled scans: website scans can be run on a schedule, so the collected data stays up to date.

    2.png

Octoparse capabilities

Octoparse offers several technical advantages that enhance its web scraping capabilities, allowing users to address a wide range of problems effectively:

  • It can be launched locally on a computer or deployed in the cloud across multiple servers, which can accelerate the web scraping process by up to 20 times.
  • Its “Smart Mode” feature allows for immediate conversion of web pages into structured data tables simply by entering the URL.
  • There are handy Octoparse templates available for popular platforms like Facebook, Instagram, YouTube, Twitter, and Google.
  • It includes RegEx and XPath tools for more precise searching of web elements.
  • Processed data can be exported to various formats including CSV, Excel, JSON, HTML, and TXT.
  • The application is capable of handling tasks like processing authorization, searching forms, expanding comments and lists, collecting data from calendars and maps, and working with Ajax and JavaScript.
  • The workflow can be visualized through the designer to clearly understand the logic (variables, loops, and conditional expressions), with options to modify the diagram using a “Point-and-click” interface.

    3.png
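To make the XPath bullet above more concrete, here is a minimal sketch of what an XPath selector does, written in plain Python rather than in Octoparse itself. The page fragment, the class names, and the `extract_products` function are hypothetical; Octoparse applies its selectors through its own interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
SAMPLE_HTML = """
<div>
  <div class="product"><span class="name">Kettle</span><span class="price">$24.99</span></div>
  <div class="product"><span class="name">Toaster</span><span class="price">$39.50</span></div>
</div>
"""

def extract_products(markup):
    """Select product nodes with XPath and read their child fields."""
    root = ET.fromstring(markup)
    rows = []
    # ElementTree supports a limited XPath subset, enough for this selector.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("span[@class='name']")
        raw_price = node.findtext("span[@class='price']")
        # Drop the currency sign and convert the rest to a number.
        rows.append({"name": name, "price": float(raw_price.lstrip("$"))})
    return rows

products = extract_products(SAMPLE_HTML)
```

Each product block becomes one row with a `name` and a numeric `price`, which is the same page-to-table idea that the point-and-click designer expresses visually.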

The Octoparse program is designed to be user-friendly, requiring no technical or programming skills, making it ideal for those new to the parsing process. The website offers clear tutorials that demonstrate how to use Octoparse, showcasing its popular features and presenting real-life user scenarios for common tasks. Additionally, the site's frequently asked questions and tutorial section delve into less obvious methods for accelerating data collection, offer solutions to common errors, provide tips on bypassing query restrictions, and include other helpful resources.

Email address extraction

Octoparse can be used to collect email addresses from publicly displayed sources, for example to send offers to potential clients. The software is capable of collecting up to 100,000 email addresses in just a few hours. Octoparse also features a universal template designed specifically for collecting contact information from various online platforms, including LinkedIn pages, social networks, service directories, and company directories, which makes it useful for marketing and outreach work.
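For readers curious about what email extraction involves under the hood, the following is a rough sketch in Python, not Octoparse's own implementation. The pattern is deliberately simplified, and the sample text and function name are made up for illustration.

```python
import re

# Simplified pattern: it covers common addresses, not every valid form.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(text):
    """Return unique email addresses in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_PATTERN.findall(text):
        address = match.lower()  # treat differently-cased duplicates as one
        if address not in seen:
            seen.add(address)
            result.append(address)
    return result

# Hypothetical page text.
page_text = "Sales: sales@example.com, support: Support@Example.com or sales@example.com."
emails = extract_emails(page_text)
```

Deduplicating while preserving order matters at the volumes mentioned above, since the same address often appears on many pages of one site.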

Web data extraction

Mass information collection is particularly valuable for applications such as price monitoring, lead generation, and market research. For tasks involving the analysis of a large volume of indicators that change in real-time, web scraping in cloud mode is most effective. This approach allows for up to 20 simultaneous threads to operate on an automated schedule. The data collected can be saved directly to a file on a PC or to a database where it can be sorted, updated, and structured to meet specific needs.
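The idea of running many scraping threads at once can be sketched with Python's standard thread pool. This is only an illustration of the concurrency pattern, assuming a stand-in `scrape_page` function that does no real network access; it says nothing about how Octoparse's cloud mode is built.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # mirrors the 20-thread ceiling mentioned above

def scrape_page(url):
    """Stand-in for a real fetch-and-parse step; no network access."""
    return {"url": url, "status": "done"}

def scrape_all(urls, max_workers=MAX_WORKERS):
    """Process URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_page, urls))

# Hypothetical URLs.
urls = [f"https://example.com/item/{n}" for n in range(1, 6)]
results = scrape_all(urls)
```

Because `pool.map` returns results in input order, the output can be written straight into a file or database table without re-sorting.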

Image extraction

With Octoparse, you can efficiently generate lists of image addresses for subsequent downloading. The scraper's functions let you automate tasks such as searching by meta tags or update dates, saving links to all images in a carousel, and collecting URLs for full-size images instead of thumbnails. Octoparse can also capture related information from websites (prices, locations, descriptions, and contact details of products, hotels, or services) for further analysis. You can download the files either through a third-party image downloader or with a built-in option when processing locally on your computer.
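As a rough illustration of building a list of image addresses, the sketch below collects `<img>` URLs from a page fragment using only Python's standard library. The `data-full` attribute, the sample markup, and the base URL are assumptions made for the example; real sites mark full-size images in many different ways.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect absolute image URLs from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Assumption: a 'data-full' attribute, when present, holds the full-size image.
        src = attributes.get("data-full") or attributes.get("src")
        if src:
            self.urls.append(urljoin(self.base_url, src))

def collect_image_urls(markup, base_url):
    """Parse the markup and return the image URLs found, in page order."""
    parser = ImageCollector(base_url)
    parser.feed(markup)
    return parser.urls

# Hypothetical carousel fragment.
sample = '<div class="carousel"><img src="/thumbs/a.jpg" data-full="/full/a.jpg"><img src="b.png"></div>'
image_urls = collect_image_urls(sample, "https://example.com/gallery/")
```

Resolving every address against the page URL turns relative paths into links that a separate downloader can fetch directly.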

Phone number extraction

You can use Octoparse to collect data from various sources such as Yelp, Google Maps, LinkedIn, handyman service sites, and company directories. Octoparse is capable of accessing data hidden behind elements like the “Show number” button and copying it. Once configured, the program allows you to gather not just phone numbers, but also names, comments, and service descriptions. All of this information can be efficiently organized and transferred into a table for easy analysis.

Diverse data collection

Octoparse is adept at extracting information from websites that employ anti-scraping technologies, making it a powerful tool for addressing various data collection challenges. Here are some of the key problems it can solve:

  • Extracting information from dynamic resources that use JavaScript and AJAX;
  • Parsing sites with endless scrolling to capture continuous data;
  • Aggregating online news and articles from diverse sources;
  • Extracting nested and embedded structures within web pages;
  • Retrieving e-commerce data such as reviews, supplier lists, ratings, and prices from major platforms like Amazon, eBay, and Aliexpress.

The API integrated into Octoparse enhances its functionality by allowing data to be retrieved without needing to wait for a response from the web server. It enables the automatic transmission of information from the cloud to your work environment, such as a CRM system, and allows for the customization of scripts and task parameters. For basic needs, the free version of Octoparse may suffice. However, for the comprehensive implementation of large-scale projects, the paid package offers more robust features and capabilities.

Octoparse pricing plans

Octoparse offers three subscription types: free, standard, and professional. Both premium subscriptions can be tried for free for 14 days by simply registering and applying. For paid packages, there is an option to request a refund within 5 days of purchase. Additionally, the annual subscriptions in Octoparse are more cost-effective compared to monthly payments.

4.png

All plans in Octoparse utilize the same client software, with the primary difference being the range of functionality available at each subscription level.

Free

Ideal for small projects, Octoparse's free plan allows unlimited page processing. You can set up to 10 tasks and run two simultaneously. However, the free version is limited to local PC launches only; cloud parsing is not supported.

Standard plan

The optimal solution for small businesses and individual employees, this plan provides access to almost all popular functions. The main advantages are more than a hundred ready-made templates for various platforms, up to 100 tasks, access to cloud processes (up to 6 in parallel), and also:

  • The ability to integrate a proxy into Octoparse to change IP and configure rotation, which allows you to increase the number of requests without risking potential blocking;
  • Downloading images and files in jpg, png, gif, doc, pdf, ppt, txt, xls, and zip formats;
  • Auto-export of data and access via API.

Professional plan

Designed for large-scale operations, this package allows up to 250 tasks and the use of 20 cloud processes simultaneously. It includes a cloud autocopy feature. Subscribers receive personalized training and priority technical support.

| Tariff | Free | Standard | Professional |
|---|---|---|---|
| Cost | Free | $89/month, $900/year (Save 16%) | $249/month, $2496/year (Save 16%) |
| Number of tasks | 10 | 100 | 250 |
| Parallel local tasks on PC | 2 | Unlimited | Unlimited |
| Parallel tasks in the cloud | 0 | 6 | 20 |
| IP proxy rotation | Yes | Yes | Yes |
| Proxy server support | Yes | Yes | Yes |
| Scheduled scraping | No | Yes | Yes |
| API integration with CRM | No | Yes | Yes |
| Captcha bypass | No | Yes | Yes |
| Data collection from images | Yes | Yes | Yes |
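
The "Save 16%" figures can be checked with a few lines of arithmetic. The sketch below uses only the monthly and annual prices listed above; the dictionary layout and function name are just for illustration.

```python
# Prices taken from the plan comparison above.
PLANS = {
    "Standard": {"monthly": 89, "annual": 900},
    "Professional": {"monthly": 249, "annual": 2496},
}

def annual_saving_percent(monthly, annual):
    """Percentage saved by paying annually instead of monthly for 12 months."""
    full_year = monthly * 12
    return (full_year - annual) / full_year * 100

savings = {name: round(annual_saving_percent(**p), 1) for name, p in PLANS.items()}
```

This gives about 15.7% for Standard and 16.5% for Professional, both of which round to the advertised 16%.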

Large corporate clients can request a bespoke tariff plan, tailored to their specific requirements and needs.

The Octoparse interface

Once you launch the program, it immediately asks you to sign in or register with your Google, Microsoft, or email account, after which you are logged in to your profile automatically. A window then appears, giving you a quick overview of what the program can do. Following that, you're invited to take a short, step-by-step tutorial to get you up to speed.

5.png

6.png

User profile

The “My Account” tab offers a concise overview of several key details:

  • User data, including your avatar, email address, full name, username, and password;
  • The type and expiration date of your subscription;
  • Any accounts you have linked;
  • The funds currently available in your balance and team management actions.

    7.png

Creating a new task

All work with Octoparse begins with the creation of a task, which consists of instructions for the program to execute. On the sidebar, clicking the “New” icon provides two options:

  • Custom Task allows for advanced customization of a task.
  • Task Template offers ready-made templates for most services, accessible with a paid subscription.

    8.png

Selecting “Custom Task” allows you to determine the source of the URL. Options include entering it manually, importing it from a file, or using an existing task. The “Batch generate” function facilitates the creation of numerous links through templates based on a specified URL. Additionally, the task can be assigned to a designated group.

9.png
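
The idea behind batch URL generation can be sketched in a few lines of Python. The URL pattern and the `batch_generate` helper are hypothetical and only show the principle of substituting a changing value into a fixed template; the options in Octoparse's own dialog may differ.

```python
def batch_generate(template, start, stop, step=1):
    """Build a list of URLs by substituting a page number into a template."""
    return [template.format(page=n) for n in range(start, stop + 1, step)]

# Hypothetical URL pattern; '{page}' marks the part that changes.
page_urls = batch_generate("https://example.com/catalog?page={page}", 1, 5)
```

Here `page_urls` holds five links, from `?page=1` through `?page=5`, ready to be fed to a task as its list of start URLs.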

Dashboard - information panel

The information panel displays existing tasks along with various management options:

  • Tasks can be run in the cloud or on your computer;
  • Autorun settings can be configured;
  • It is possible to check which tasks are currently running in the cloud and which ones have completed;
  • Filters can be applied;
  • Tasks can be searched by name;
  • Various actions can be performed with tasks, such as duplicating, viewing data, exporting, deleting, and more.

    10.png

Templates

The “Templates” tab in Octoparse features a collection of web scraping templates—pre-formatted tasks that are ready to use without the need to establish scraping rules or write any code.

The templates are organized into several categories:

  • Contact information and potential clients, which includes templates for extracting emails, phone numbers, and social media profile links;
  • E-commerce, covering templates for gathering data on products, prices, and delivery options;
  • Travel, with templates for details such as hotel names, addresses, star ratings, amenities, breakfast availability, review counts, average ratings, and room availability;
  • Social media, with templates that can pull usernames, post content, number of likes, locations, image or video URLs, and video descriptions.

Additional pre-made templates are available for various other resources.

11.png

Traditionally, web scraping means writing your own scraper, for example in Python, but Octoparse simplifies this with its ready-made templates. Simply choose a template and specify a URL to get started.

12.png

Tools

The toolbar includes several useful features:

  • RegEx tool allows for the automatic creation of regular expressions by setting various criteria. This is particularly useful for matching or replacing characters in field values to refine the extracted data.
  • Database auto-export tool enables the automatic transmission of results to Excel or databases such as MySQL, SQL Server, Oracle, and others.

    13.png
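
To show what matching or replacing characters in field values looks like in practice, here is a small Python sketch. The patterns, sample values, and helper names are illustrative assumptions and not the expressions that Octoparse's RegEx tool generates.

```python
import re

def clean_price(value):
    """Strip currency symbols and thousands separators, keep the number."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", value)
    return float(match.group().replace(",", "")) if match else None

def collapse_whitespace(value):
    """Replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", value).strip()

# Hypothetical raw field values as they might come off a page.
price = clean_price("Price: $1,299.00 (incl. VAT)")
title = collapse_whitespace("  Wireless \n  Headphones\t Pro ")
```

The first helper is a "match" refinement (keep only the numeric part), the second a "replace" refinement (normalize spacing), which are the two uses the RegEx tool description mentions.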

How to create a new task in Octoparse

Let's consider the process with a practical example:

Step 1. Creating a new parsing task

To get started, click on the “New” icon and choose “Custom Task”. Then, copy the website's URL and paste it into the “URL Input” line. Click “Save” to store the task. Alternatively, you can directly enter the URL into the search bar on the main page and click “Start” to begin.

14.png

15.png

Step 2. Automatic data field detection

Once you input the URL, Octoparse will load the page in its built-in browser. To proceed, click on “Auto-detect webpage data” in the Tips panel. The program will then scan the page and automatically suggest the appropriate fields for data extraction.

16.png

17.png

Step 3. Configuring data fields

Review the suggested data fields and ensure that the required elements on the page are highlighted. You can rename or delete fields using the “Data Preview” panel at the bottom.

18.png

Step 4. Building the parsing workflow

Click “Create Workflow” to define each step of the process. By clicking on each action, you can verify that the parser is working correctly.

19.png

Step 5. Launching and scheduling the parser

Click “Run” at the top right:

20.png

Select the server where the request will be processed:

  • “Run on your device” is an option available in the free version. It uses your computer's power and internet connection.
  • “Run in the Cloud” is a faster option, ideal for constant scraping. It allows you to schedule autoruns for dynamic websites with frequently updated content to keep your data current.

You can also configure an automatic launch schedule here:

21.png

Step 6. Exporting collected data

After the parser completes, you can export the results to Excel, CSV, HTML, XML, JSON, databases, or Google Sheets for further analysis.

22.png
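
For a sense of how the same records look in two of these formats, the sketch below serializes a couple of made-up rows to CSV and JSON with Python's standard library. The field names and values are hypothetical, and the exact layout of Octoparse's own export files may differ.

```python
import csv
import io
import json

# Hypothetical rows standing in for scraped results.
rows = [
    {"name": "Kettle", "price": 24.99},
    {"name": "Toaster", "price": 39.5},
]

def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def to_json(records):
    """Serialize records to pretty-printed JSON text."""
    return json.dumps(records, indent=2)

csv_text = to_csv(rows)
json_text = to_json(rows)
```

CSV suits spreadsheets and flat tables, while JSON keeps value types and is usually easier to load into another program.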

Step-by-step proxy setup in the Octoparse parser

To bypass parsing protections on most websites and reduce the risk of being blocked due to numerous simultaneous requests from a single IP, it's recommended to use the built-in automatic proxy rotation functionality. For configuration, you can use either your own proxies or those provided by the program. Let’s walk through the setup process using an already-created task as an example:

  1. Open a task and click on “Task Settings”.

    23.png

  2. Under the “Anti-Blocking” section, enable proxy access and choose “Use my own proxies”. Then, click the “Configure” button.

    24.png

  3. Set the rotation time for the proxies and input the proxy addresses in the format IP-address:port:username:password.

    25.png

  4. Click “Confirm” to apply these settings and specify any additional parameters if necessary.

    26.png

  5. Click “Save” and then run the task. With this setup, IPs will rotate and cookies will be cleared automatically, completing the proxy setup in Octoparse.
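
The proxy format from step 3 and the idea of rotation can be illustrated with a short Python sketch. The addresses and credentials are placeholders from a documentation-only IP range, the `http://` scheme is an assumption, and the round-robin loop is only one simple way to rotate; Octoparse's internal rotation logic is not described in this review.

```python
from itertools import cycle

def parse_proxy(line):
    """Split 'IP-address:port:username:password' into a proxy URL."""
    host, port, username, password = line.strip().split(":")
    # Assumption: plain HTTP proxies; other schemes would need a different prefix.
    return f"http://{username}:{password}@{host}:{port}"

# Placeholder entries, not real proxies.
proxy_lines = [
    "192.0.2.10:8080:user1:secret1",
    "192.0.2.11:8080:user2:secret2",
]

proxies = [parse_proxy(line) for line in proxy_lines]
rotation = cycle(proxies)  # round-robin: wraps to the first proxy after the last
first_three = [next(rotation) for _ in range(3)]
```

Note that this simple split assumes the password contains no colon; a line with extra colons would raise an error and need separate handling.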

Conclusion

In this review of Octoparse, we explored its primary features, capabilities, functions, and settings. Octoparse is a straightforward yet powerful tool for scraping web data from both static and dynamically updated websites. For optimal performance and continuous data collection without the risk of being blocked, it is advisable to use proxy servers. Individual IPv4 or ISP data center proxies will work, but they require a pool of addresses with rotation configured. Mobile and residential proxies with a high trust rating are a more reliable alternative.
