Project Summary: Seeking a skilled data specialist or programmer experienced in text parsing and data extraction to process a large text file containing compiled VS Battles Wiki character profiles. The goal is to accurately extract key data points (Character Name, Combat Tier, and Origin/Franchise) for each character and deliver the results in a clean, structured Excel spreadsheet.

Project Description: I have compiled the raw text content from numerous individual character profiles on the VS Battles Wiki into a single, large text file. This file serves as the primary source for this data extraction project. The final deliverable I require is a single Microsoft Excel file (.xlsx) that contains one row for each reliably identifiable unique character from the text dump.

The output file should include the following columns:
* Character Name: The primary name for the character. Logic is needed to identify and select the most appropriate name, potentially handling common aliases or variations if present within a profile block.
* Tier: The character's combat or power Tier. This information is present in the profile text in various formats (e.g., "High 6-C", "Low 2-C | 1-C", "Varies from...", "Unknown"). The extraction needs to be robust enough to capture these variations accurately and to handle cases where the Tier is missing or explicitly marked as "Unknown".
* Origin: The franchise, series, or source material the character belongs to. This information is also present in the profile text in different formats (e.g., "Origin: [[Franchise Name]]", "Origin: Franchise Name", "Origin:" followed by text on the next line). The extraction should identify the specific franchise name and handle cases where the origin is missing, unclear, or listed generically (e.g., "Characters", "Video Game", "Female"). Prioritize specific franchise names over generic terms or "Unknown".
* (Optional but preferred) URL: If the URL of the character's profile page can be reliably extracted or constructed from the data within the text dump, include it in a separate column.

Input Files I Will Provide:
* Primary Source: A single, large text file (.txt) containing the combined raw text content of all character profiles from the VS Battles Wiki. This file is comprehensive and contains the data that needs to be parsed.
* Supplementary Files (For Reference): I also have two Excel files (.xlsx) that are the results of previous partial extraction attempts, each focusing on different ways Tier and Origin information can be formatted in the profiles. These files can serve as helpful examples of the data variations you will encounter and demonstrate the kind of specific origin/tier values I am looking for. They are supplementary and not the primary source for extraction.

Key Requirements & Expectations:
* Develop and use a script (likely in Python, with libraries such as re for regex parsing and pandas for data handling) to read and parse the large text file.
* Implement robust parsing logic to extract Character Name, Tier, and Origin based on the diverse formats within the text.
* Apply logic to consolidate data for the same character if they appear multiple times or with slight name variations in the text dump (grouping similar names if necessary).
* Handle missing data or generic origins/tiers appropriately (e.g., mark as "Unknown").
* Ensure all identifiable characters from the text dump are included in the output (potentially more than 31,000 unique characters).
* Output a clean, well-organized Excel (.xlsx) file with the specified columns.
* (Optional but preferred) Provide the source code of the extraction script used.
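To illustrate the Tier and Origin formats listed above, here is a minimal sketch of one possible extraction approach, not a required implementation. It assumes each profile block has already been split out of the dump, and that Tier appears as a "Tier:" label in the same way as "Origin:" (the brief only shows the "Origin:" label explicitly). The names `extract_field`, `clean_origin`, and `parse_profile` are hypothetical.

```python
import re

# Values treated as "no specific franchise" (taken from the examples in the brief).
GENERIC_ORIGINS = {"", "unknown", "characters", "video game", "female"}


def extract_field(block: str, label: str) -> str:
    """Return the text after 'label:' on the same line, else the next line."""
    pattern = re.compile(
        rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*(.*)(?:\r?\n[ \t]*(.*))?",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(block)
    if match is None:
        return ""
    same_line = match.group(1).strip()
    next_line = (match.group(2) or "").strip()
    return same_line or next_line


def clean_origin(raw: str) -> str:
    """Unwrap [[Franchise]] or [[Page|Shown]] links and map generic values to Unknown."""
    text = raw.strip()
    link = re.search(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", text)
    if link:
        text = (link.group(2) or link.group(1)).strip()
    return "Unknown" if text.lower() in GENERIC_ORIGINS else text


def parse_profile(block: str) -> dict:
    """Extract Tier and Origin from one profile block (assumed 'Tier:' label)."""
    tier = " ".join(extract_field(block, "Tier").split())
    return {
        "Tier": tier or "Unknown",
        "Origin": clean_origin(extract_field(block, "Origin")),
    }
```

This sketch does not check whether the line following an empty "Origin:" is itself another label, and it leaves multi-key tiers such as "Low 2-C | 1-C" as a single string; both would likely need refinement against the real data.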
Skills Preferred:
* Data Extraction
* Text Parsing / Data Parsing
* Python
* Regular Expressions (Regex)
* Pandas (or similar data handling library)
* Excel

I will share the input text file and the supplementary files privately with freelancers who send promising proposals or whom I invite to interview.
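For the consolidation requirement described above (the same character appearing more than once or with slight name variations), the following is a rough sketch under stated assumptions. It assumes rows are dictionaries with "Character Name", "Tier", and "Origin" keys, and that names differing only in case, punctuation, or spacing refer to the same character. `name_key` and `consolidate` are hypothetical helpers; real alias handling would likely need more than this.

```python
import re


def name_key(name: str) -> str:
    """Grouping key: case-folded name with punctuation and extra spaces collapsed."""
    return re.sub(r"[\W_]+", " ", name.casefold()).strip()


def consolidate(rows: list) -> list:
    """Merge rows sharing a name key, preferring specific values over 'Unknown'."""
    merged = {}
    for row in rows:
        key = name_key(row["Character Name"])
        if key not in merged:
            # First occurrence: keep its spelling of the name as the primary one.
            merged[key] = dict(row)
            continue
        kept = merged[key]
        for column in ("Tier", "Origin"):
            current = kept.get(column, "Unknown")
            candidate = row.get(column, "Unknown")
            if current == "Unknown" and candidate != "Unknown":
                kept[column] = candidate
    return list(merged.values())
```

The merged rows could then be loaded into a `pandas.DataFrame` and written with `DataFrame.to_excel("output.xlsx", index=False)`, which requires an Excel engine such as openpyxl to be installed; the file name here is only a placeholder.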
Keyword: Content Developer
Price: $100.0
Data Extraction, Python, Microsoft Excel