Parquet Export Function for Data Scraping Services

I am new to Outscraper and to Parquet, but the latter looks like a competitive advantage over services that export only CSV, which we know is fragile and not a database format.

Do you have any idea how, or how well, the Parquet file you provide will map into BigQuery? My big question: if I run Email Enrichment and I have multiple people per line, with name, email, and job title, do you guys split the relationship into Business vs. Contacts, and are they related somehow, in a relational-database way? If not, please do consider…

How can I make this clearer? It appears that a Parquet file can be read as a data source by various Big Data tools, imported into data lakes, and so on. Specifically, BigQuery allows importing it through its console.
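For reference, a minimal sketch of that load done programmatically with the official google-cloud-bigquery client rather than the console (the file name and table ID below are hypothetical placeholders):

```python
# Load a Parquet export into BigQuery; pip install google-cloud-bigquery.
from google.cloud import bigquery

client = bigquery.Client()

# BigQuery reads the schema straight from the Parquet file's metadata.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

with open("outscraper_export.parquet", "rb") as f:  # hypothetical file name
    load_job = client.load_table_from_file(
        f, "my_project.my_dataset.businesses", job_config=job_config
    )

load_job.result()  # block until the load job finishes
table = client.get_table("my_project.my_dataset.businesses")
print(f"Loaded {table.num_rows} rows")
```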

The Email Enricher is great and seems to work well, but repeating groups of Name, Email, and Job Title are a violation of normalization, and not functional… Clearly, the relationship is something like Business, to which the majority of the scraped data attach, and Contacts.

I plan to investigate this, but Support implies that this functionality does not exist yet, so I am looking for support here. It’s a great idea and very valuable, so please support it.

I am leaving my original post, memorializing my ignorance on the subject.
I have used the Google BigQuery console to import a small dataset. I created a new table, Contacts, with a query of the email “set” of columns, then ran the equivalent of SELECT INTO for the other two “sets”. So that’s one way of solving the relationship issue. I used google_id, I think, as the key, but have not yet defined the relation in BQ.
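For anyone following along, here is a sketch of that splitting step as queries run through the BigQuery Python client instead of the console. The column names (google_id, name_1, email_1, job_title_1, …) are my assumptions about the repeating-group layout, so check them against the actual export:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Pull the first "set" of contact columns into a new Contacts table,
# carrying google_id along as the key back to the businesses table.
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.contacts AS
    SELECT google_id, name_1 AS name, email_1 AS email, job_title_1 AS job_title
    FROM my_dataset.businesses
    WHERE email_1 IS NOT NULL
""").result()

# Append the other two "sets" with the equivalent of SELECT INTO.
for i in (2, 3):
    client.query(f"""
        INSERT INTO my_dataset.contacts (google_id, name, email, job_title)
        SELECT google_id, name_{i}, email_{i}, job_title_{i}
        FROM my_dataset.businesses
        WHERE email_{i} IS NOT NULL
    """).result()
```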
I think it would be really nice if Outscraper could go a little further toward providing a complete solution, but I understand, too, that it’s the user’s problem… still, it’s nice when vendors handle user problems in a real-world-friendly way.
I think I have read that a zip of two related Parquet files might be a nice way to solve this.
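If an export ever did arrive as a zip of two related Parquet files, reading the pair back would be straightforward; a sketch with pyarrow, using hypothetical member names:

```python
import io
import zipfile
import pyarrow.parquet as pq

with zipfile.ZipFile("outscraper_export.zip") as z:  # hypothetical archive
    businesses = pq.read_table(io.BytesIO(z.read("businesses.parquet")))
    contacts = pq.read_table(io.BytesIO(z.read("contacts.parquet")))

# The two tables would share a key (e.g. google_id) for joining downstream.
print(businesses.num_rows, contacts.num_rows)
```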

I am going to continue in this thread, because I am only changing the data destination to Postgres.
Same scenario: exporting businesses with email enrichments, which seem to go up to 3 per business… trying to get to an implemented solution, which means, to me, data in a database, with Postgres chosen for being better at geo than MySQL.
At present, our best understanding is to use the Python SDK to do the calls, capture the JSON, and write it to Postgres. As part of this process, we will need to break out the Contacts from the Businesses. Are we strictly in DIY territory here? Is this strategy maintainable?

I do like the looks of Postgres UPSERT, although it would be nice if it were more complete and you didn’t need to name every field individually, since there are so many fields in the Maps Scraper dataset. It would be superb if there were some “magic” package for Python that handles API-to-database auto mapping, including UPSERTs… well, isn’t Google great? It’s here. There is still a lot of work to be done coding the generator for these calls, but that might be doable directly from the JSON, or from an array that maps the JSON keys to the column names. Am I on the right track?
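To make the generator idea concrete, here is a sketch that builds a Postgres INSERT … ON CONFLICT DO UPDATE (UPSERT) directly from a record’s JSON keys using psycopg2, so no field has to be named by hand. It assumes the table’s columns match the JSON keys and that google_id is the unique key; both are my assumptions, not Outscraper specifics:

```python
import psycopg2
from psycopg2 import sql

def upsert(conn, table, row, key="google_id"):
    """Generate and run an UPSERT for one flat JSON record.

    Nested values (lists/dicts) would need json.dumps or their own tables.
    """
    cols = list(row)
    stmt = sql.SQL(
        "INSERT INTO {table} ({cols}) VALUES ({vals}) "
        "ON CONFLICT ({key}) DO UPDATE SET {updates}"
    ).format(
        table=sql.Identifier(table),
        cols=sql.SQL(", ").join(map(sql.Identifier, cols)),
        vals=sql.SQL(", ").join(sql.Placeholder(c) for c in cols),
        key=sql.Identifier(key),
        updates=sql.SQL(", ").join(
            sql.SQL("{c} = EXCLUDED.{c}").format(c=sql.Identifier(c))
            for c in cols
            if c != key
        ),
    )
    with conn.cursor() as cur:
        cur.execute(stmt, row)

# Stand-in for rows returned by the Outscraper Python SDK.
results = [{"google_id": "g123", "name": "Acme Cafe", "site": "acme.example"}]

conn = psycopg2.connect("dbname=scraper")  # hypothetical DSN
for record in results:
    upsert(conn, "businesses", record)
conn.commit()
```

Note that ON CONFLICT requires a unique index, so the google_id column would need a UNIQUE constraint for this to work.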

Hello @kenlyle,

Thank you for your request!

Do you guys split the relationship into Business vs. Contacts, and are they related somehow, in a relational-database way?

Currently, we do not split contacts and companies, but return them as one line, as this is what works for most users. However, it does make sense for what you are trying to do!

For now, I would recommend splitting it while dumping the data into your database.
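To illustrate, a minimal sketch of that split in Python; the repeating column names (name_1 … job_title_3) are assumptions about the enrichment layout, so adjust them to match the real export:

```python
def split_row(row):
    """Split one flat export row into a business record plus contact records."""
    contact_fields = ("name", "email", "job_title")
    business = {
        k: v for k, v in row.items()
        if not any(k.startswith(f"{f}_") for f in contact_fields)
    }
    contacts = []
    for i in (1, 2, 3):  # the enrichment appears to return up to 3 contacts
        contact = {f: row.get(f"{f}_{i}") for f in contact_fields}
        if contact["email"]:
            contact["google_id"] = row["google_id"]  # foreign key to the business
            contacts.append(contact)
    return business, contacts
```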

Thanks, Vlad, for acknowledging.
Please think on this.
For relational database destinations, it makes sense to split the data on your side, optionally, as a favor to the user.
For JSON database destinations, it almost makes more sense to have a single nested object per business, depending on the database model (see the sketch below).
The bigger picture is that you guys seem really close to being able to feed data directly to my database, which would be awesome.
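As mentioned above, the single nested object for a JSON/document destination might look something like this (a hypothetical shape, not Outscraper’s actual schema):

```python
business = {
    "google_id": "g123",
    "name": "Acme Cafe",
    "contacts": [
        {"name": "Jane Doe", "email": "jane@acme.example", "job_title": "Owner"},
        {"name": "Joe Roe", "email": "joe@acme.example", "job_title": "Manager"},
    ],
}
```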

I have since decided that Parquet seems all smoke, no fire, as it’s still just one table per file.
I am now using the API, and less interested in Parquet as a data format, but still very interested in anything that makes it easier to get results into my database.
