Convert 10,000+ JSON files into one single SQLite db?

Ok so I wanted to build a simple web app that would somehow use githubarchive data. At first I thought of using the BigQuery database and its API; however, my free quota would be used up in just a day.

So, what I've done is download all the 2012/2013 records from the site and extract them into a folder - now I have 10k+ JSON files that I want to use to get data for my app.

So what I want: create a GAE (Python) or Django app to visualize this data. For that, I need to convert the JSON files into a database - I'm leaning towards SQLite, but I'm not sure it's the best choice. I'm not even sure creating a DB is required at all, but that's all I can think of right now.
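For reference, this is roughly what I had in mind for the SQLite route (just a rough, untested sketch; I'm assuming each extracted file holds one JSON event per line, with fields like type, actor and repository.language, since that's what the timeline dumps appear to contain):

    import glob
    import json
    import sqlite3

    conn = sqlite3.connect("gharchive.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS events
                    (type TEXT, actor TEXT, repo TEXT, language TEXT, created_at TEXT)""")

    for path in glob.glob("archive/*.json"):  # folder name is just an example
        with open(path) as f:
            for line in f:  # one JSON event per line
                try:
                    event = json.loads(line)
                except ValueError:
                    continue  # skip malformed lines
                repo = event.get("repository") or {}
                conn.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                             (event.get("type"), event.get("actor"),
                              repo.get("name"), repo.get("language"),
                              event.get("created_at")))
        conn.commit()  # commit once per file to keep each transaction small

    conn.close()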

Does anyone have any suggestions for me?

There are 2 best solutions below


AppEngine is also not free beyond certain quotas. I'm pretty sure that importing 10k+ JSON files and then doing analytics on them will take you well beyond the free quota.

Also, AppEngine has a very limited query language (no cross-entity queries, no joins, only a single inequality filter), which will prevent you from doing any meaningful analytics. Go with BigQuery and swallow the cost, or do the analytics on your own machine and then export the stats to GAE for presentation.

Both GAE and BQ are fast, massively parallel, highly available services that you would want for a production application. If you are on a budget and do not need the high availability, go with an affordable VPS.


Given your constraints, it seems your best bet is to process the files on your local machine, save the queryable results to the AppEngine Datastore, and present them to the user from there. Depending on how well you can reduce the data to prepared, queryable results, you could also consider Cloud SQL (MySQL).
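For example, the prepared results could live in a small Datastore model that the front end queries directly (just a sketch; the kind and property names here are made up for illustration):

    from google.appengine.ext import ndb

    class LanguageStat(ndb.Model):
        """One pre-computed row: a user, a language, and their repo count in it."""
        user = ndb.StringProperty(required=True)
        language = ndb.StringProperty(required=True)
        repo_count = ndb.IntegerProperty(default=0)

    # The front end then only needs cheap, indexed queries, e.g.:
    # LanguageStat.query(LanguageStat.language == 'Python') \
    #             .order(-LanguageStat.repo_count).fetch(100)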

Some proposed steps:

  1. Determine exactly what you want to show or allow to be queried. From that, work out what format of data would make it easiest to do.
    • From your example about the 100 GH users with a repo in a given language: create a table that has the user, the language, and possibly some other fields such as the number of repos.
  2. Process the data files you have and write to the tables and fields you defined in step 1.
    • You'll have to decide the best way to do this in your case. It might be that you write a simple script you run locally that parses each file, tracks some values in memory and writes them to the server when done; it could also be that you import into a local database server (MySQL, PostgreSQL, ...), query it there and write the results to the server, or run some MapReduce jobs (see the sketch after this list).
    • Once you have the data, if your app has an API you could push it to AppEngine through that; if not, the Remote API may be a quick solution.
  3. Design your front-end site to perform the queries you planned for in step 1. Those queries should be relatively inexpensive and simple if you've planned ahead well.
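As a rough illustration of step 2 (untested, and field names such as actor and repository.language are assumptions about the timeline format), a local script could boil the 10k+ files down to one small set of (user, language, repo count) rows:

    import glob
    import json
    from collections import defaultdict

    # (user, language) -> set of repo names seen for that pair
    repos_by_user_lang = defaultdict(set)

    for path in glob.glob("archive/*.json"):
        with open(path) as f:
            for line in f:
                try:
                    event = json.loads(line)
                except ValueError:
                    continue
                repo = event.get("repository") or {}
                user, language = event.get("actor"), repo.get("language")
                if user and language:
                    repos_by_user_lang[(user, language)].add(repo.get("name"))

    # Write the much smaller aggregate out; from here it can be loaded into
    # the Datastore via your app's API or the Remote API.
    with open("user_language_counts.csv", "w") as out:
        for (user, language), repos in sorted(repos_by_user_lang.items()):
            out.write("%s,%s,%d\n" % (user, language, len(repos)))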

Of course there are compromises with this solution, and it won't be trivial to add ad-hoc queries you didn't prepare for (some re-processing and updating of the data will likely be needed). Hopefully, though, you can plan your needs well and serve a reasonable number of requests without things getting slow or expensive.