Bulk Load CSV files into Elasticsearch
Many developers have CSV files or other fixed-length or delimited text files that they need to bulk load into Elasticsearch. Options for importing them include:
- using Logstash to format these and import them into Elasticsearch
- reading and parsing the fixed-length or delimited files with a scripting language such as Python or PHP, then loading them into Elasticsearch using the Elasticsearch API
- using a quick import and mapping service like Flex.io to bulk load these files into Elasticsearch.
Logstash is a great solution for loading log files into Elasticsearch, but you can quickly run into problems when processing files that aren’t logs or aren’t in a log format, particularly if your files don’t include timestamps or other date information. To work around these problems, you can use the Elasticsearch API to upload data into Elasticsearch.
Working with the Elasticsearch API does require some developer know-how plus, of course, the effort of setting up the servers, bash scripts, cron jobs, type mappings and other “stuff” to run the scripts that load the data into Elasticsearch on a periodic basis. You also need logic to read the files from their various locations and formats and to do the dirty work of cleaning them.
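To give a feel for the roll-your-own route, here is a minimal Python sketch of one piece of it: turning rows into the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts. The function name, the `contacts` index name and the sample row are made up for illustration; a real script would also need the file reading, cleaning, scheduling and error handling described above.

```python
import json


def build_bulk_body(rows, index_name):
    """Serialize rows into newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for row in rows:
        # Action line: index the next document into index_name
        # (no _id given, so Elasticsearch generates one).
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up sample row and index name, for illustration only.
sample_rows = [{"surname": "Smith", "city": "Chicago"}]
print(build_bulk_body(sample_rows, "contacts"))
```

Each document takes two lines in the body: one saying what to do and one carrying the data.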
If you’re a busy developer (is there another kind?), you may not want to roll your own solution. Flex.io provides a quick solution to the Elasticsearch bulk load problem. Once you’ve set it up, you can schedule the bulk load to run periodically to continuously upload new data into Elasticsearch.
Build the Pipe: Bulk Load CSV into Elasticsearch
For this example, we’ll take a CSV file (from a URL), reduce the number of fields, make sure that the types map properly, and then upload it to Elasticsearch. Here’s how to do it:
File Input: CSV
First, we’ll take our sample CSV file from the web:
input file: https://raw.githubusercontent.com/flexiodata/data/master/contact-samples/contacts-ltd1.csv
Convert to table
Since we’re planning to do type conversion and mapping for Elasticsearch, we’ll convert this file into a table to use with table-based commands:
convert from: delimited to: table delimiter: comma qualifier: none header: true
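If you were doing this step by hand, a rough Python equivalent might look like the sketch below. It is not how Flex.io implements the convert command; the function name and sample text are assumptions made for illustration. It treats the first line as the header, splits on commas, and leaves quote characters alone to mirror the "qualifier: none" setting.

```python
import csv
import io


def delimited_to_table(text, delimiter=","):
    """Parse delimited text with a header line into a list of row dicts."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        # No text qualifier: quote characters are kept as ordinary data.
        quoting=csv.QUOTE_NONE,
    )
    return list(reader)


# Made-up sample text, not the contents of the real contacts file.
sample = "surname,city,birthday\nSmith,Chicago,4/12/1975\n"
table = delimited_to_table(sample)
print(table)
```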
Reduce number of columns
Here we’ll drop the columns we don’t need in Elasticsearch. To do this, we use the select command to list the columns we want to keep:
select col: surname, emailaddress, streetaddress, city, state, birthday
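In plain Python, the same idea is a projection over each row. The sketch below assumes the rows are dictionaries keyed by column name; the function name and the extra `givenname` column in the sample row are hypothetical.

```python
def select_columns(rows, columns):
    """Keep only the named columns in each row, in the order given."""
    return [{col: row[col] for col in columns} for row in rows]


KEEP = ["surname", "emailaddress", "streetaddress", "city", "state", "birthday"]

# Made-up sample row with one extra column that gets dropped.
sample_rows = [{
    "givenname": "Pat", "surname": "Smith", "emailaddress": "pat@example.com",
    "streetaddress": "1 Main St", "city": "Chicago", "state": "IL",
    "birthday": "4/12/1975",
}]
print(select_columns(sample_rows, KEEP))
```

A missing column raises a KeyError here, which is usually what you want: a bad input file fails loudly rather than loading partial data.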
Changing Field Types
Next, we want to make sure that the data types are set correctly when they’re loaded into Elasticsearch. Here, we use the settype command to set the birthday data type to a date so that we can search entries by birth date:
settype col: birthday type: date
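If you were handling types yourself, two things would need to line up: the values have to be in a date format Elasticsearch recognizes, and the index mapping has to declare the field as a date. The sketch below assumes the birthdays arrive as month/day/year text (we haven’t verified that against the sample file) and rewrites them as ISO 8601 dates. The function name is hypothetical, and the mapping uses the typeless form from Elasticsearch 7 and later.

```python
from datetime import datetime


def set_date_type(rows, column, input_format="%m/%d/%Y"):
    """Return copies of rows with `column` rewritten as an ISO 8601 date string."""
    converted = []
    for row in rows:
        new_row = dict(row)  # leave the caller's row untouched
        parsed = datetime.strptime(row[column], input_format)
        new_row[column] = parsed.date().isoformat()
        converted.append(new_row)
    return converted


# Index mapping that declares birthday as a date (Elasticsearch 7+ syntax, assumed).
BIRTHDAY_MAPPING = {"mappings": {"properties": {"birthday": {"type": "date"}}}}

# Made-up sample row; the input date format is an assumption.
print(set_date_type([{"surname": "Smith", "birthday": "4/12/1975"}], "birthday"))
```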
Output to Elasticsearch
In order to send data to Elasticsearch, you’ll need to create a connection first.
In general, the host follows a form similar to https://12345abc.us-east-1.aws.found.io, the port is generally between 9200 and 9300, and the username and password are the HTTP basic authentication credentials used to authorize access. If your Elasticsearch service is hosted on Amazon Web Services (AWS), you may need to use a proxy server in order to expose this connection with HTTP basic authentication credentials.
Once you have a connection, use its alias in your output command:
output to: my-connection-alias
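For comparison, here is roughly what the same connection details look like when you talk to Elasticsearch directly: a POST to the `_bulk` endpoint carrying an HTTP basic Authorization header and a newline-delimited JSON body. This is a sketch using only Python’s standard library. The function name, port 9243, the credentials and the `contacts` index are placeholders, and the request is built but deliberately not sent.

```python
import base64
import urllib.request


def build_bulk_request(host, port, username, password, body):
    """Build (but do not send) an authenticated POST to the _bulk endpoint."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        f"{host}:{port}/_bulk",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            # Bulk bodies are newline-delimited JSON.
            "Content-Type": "application/x-ndjson",
        },
        method="POST",
    )


# Placeholder body, host, port and credentials, for illustration only.
sample_body = '{"index": {"_index": "contacts"}}\n{"surname": "Smith"}\n'
request = build_bulk_request(
    "https://12345abc.us-east-1.aws.found.io", 9243, "user", "secret", sample_body
)
print(request.get_method(), request.full_url)
# Passing the request to urllib.request.urlopen() would send it; that call is
# left out so the sketch runs without a live cluster.
```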
That’s it. Run the pipe to pull the data from the CSV and bulk load it into Elasticsearch.
Deploy the Pipe
Now that you have your pipe, you can deploy and automate it as desired:
- Schedule the pipe to run automatically
- Use the JSON with an AJAX call via the API
- Call the pipe manually via the Command Line Interface