-
Traject::Macros::Marc21Semantics.publication_date
now gets date from 264 before 260. traject#233 -
Allow hashie 4.x in gemspec traject#234
-
Allow
http
gem 4.x versions. traject#236 -
Can now call class-level Indexer.configure multiple times sciencehistory/scihist_digicoll#525
-
NokogiriReader has a "nokogiri.strict_mode" setting. Set to true or string 'true' to ask Nokogori to parse in strict mode, so it will immediately raise on ill-formed XML, instead of nokogiri's default to do what it can with it. traject#226
-
SolrJsonWriter
-
Utility method
delete_all!
sends a delete all query to the Solr URL endpoint. traject#227 -
Allow basic auth configuration of the default http client via
solr_writer.basic_auth_user
andsolr_writer.basic_auth_password
. traject#231
-
-
Context#add_output is added, convenient for custom ruby code.
each_record do |record, context| context.add_output "key", something_from(record) end
-
SolrJsonWriter
-
Class-level indexer configuration, for custom indexer subclasses, now available with class-level
configure
method. Warning, Indexers are still expensive to instantiate though. traject#213 -
SolrJsonWriter has new settings to control commit semantics.
solr_writer.solr_update_args
andsolr_writer.commit_solr_update_args
, both have hash values that are Solr update handler query params. traject#215 -
SolrJsonWriter has a
delete(solr-unique-key)
method. Does not currently use any batching or threading. traject#214 -
SolrJsonWriter, when MaxSkippedRecordsExceeded is raised, it will have a #cause that is the last error, which resulted in MaxSkippedRecordsExceeded. Some error reporting systems, including Rails, will automatically log #cause, so that's helpful. traject#216
-
SolrJsonWriter now respects a
solr_writer.http_timeout
setting, in seconds, to be passed to HTTPClient instance. traject#219 -
Only runs thread pool shutdown code (and logging) if there is a
solr_writer.batch_size
greater than 0. Keep it out of the logs if it was a no-op anyway. -
Logs at DEBUG level every time it sends an update request to solr
-
-
Nokogiri dependency for the NokogiriReader increased to
~> 1.9
. When using Jrubyeach_record_xpath
, resulting yielded documents may have xmlns declarations on different nodes than in MRI (and previous versions of nokogiri), but we could find now way around this with nokogiri >= 1.9.0. The documents should still be semantically equivalent for namespace use. This was necessary to keep JRuby Nokogiri XML working with recent Nokogiri releases. traject#209 -
LineWriter guesses better about when to auto-close, and provides an optional explicit setting in case it guesses wrong. (thanks @justinlittman) traject#211
-
Traject::Indexer will now use a Logger(-compatible) instance passed in in setting 'logger' traject#217
-
JRuby traject no longer includes
traject-marc4j_reader
as a dependency or default reader, although it may provide faster MARC-XML reading on JRuby. To use it manually, see https://github.com/traject/traject-marc4j_reader . See traject#187 -
map_record
now returnsnil
if record was skipped. -
The
Traject::Indexer
class no longer includes marc-specific settings and modules.- If you are using command-line
traject
, this should make no difference to you, as command-line now defaults to the newTraject::Indexer::MarcIndexer
with those removed things. - If you are using Traject::Indexer programmatically and want those features, switch to using
Traject::Indexer::MarcIndexer
. - If neccessary, as a hopefully temporary backwards compat shim, call
Traject::Indexer.legacy_marc_mode!
, which injects the old marc-specific behavior into Traject::Indexer again, globally and permanently.
- If you are using command-line
-
Traject::Indexer::Settings no longer has it's own global defaults, Instead it can be given a set of defaults with #with_defaults, usually right after instantiation. To support different defaults for different Indexers.
-
SolrJsonWriter now assumes an /update/json convenience url is available in solr instead of trying to verify it. If you are using an older Solr (before 4?) or otherwise want a different update url, just use setting
solr.update_url
-
Traject::Indexer#configure is available, and recommended instead of raw
instance_eval
. It just does an instance_eval, but is clearer and safer for future changes. -
traject command line can now take multiple input files. And underlying it, Traject::Indexer#process can take an array of input streams.
-
There is now a built-in mode for XML source records, see docs at xml.md
-
new setting
mapping_rescue
is available, to supply custom logic for handling errors. See docs at settings.md -
Call Traject::ThreadPool.disable_concurrency! to force all pool sizes to be 0, and work to be performed inline. All threading will be disabled.
-
to_field
can now take an array as a first argument, to send values to multiple fields mentioned, eg:to_field ["field1", "field2"], extract_marc("240")
-
to_field
can take multiple transformation procs (all with the same form). traject#153 -
There is a new set of standard transformation macros included in
Traject::Indexer
, from Traject::Macros::Transformation. It includes an extraction of previous/existing arguments frommarc_extract
, along with some additional stuff. , in [Traject::Macros::Transformations]. traject#154-
This is the new preferred way to do post-processing with the
marc_extract
options, but the existing options are not deprecated and there is no current plan for them to be removed. -
before:
to_field "some_field", extract_marc("800", translation_map: "marc_800_map", allow_duplicates: true, first: true, default: "default value")
-
now preferred:
to_field "some_field", extract_marc("800", allow_duplicates: true), translation_map("marc_800_map"), first_only, default("default value")
(still need
allow_duplicates: true
cause extract_marc defaults to false, but see alsounique
macro) -
So, these transformation steps can now be used with non-MARC formats as well. See also new transformation macros:
strip
,split
,append
,prepend
,gsub
, andtransform
. And for MARC use,trim_punctuation
.
-
-
Traject::Indexer new api, for more convenient programmatic/embedded use.
-
Traject::Indexer.new
takes a block for config -
Traject::Indexer#process_record
-
Traject::Indexer#process_with
-
Traject::Indexer#complete
and#run_after_processing_steps
public API.
-
-
Traject::SolrJsonWriter#flush
, flush to solr without closing, may be useful for direct programmatic use. -
Traject::Indexer sub-classes can implement a #source_record_id_proc, which is passed to Context, for source-format-specific logic for getting an ID to use in logging.
-
command line takes an
-i
flag for choice of indexer.
- Totally internal change to provide easier hooks into indexing process
- Further squash use of capture-variabels ('$1', etc.) to try to work around the non-thread-safety of regexp in ruby
- Fix a bug in trim_punctuation where trailing periods were being eliminated even if there was a short string before them (e.g., 'Jr.')
- Begin to reorganize tests, starting with the Marc21 macros
- Change to
extract_marc
to work around a threadsafe problem in JRuby/MRI where regexps were unsafely shared between threads. (@codeforkjeff) - Make trim-punctuation safe for non-just-ASCII text (thanks to @dunn and @redlibrarian)
- Update README with more info about new nil-related options
-
Allow nil values, empty fields, and deduplication
This adds three new settings (all of whose defaults reflect current behavior)
allow_nil_values
(default: false). Allow nil values to be sent on to the writerallow_duplicate_values
(default: true). Allow duplicate values. Set to false to force only unique values.allow_empty_fields
(default: false). Default behavior is that the output hash doesn't even contain keys for ato_field
which doesn't produce any values. Set totrue
to pass empty fields on to the writer (with the value being an empty array)
- Had inadvertently broken use of arrays as extract_marc specifications. Fixed.
- Change DebugWriter to be more forgiving (and informative) about missing record-id fields
- Automatically require DebugWriter for easier use on the command line
- Refactor MarcExtractor to be easier to read
- Fix .travis file to actually work, and target more recent rubies.
- update some docs (typos)
- Make the indexer's
writer
r/w so it can be set at runtime (#110) - Allow
extract_marc
to be callable from anywhere (#111) - Add doc instructions/examples for programmatic Indexer use
- Much better error reporting; easier to find which record went wrong
- Guard against assumption of MARC data when indexing using SolrJsonWriter (#94)
- For MARC Records, try to use the production date when available (#93)
- Fix bad constant in logging (#91)
- Compatible with MRI/RBX
- Default to SolrJsonWriter
- Release separate MRI/JRuby gems
- First release