bd82145626Merge branch 'diffbot-testing' into testing
mwells
2014-04-05 12:34:46 -07:00
89f5c8c059Merge branch 'diffbot-matt' into diffbot-testing
mwells
2014-04-05 11:34:27 -07:00
61b4ec4ca6added some qa testing logic. qa.cpp.
mwells
2014-04-05 11:33:42 -07:00
0988a134d0Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
Daniel Steinberg
2014-04-01 19:48:24 -07:00
4856cc4c60||, not &&
Daniel Steinberg
2014-04-01 10:45:54 -07:00
3e38bd169eand return an error
Daniel Steinberg
2014-04-01 10:43:17 -07:00
94b169b8dconly delete if there were no io errors
Daniel Steinberg
2014-04-01 10:42:12 -07:00
6568858e81implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices
Daniel Steinberg
2014-03-31 20:44:39 -07:00
d6434191d1nomenclature changes to reduce collisions. name collection 'qatest123' for doing smoke tests, not 'test'.
Matt Wells
2014-03-31 15:02:17 -07:00
9c8410767dfix critical title alloc/free bug in title.cpp.
Matt Wells
2014-03-28 08:01:01 -07:00
c1671015c8Merge branch 'diffbot-dan' into diffbot-testing
Matt Wells
2014-03-27 12:19:50 -07:00
582349334fdo not use certain other json fields when computing checksum for deduping. like stats, querystring, ...
Matt Wells
2014-03-27 12:20:53 -07:00
402377d2e6fix bug of gbmin, gbmax etc. not working. floats were being rounded down to ints in most cases it seems. so .9 -> 0 etc.
Matt Wells
2014-03-26 11:56:06 -07:00
d67f09feebalso include a timestamp field with an RFC 1123 formatted date
Daniel Steinberg
2014-03-25 21:45:21 -07:00
0efac8c156Defect #2080: seed URLs duplicated
Daniel Steinberg
2014-03-25 17:25:55 -07:00
e1b1b15a38bigger buffer
Daniel Steinberg
2014-03-25 16:34:40 -07:00
9846061dffwhen restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder
Daniel Steinberg
2014-03-25 16:20:24 -07:00
ab90c06d8dadd TODO for regex checking
Daniel Steinberg
2014-03-25 13:05:43 -07:00
1ff6c1fae0Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
Daniel Steinberg
2014-03-25 12:53:37 -07:00
b8836745f0use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122)
Daniel Steinberg
2014-03-25 12:51:08 -07:00
b6e5424e32do not download bulkjob urls in crawlbot. just return a fake http reply. however, do use crawl-delay throttling logic. deduping is already turned off for bulk jobs so it should be ok.
mwells
2014-03-21 12:40:38 -07:00
b33121af7dmake all field names lower case without spaces when we hash them to make the prefixhash. since json names often have mixed case field names and spaces.
Matt Wells
2014-03-20 16:08:02 -07:00
98a10d4936Merge branch 'testing' into diffbot-testing
Matt Wells
2014-03-20 15:50:49 -07:00
bbc8fc0c79always show admin link
Matt Wells
2014-03-20 15:48:51 -07:00
67202f3731Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-20 15:39:03 -07:00
99bd9319fdtemp hack to reduce network comm between trinity and neo
Matt Wells
2014-03-20 15:42:34 -07:00
5ed19026d9temp debug comments
Matt Wells
2014-03-20 15:33:37 -07:00
b8d0e95035Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-20 10:26:55 -07:00
b31eaee9fdsimple bool queries work
mwells
2014-03-18 12:07:29 -07:00
d4302e3301fix core
Matt Wells
2014-03-18 11:12:50 -07:00
3b97682cc3more bool query fixes
Matt Wells
2014-03-18 10:44:56 -07:00
6e23d37e47Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-17 17:27:28 -07:00
54cc8088fbmore bool query fixes. hopefully this will do it, but still can do some optimizations for speed.
mwells
2014-03-17 17:00:08 -07:00
9d3c35ad17nothing
Matt Wells
2014-03-17 13:53:19 -07:00
4abf56a75dcleanups
Matt Wells
2014-03-16 18:06:22 -07:00
d2511d0befhost table cleanups
Matt Wells
2014-03-16 17:14:47 -07:00
5057fdaf14aesthetic cleanups
Matt Wells
2014-03-16 17:12:04 -07:00
d320bf9d75spidering back on in main's coll.conf
Matt Wells
2014-03-16 15:06:39 -07:00
c513ad9418Merge branch 'diffbot' into testing
Matt Wells
2014-03-16 14:51:22 -07:00
acd05aa740fix a few minor bugs. /master/->/admin/ and crawl type mismatch.
Matt Wells
2014-03-16 10:34:58 -07:00
edbd61b0c5thread fixes. if pthread_create fails then keep thread queue and just return. will try to relaunch later. do not count delete keys towards shard rebalance count.
Matt Wells
2014-03-15 20:07:02 -07:00
5ca411e3e2tuning the rebalance loop
Matt Wells
2014-03-15 14:56:11 -07:00
86147fe22ctight merge during rebalance to save disk space, so neg recs annihilate pos recs.
Matt Wells
2014-03-14 23:37:30 -07:00
6c704f6fdfMerge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-03-14 22:16:40 -07:00
e37eebd76fwhen rebalancing wait for merge to complete before scanning more
Matt Wells
2014-03-14 22:16:25 -07:00
82ac3fab6cmerge fixes
Matt Wells
2014-03-14 22:15:08 -07:00
df46a6fc1dMerge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot-matt
Matt Wells
2014-03-14 19:32:10 -07:00
1f162ce7b2update localhosts.conf too
Matt Wells
2014-03-14 19:20:23 -07:00
553aefdb55keep files tightly merged when rebalancing to avoid running out of disk space
Matt Wells
2014-03-14 19:19:41 -07:00
cb483c42eamore fixes for bool searching before using a slightly different and simpler approach
mwells
2014-03-13 16:00:23 -07:00
7812f5c746more bool fixes. still needs a little more work
mwells
2014-03-13 13:54:23 -07:00
3b2d981dffmore fixes for new boolean logic.
mwells
2014-03-13 13:09:33 -07:00
fb0123ad53nothing
Matt Wells
2014-03-13 11:27:28 -07:00
9acb7ef0f4fix core &token= core
Matt Wells
2014-03-13 07:57:06 -07:00
7b5816f194updated error message
Daniel Steinberg
2014-03-12 20:56:27 -07:00
018258bcaaMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-12 20:55:21 -07:00
fbd1bcd349initial attempt at new boolean query logic. supports unlimited # of boolean query terms. already docid phased from phasing logic already there but could be phased more to save more mem and speed up a little more.
Matt Wells
2014-03-12 20:53:44 -07:00
3e7243c6cefix add url core
Matt Wells
2014-03-12 08:28:42 -07:00
34f7540160fix addurl core
Matt Wells
2014-03-12 08:11:48 -07:00
7ec1513d41updates
Matt Wells
2014-03-12 08:09:45 -07:00
f27d549fc6Defect #2122: If a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain
Daniel Steinberg
2014-03-11 19:46:38 -07:00
85a5954256only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday
Daniel Steinberg
2014-03-11 18:52:14 -07:00
c81bbf6934more informative error message
Daniel Steinberg
2014-03-11 18:10:21 -07:00
14c1b2efa3more informative error message
Daniel Steinberg
2014-03-11 18:06:42 -07:00
312438a32bMerge branch 'diffbot-dan' into diffbot-testing
Matt Wells
2014-03-11 17:02:59 -07:00
84784d8d76minor fixups
Matt Wells
2014-03-11 17:02:24 -07:00
2331b4673dDefect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around)
Daniel Steinberg
2014-03-11 16:21:58 -07:00
8445e53c61fix query reindex some more
Matt Wells
2014-03-11 14:46:49 -07:00
c4b38a5c72fix a few cores from previous code updates
Matt Wells
2014-03-11 09:36:33 -07:00
5c2e78e5faMerge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-10 20:26:30 -07:00
483f3c5baefix core
Matt Wells
2014-03-10 18:17:28 -07:00
f9fdc96563no use in newline separating the list of urls if they're going to be read back in and need to be space separated
Daniel Steinberg
2014-03-10 15:22:43 -07:00
e293d465a3snprintf instead of sprintf
Daniel Steinberg
2014-03-10 14:03:28 -07:00
41e3988fbcnot a conf file
Daniel Steinberg
2014-03-10 13:57:13 -07:00
4a7bf5d4d0Story #2040: store raw URL submissions for customer bulk jobs
Daniel Steinberg
2014-03-10 13:50:30 -07:00
bfcb7082f4fix bug from nuking doledb on a new collection.
Matt Wells
2014-03-10 13:48:00 -07:00
bd4484db3cMerge branch 'testing' into diffbot-testing
Matt Wells
2014-03-10 12:08:23 -07:00
9debee20dcMerge branch 'diffbot' into testing
Matt Wells
2014-03-09 20:44:09 -07:00
662b6d4b32doc updates
Matt Wells
2014-03-09 20:43:49 -07:00
90ff2c2a25update example site lists
Matt Wells
2014-03-09 20:35:45 -07:00
82db7240a3simple print update
Matt Wells
2014-03-09 19:43:32 -07:00
f7b7274ff1replace "exact:" directive with "seed:" really the same thing.
Matt Wells
2014-03-09 19:35:20 -07:00
f8e561e6f4more new site list api fixes
Matt Wells
2014-03-09 18:15:57 -07:00
11e8c16878new site list updates
Matt Wells
2014-03-09 17:53:24 -07:00
ed626b162amore site list based spider fixes to be more like gsa
Matt Wells
2014-03-08 20:52:31 -07:00
aab165ed20fix bad return value from function
Matt Wells
2014-03-08 19:32:56 -08:00
4cb66c31bfget this new api spidering
Matt Wells
2014-03-08 12:02:20 -07:00
624c1d4e68nuke doledb fixes
Matt Wells
2014-03-08 10:51:15 -07:00