0988a134d0
Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
Daniel Steinberg
2014-04-01 19:48:24 -07:00
4856cc4c60
||, not &&
Daniel Steinberg
2014-04-01 10:45:54 -07:00
3e38bd169e
and return an error
Daniel Steinberg
2014-04-01 10:43:17 -07:00
94b169b8dc
only delete if there were no io errors
Daniel Steinberg
2014-04-01 10:42:12 -07:00
6568858e81
implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices
Daniel Steinberg
2014-03-31 20:44:39 -07:00
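
Note on the five commits above: taken together they describe an mv-like move that tries rename first, copies the bytes if rename fails (rename does not work across devices), returns an error on failure, and deletes the source only if there were no I/O errors. The following is a minimal sketch of that idea, not the repository's code; the function name `moveFile` and the buffer size are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper (not taken from the repository): move src to dst
// the way mv does. Returns true on success, false on any error.
bool moveFile(const std::string &src, const std::string &dst) {
    // Fast path: rename succeeds when both paths are on the same device.
    if (std::rename(src.c_str(), dst.c_str()) == 0) return true;

    // Fallback: copy the bytes, e.g. when rename fails across devices.
    std::ifstream in(src, std::ios::binary);
    if (!in) return false;
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    char buf[64 * 1024];  // assumed buffer size
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        out.write(buf, in.gcount());
        if (!out) return false;
    }
    out.flush();

    // Fail if either side reported an error (note: ||, not &&).
    if (in.bad() || !out) return false;
    out.close();
    in.close();

    // Delete the source only after a copy with no I/O errors.
    return std::remove(src.c_str()) == 0;
}
```

If the copy fails partway, this sketch leaves the source file in place, so a failed move does not lose data.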
d6434191d1
nomenclature changes to reduce collisions. name collection 'qatest123' for doing smoke tests, not 'test'.
Matt Wells
2014-03-31 15:02:17 -07:00
9c8410767d
fix critical title alloc/free bug in title.cpp.
Matt Wells
2014-03-28 08:01:01 -07:00
c1671015c8
Merge branch 'diffbot-dan' into diffbot-testing
Matt Wells
2014-03-27 12:19:50 -07:00
582349334f
do not use certain other json fields when computing checksum for deduping. like stats, querystring, ...
Matt Wells
2014-03-27 12:20:53 -07:00
402377d2e6
fix bug of gbmin, gbmax etc. not working. floats were being rounded down to ints in most cases it seems. so .9 -> 0 etc.
Matt Wells
2014-03-26 11:56:06 -07:00
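
Note on the commit above: the message says float values were being rounded down to ints, so .9 became 0. The sketch below only illustrates that class of bug by parsing the same text as an integer and as a float. `parseAsInt` and `parseAsFloat` are hypothetical names, and the actual gbmin/gbmax parsing code is not shown in this log.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical: integer parsing stops at the '.', so ".9" and "0.9" yield 0.
long parseAsInt(const char *s) {
    return std::strtol(s, nullptr, 10);
}

// Hypothetical: floating-point parsing keeps the fractional part.
double parseAsFloat(const char *s) {
    return std::strtod(s, nullptr);
}
```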
d67f09feeb
also include a timestamp field with an RFC 1123 formatted date
Daniel Steinberg
2014-03-25 21:45:21 -07:00
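
Note on the commit above: an RFC 1123 date looks like "Thu, 01 Jan 1970 00:00:00 GMT". The sketch below shows one way to produce that format with the standard library. `formatRfc1123` is a hypothetical name, not the function used in the repository, and it assumes the default "C" locale so that day and month names come out in English.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Hypothetical helper: format a Unix timestamp as an RFC 1123 date in GMT.
// std::gmtime uses a shared static buffer, so this is not thread-safe.
std::string formatRfc1123(std::time_t t) {
    const std::tm *utc = std::gmtime(&t);
    if (!utc) return std::string();
    char buf[64];
    std::size_t n = std::strftime(buf, sizeof(buf),
                                  "%a, %d %b %Y %H:%M:%S GMT", utc);
    return std::string(buf, n);
}
```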
e1b1b15a38
bigger buffer
Daniel Steinberg
2014-03-25 16:34:40 -07:00
9846061dff
when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder
Daniel Steinberg
2014-03-25 16:20:24 -07:00
ab90c06d8d
add TODO for regex checking
Daniel Steinberg
2014-03-25 13:05:43 -07:00
1ff6c1fae0
Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
Daniel Steinberg
2014-03-25 12:53:37 -07:00
b8836745f0
use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122)
Daniel Steinberg
2014-03-25 12:51:08 -07:00
b6e5424e32
do not download bulkjob urls in crawlbot. just return a fake http reply. however, do use crawl-delay throttling logic. deduping is already turned off for bulk jobs so it should be ok.
mwells
2014-03-21 12:40:38 -07:00
b33121af7d
make all field names lower case without spaces when we hash them to make the prefixhash. since json names often have mixed case field names and spaces.
Matt Wells
2014-03-20 16:08:02 -07:00
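
Note on the commit above: lower-casing field names and removing spaces before hashing means that names such as "Product Name" and "productname" map to the same prefix hash. A minimal sketch follows, assuming `std::hash` as a stand-in for the engine's own hash function, which this log does not show; `normalizeFieldName` and `prefixHash` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: lower-case a JSON field name and drop whitespace.
std::string normalizeFieldName(const std::string &name) {
    std::string out;
    for (unsigned char c : name) {
        if (std::isspace(c)) continue;
        out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Stand-in hash (an assumption): hash the normalized name, so mixed-case
// and spaced variants of a field name collide on purpose.
std::size_t prefixHash(const std::string &name) {
    return std::hash<std::string>{}(normalizeFieldName(name));
}
```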
98a10d4936
Merge branch 'testing' into diffbot-testing
Matt Wells
2014-03-20 15:50:49 -07:00
bbc8fc0c79
always show admin link
Matt Wells
2014-03-20 15:48:51 -07:00
67202f3731
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-20 15:39:03 -07:00
99bd9319fd
temp hack to reduce network comm between trinity and neo
Matt Wells
2014-03-20 15:42:34 -07:00
5ed19026d9
temp debug comments
Matt Wells
2014-03-20 15:33:37 -07:00
b8d0e95035
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-20 10:26:55 -07:00
ca0843aa8b
more bool query fixes.
mwells
2014-03-20 10:03:25 -07:00
cfbec626e8
more righteous fixes for bool queries
mwells
2014-03-19 13:51:32 -07:00
ab3368b5a0
more bool fixes. not operator support.
mwells
2014-03-19 09:38:45 -07:00
1bb91149d6
more bool fixes
mwells
2014-03-18 14:42:50 -07:00
652892dc10
more bool fixes
mwells
2014-03-18 14:37:59 -07:00
b7d80fd02d
more bool query fixes
mwells
2014-03-18 13:41:36 -07:00
b31eaee9fd
simple bool queries work
mwells
2014-03-18 12:07:29 -07:00
d4302e3301
fix core
Matt Wells
2014-03-18 11:12:50 -07:00
3b97682cc3
more bool query fixes
Matt Wells
2014-03-18 10:44:56 -07:00
6e23d37e47
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-17 17:27:28 -07:00
54cc8088fb
more bool query fixes. hopefully this will do it, but still can do some optimizations for speed.
mwells
2014-03-17 17:00:08 -07:00
9d3c35ad17
nothing
Matt Wells
2014-03-17 13:53:19 -07:00
4abf56a75d
cleanups
Matt Wells
2014-03-16 18:06:22 -07:00
d2511d0bef
host table cleanups
Matt Wells
2014-03-16 17:14:47 -07:00
5057fdaf14
aesthetic cleanups
Matt Wells
2014-03-16 17:12:04 -07:00
d320bf9d75
spidering back on in main's coll.conf
Matt Wells
2014-03-16 15:06:39 -07:00
c513ad9418
Merge branch 'diffbot' into testing
Matt Wells
2014-03-16 14:51:22 -07:00
acd05aa740
fix a few minor bugs. /master/->/admin/ and crawl type mismatch.
Matt Wells
2014-03-16 10:34:58 -07:00
edbd61b0c5
thread fixes. if pthread_create fails then keep thread queue and just return. will try to relaunch later. do not count delete keys towards shard rebalance count.
Matt Wells
2014-03-15 20:07:02 -07:00
5ca411e3e2
tuning the rebalance loop
Matt Wells
2014-03-15 14:56:11 -07:00
86147fe22c
tight merge during rebalance to save disk space, so neg recs annihilate pos recs.
Matt Wells
2014-03-14 23:37:30 -07:00
6c704f6fdf
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-03-14 22:16:40 -07:00
e37eebd76f
when rebalancing wait for merge to complete before scanning more
Matt Wells
2014-03-14 22:16:25 -07:00
82ac3fab6c
merge fixes
Matt Wells
2014-03-14 22:15:08 -07:00
df46a6fc1d
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot-matt
Matt Wells
2014-03-14 19:32:10 -07:00
1f162ce7b2
update localhosts.conf too
Matt Wells
2014-03-14 19:20:23 -07:00
553aefdb55
keep files tightly merged when rebalancing to avoid running out of disk space
Matt Wells
2014-03-14 19:19:41 -07:00
cb483c42ea
more fixes for bool searching before using a slightly different and simpler approach
mwells
2014-03-13 16:00:23 -07:00
7812f5c746
more bool fixes. still needs a little more work
mwells
2014-03-13 13:54:23 -07:00
3b2d981dff
more fixes for new boolean logic.
mwells
2014-03-13 13:09:33 -07:00
fb0123ad53
nothing
Matt Wells
2014-03-13 11:27:28 -07:00
9acb7ef0f4
fix core &token= core
Matt Wells
2014-03-13 07:57:06 -07:00
7b5816f194
updated error message
Daniel Steinberg
2014-03-12 20:56:27 -07:00
018258bcaa
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-12 20:55:21 -07:00
fbd1bcd349
initial attempt at new boolean query logic. supports unlimited # of boolean query terms. already docid phased from phasing logic already there but could be phased more to save more mem and speed up a little more.
Matt Wells
2014-03-12 20:53:44 -07:00
3e7243c6ce
fix add url core
Matt Wells
2014-03-12 08:28:42 -07:00
34f7540160
fix addurl core
Matt Wells
2014-03-12 08:11:48 -07:00
7ec1513d41
updates
Matt Wells
2014-03-12 08:09:45 -07:00
f27d549fc6
Defect #2122: If it is a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain
Daniel Steinberg
2014-03-11 19:46:38 -07:00
85a5954256
only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday
Daniel Steinberg
2014-03-11 18:52:14 -07:00
c81bbf6934
more informative error message
Daniel Steinberg
2014-03-11 18:10:21 -07:00
14c1b2efa3
more informative error message
Daniel Steinberg
2014-03-11 18:06:42 -07:00
312438a32b
Merge branch 'diffbot-dan' into diffbot-testing
Matt Wells
2014-03-11 17:02:59 -07:00
84784d8d76
minor fixups
Matt Wells
2014-03-11 17:02:24 -07:00
2331b4673d
Defect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around)
Daniel Steinberg
2014-03-11 16:21:58 -07:00
8445e53c61
fix query reindex some more
Matt Wells
2014-03-11 14:46:49 -07:00
c4b38a5c72
fix a few cores from previous code updates
Matt Wells
2014-03-11 09:36:33 -07:00
5c2e78e5fa
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-10 20:26:30 -07:00
483f3c5bae
fix core
Matt Wells
2014-03-10 18:17:28 -07:00
f9fdc96563
no use in newline separating the list of urls if they're going to be read back in and need to be space separated
Daniel Steinberg
2014-03-10 15:22:43 -07:00
e293d465a3
snprintf instead of sprintf
Daniel Steinberg
2014-03-10 14:03:28 -07:00
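
Note on the commit above: unlike sprintf, snprintf takes the destination size and never writes past it, and its return value tells the caller whether the output was truncated. A small sketch of that check follows; `buildPath` and the "dir/file" format are assumptions for illustration, not the code this commit changed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: write "dir/file" into dst without overflowing it.
// Returns false if formatting failed or the result had to be truncated.
bool buildPath(char *dst, std::size_t dstSize,
               const char *dir, const char *file) {
    int n = std::snprintf(dst, dstSize, "%s/%s", dir, file);
    // n is the length the full string would need, excluding the NUL.
    return n >= 0 && static_cast<std::size_t>(n) < dstSize;
}
```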
41e3988fbc
not a conf file
Daniel Steinberg
2014-03-10 13:57:13 -07:00
4a7bf5d4d0
Story #2040: store raw URL submissions for customer bulk jobs
Daniel Steinberg
2014-03-10 13:50:30 -07:00
bfcb7082f4
fix bug from nuking doledb on a new collection.
Matt Wells
2014-03-10 13:48:00 -07:00
bd4484db3c
Merge branch 'testing' into diffbot-testing
Matt Wells
2014-03-10 12:08:23 -07:00
9debee20dc
Merge branch 'diffbot' into testing
Matt Wells
2014-03-09 20:44:09 -07:00
662b6d4b32
doc updates
Matt Wells
2014-03-09 20:43:49 -07:00
90ff2c2a25
update example site lists
Matt Wells
2014-03-09 20:35:45 -07:00
82db7240a3
simple print update
Matt Wells
2014-03-09 19:43:32 -07:00
f7b7274ff1
replace "exact:" directive with "seed:" really the same thing.
Matt Wells
2014-03-09 19:35:20 -07:00
f8e561e6f4
more new site list api fixes
Matt Wells
2014-03-09 18:15:57 -07:00
11e8c16878
new site list updates
Matt Wells
2014-03-09 17:53:24 -07:00
ed626b162a
more site list based spider fixes to be more like gsa
Matt Wells
2014-03-08 20:52:31 -07:00
aab165ed20
fix bad return value from function
Matt Wells
2014-03-08 19:32:56 -08:00
4cb66c31bf
get this new api spidering
Matt Wells
2014-03-08 12:02:20 -07:00